CN110399458B - Text similarity calculation method based on latent semantic analysis and random projection - Google Patents
- Publication number
- CN110399458B (application number CN201910598004.7A)
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text similarity calculation method based on latent semantic analysis (LSA) and random projection (RP), which is suitable for general unsupervised text clustering problems. The method first converts the tag text to be processed into a bag-of-words model and weights it with the TF-IDF algorithm to obtain a set of weight vectors. It then processes the weight vector set with the LSA algorithm to obtain an LSA index library, and with the random projection algorithm to obtain an RP index library. Finally, after TF-IDF processing, the corpus to be calculated is processed with the LSA and RP algorithms respectively and compared against the contents of the index libraries to obtain the text similarity. The invention can calculate an effective similarity of text content and recommend related content through texts with higher similarity.
Description
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a text similarity calculation method based on latent semantic analysis and random projection.
Background
In conventional text recommendation algorithms, researchers typically perform similarity calculation with a plain vector space model or the like. By exploring the semantic associations among texts, the invention combines latent semantic analysis with random projection to provide a reliable text similarity calculation method for related systems.
Latent semantic analysis:
latent semantic analysis is an algebraic model of information retrieval and a computational theory and method for knowledge acquisition and representation. It analyzes a large text collection with statistical methods to extract the latent semantic structure among words, and uses this structure to represent words and texts, thereby eliminating the correlation among words and simplifying the text vectors to achieve dimensionality reduction.
Random projection:
random projection is a simple and effective dimensionality reduction method that differs from feature extraction methods based on the data's appearance: the projection is completely independent of the original sample dataset and does not cause significant distortion of the data. The reduced data still retains the important feature information contained in the original high-dimensional data, and the transformation matrix is obtained without any matrix decomposition, which greatly improves the ability to process data in real time.
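As a minimal numeric illustration of this property, the following Python sketch (dimensions, seed, and data are made-up toy values, not taken from the invention) projects a few high-dimensional vectors through a Gaussian random matrix and checks that a pairwise distance is roughly preserved, with no matrix decomposition involved:

```python
import numpy as np

# Toy random projection: reduce 1000-dim vectors to 100 dims with a Gaussian
# random matrix R; pairwise distances are approximately preserved and no
# matrix decomposition is required (all values here are illustrative).
rng = np.random.default_rng(42)
d, k, n = 1000, 100, 5
X = rng.normal(size=(n, d))                # original high-dimensional data
R = rng.normal(size=(d, k)) / np.sqrt(k)   # random transformation matrix
Y = X @ R                                  # reduced representation

orig = np.linalg.norm(X[0] - X[1])         # distance before projection
proj = np.linalg.norm(Y[0] - Y[1])         # distance after projection
print(Y.shape, round(proj / orig, 2))
```

The distance ratio concentrates around 1 as the target dimension grows, which is the Johnson-Lindenstrauss behavior the passage alludes to.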
For the traditional text similarity comparison problem, existing papers mainly use simple common-word counts for similarity comparison, but this approach performs poorly on the semantic and topical associations among documents.
Disclosure of Invention
The invention aims to: in view of the above problems, the invention provides a text similarity calculation method based on latent semantic analysis and random projection, which overcomes the limitations of traditional calculation methods, reduces the dimensionality of the vector space in several ways, and effectively improves the accuracy and reliability of the results.
The technical scheme is as follows: the invention provides a text similarity calculation method based on latent semantic analysis and random projection, comprising the following steps:
(1) vectorize the tags to obtain the entry tag vector set V1, and apply the TF-IDF algorithm to obtain the tag weight vector set V2;
(2) apply the LSA algorithm to V2 to obtain the LSA model M1 and the index library I1;
(3) apply the random projection algorithm to V2 to obtain the RP model M2 and the index library I2;
(4) process the corpus to be processed with TF-IDF, then apply LSA and RP processing to obtain the final recommendation set.
Further, the specific steps for obtaining the tag weight vector set V2 in step (1) are as follows:
(1.1) define D1 as the encyclopedia entry dataset, D1 = {id1, title1, paragraph1, image1, url1, tag1}, where id1, title1, paragraph1, image1, url1 and tag1 represent the number, title, paragraph, picture link, web page link and entry tag, respectively;
(1.2) apply the split method to tag1 to obtain T1 = {w_i1, w_i2, …, w_in}, where w_iA is the A-th entry tag set of the encyclopedia entry dataset, A ∈ [1, n];
(1.3) apply the Dictionary method to T1 to obtain the dictionary Dict1;
(1.4) save the dictionary Dict1 locally;
(1.5) apply the Doc2Bow method to T1 to obtain the entry tag vector set V1 = {v_i1, v_i2, …, v_in}, where v_iA is the A-th entry tag vector of V1, A ∈ [1, n];
(1.6) apply the TF-IDF method to V1 to obtain the entry tag weight vector set V2 = {v_j1, v_j2, …, v_jn}, where v_jA is the A-th entry tag weight vector of V2, A ∈ [1, n];
(1.7) save the tag weight vector set V2 locally.
Further, the specific steps of obtaining the LSA model M1 and the index library I1 by applying the LSA algorithm to V2 in step (2) are as follows:
(2.1) load the tag weight vector set V3 from local, V3 = {v_k1, v_k2, …, v_kn}, where v_kB is the B-th weight vector of the entry tag weight vector set V3, B ∈ [1, n];
(2.2) load the dictionary Dict2 from local;
(2.3) define id2word = Dict2 and the topic number num_topics = 300;
(2.4) train V3 with the LSA method, passing in the parameters id2word and num_topics to obtain the model M1;
(2.5) process V3 through the model M1 to obtain the packaged corpus C1;
(2.6) build an index library over C1 to obtain the index library I1;
(2.7) save the model M1 and the index library I1.
Further, the specific steps of obtaining the RP model M2 and the index library I2 by applying the random projection algorithm to V2 in step (3) are as follows:
(3.1) load the tag weight vector set V4 from local, V4 = {v_l1, v_l2, …, v_ln}, where v_lC is the C-th weight vector of the entry tag weight vector set V4, C ∈ [1, n];
(3.2) define the topic number num_topics = 500;
(3.3) train V4 with the RP method, passing in the parameter num_topics to obtain the model M2;
(3.4) process V4 through the model M2 to obtain the packaged corpus C2;
(3.5) build an index library over C2 to obtain the index library I2;
(3.6) save the model M2 and the index library I2.
Further, the specific steps for obtaining the final recommendation set in step (4) are as follows:
(4.1) define D2 as the encyclopedia entry test set, D2 = {id2, title2, paragraph2, image2, url2, tag2}, where id2, title2, paragraph2, image2, url2 and tag2 represent the number, title, paragraph, picture link, web page link and entry tag, respectively;
(4.2) taking title2 as input, apply the split method to tag2 to obtain T2 = {w_j1, w_j2, …, w_jn}, where w_jD is the D-th entry tag set of the encyclopedia entry dataset, D ∈ [1, n];
(4.3) apply the Doc2Bow method to T2 to obtain the entry tag vector set V5 = {v_m1, v_m2, …, v_mn}, where v_mE is the E-th entry tag vector of V5, E ∈ [1, n];
(4.4) apply the TF-IDF method to V5 to obtain the entry tag weight vector set V6 = {v_o1, v_o2, …, v_on}, where v_oF is the F-th entry tag weight vector of V6, F ∈ [1, n];
(4.5) define the variable k = 1 as a loop variable for traversing V6;
(4.6) define the sets R1, R2 and R3, R1 = {sim_i1, sim_i2, …, sim_in}, R2 = {sim_j1, sim_j2, …, sim_jn}, and R3 is the empty set; sim_iG and sim_jG denote the G-th similarity sets in R1 and R2, respectively, with initial value null, G ∈ [1, n];
(4.7) import the LSA model M3 and the random projection model M4, and import the LSA index library I3 and the random projection index library I4;
(4.8) if k <= n, go to step (4.9); otherwise go to step (4.14);
(4.9) package v_ok with the LSA method to obtain vec1_k, and package v_ok with the random projection method to obtain vec2_k;
(4.10) retrieve the index library I3 with vec1_k, compute the cosine similarity between vec1_k and the elements of I3, and store the result in sim_ik; retrieve the index library I4 with vec2_k, compute the cosine similarity between vec2_k and the elements of I4, and store the result in sim_jk;
(4.11) add sim_ik and sim_jk element-wise and take the average to obtain sim_lk;
(4.12) insert sim_lk into R3;
(4.13) k = k + 1, go to step (4.8);
(4.14) take the 8 elements with the highest similarity from each set in R3 to form a set and store it in the result set R4; each element of R4 is a recommendation set.
The invention adopts the above technical scheme and has the following beneficial effects:
The method performs text similarity calculation on entries obtained from Baidu Encyclopedia using latent semantic analysis and random projection; it overcomes the limitations of traditional calculation methods, reduces the dimensionality of the vector space in several ways, and effectively improves the accuracy and reliability of the results.
Drawings
FIG. 1 is a general flow chart of the present invention;
FIG. 2 is a specific flow chart of the TF-IDF algorithm in FIG. 1;
FIG. 3 is a specific flow chart of the LSA algorithm in FIG. 1;
FIG. 4 is a specific flow chart of the random projection algorithm in FIG. 1;
FIG. 5 is a specific flow chart of the similarity recommendation function in FIG. 1.
Detailed Description
The present invention is further illustrated below in conjunction with specific embodiments. It should be understood that these embodiments are merely illustrative of the invention and do not limit its scope; modifications of various equivalent forms made by those skilled in the art after reading the invention fall within the scope defined by the claims appended hereto.
As shown in fig. 1 to 5, the text similarity calculation method based on latent semantic analysis and random projection according to the present invention includes the following steps:
step 1: the tag is vectorized to obtain an entry tag vector set V1, and a TF-IDF algorithm is used for the entry tag vector set to obtain a tag weight vector set V2, and the specific method is as follows:
step 1.1: define D1 as an encyclopedic entry dataset, d1= { id1, title1, parameter 1, image1, url1, tag1}, where id1, title1, parameter 1, image1, url1, tag1 represent numbers, titles, paragraphs, picture links, web links, and entry tags, respectively;
step 1.2: t1= { w was obtained by using split method on tag1 i1 ,w i2 ,…,w in },w iA Is the A-th entry tag set of encyclopedic entry data set, wherein the variable A epsilon [1, n];
Step 1.3: dictionary 1 is obtained by using Dictionary method on T1;
step 1.4: storing dictionary Dict1 to local;
step 1.5: obtaining an entry tag vector set V1= { V by using a Doc2Bow method on T1 i1 ,v i2 ,…,v in },v iA Is the A-th entry tag vector of the entry tag vector set V1, where the variables A ε [1, n];
Step 1.6: obtaining a term tag weight vector set V2= { V by performing TF-IDF method on V1 j1 ,v j2 ,…,v jn },v jA Is the A-th term tag weight vector of the term tag weight vector set V2, wherein the variables A E [1, n ]];
Step 1.7: the set of tag weight vectors V2 is saved locally.
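Steps 1.1 to 1.7 can be sketched in Python. The wording of the steps (Dictionary, Doc2Bow) suggests the gensim library, but the sketch below is a dependency-free approximation using only the standard library; the sample tags and intermediate names are illustrative assumptions, not data from the patent:

```python
import math
from collections import Counter

# Dependency-free sketch of Steps 1.1-1.7; the tag strings below are
# made-up examples, not entries from the patent's encyclopedia dataset.
tags = ["qin dynasty history", "han dynasty history", "song poetry"]
T1 = [t.split() for t in tags]                 # Step 1.2: split method

# Step 1.3: Dictionary method - map each word to an integer id
Dict1 = {w: i for i, w in enumerate(sorted({w for doc in T1 for w in doc}))}

# Step 1.5: Doc2Bow method - each document becomes sparse (word id, count) pairs
V1 = [sorted(Counter(Dict1[w] for w in doc).items()) for doc in T1]

# Step 1.6: TF-IDF method - re-weight counts by inverse document frequency
n_docs = len(V1)
df = Counter(wid for doc in V1 for wid, _ in doc)   # document frequency
V2 = [[(wid, tf * math.log(n_docs / df[wid])) for wid, tf in doc] for doc in V1]
print(V2[2])  # the rare words "song" and "poetry" receive the largest weights
```

Saving Dict1 and V2 locally (Steps 1.4 and 1.7) would simply serialize these structures, e.g. with the standard pickle module.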
Step 2: apply the LSA algorithm to V2 to obtain the LSA model M1 and the index library I1. The specific method is as follows:
Step 2.1: load the tag weight vector set V3 from local, V3 = {v_k1, v_k2, …, v_kn}, where v_kB is the B-th weight vector of the entry tag weight vector set V3, B ∈ [1, n];
Step 2.2: load the dictionary Dict2 from local;
Step 2.3: define id2word = Dict2 and the topic number num_topics = 300;
Step 2.4: train V3 with the LSA method, passing in the parameters id2word and num_topics to obtain the model M1;
Step 2.5: process V3 through the model M1 to obtain the packaged corpus C1;
Step 2.6: build an index library over C1 to obtain the index library I1;
Step 2.7: save the model M1 and the index library I1.
Step 3: apply the random projection algorithm to V2 to obtain the RP model M2 and the index library I2. The specific method is as follows:
Step 3.1: load the tag weight vector set V4 from local, V4 = {v_l1, v_l2, …, v_ln}, where v_lC is the C-th weight vector of the entry tag weight vector set V4, C ∈ [1, n];
Step 3.2: define the topic number num_topics = 500;
Step 3.3: train V4 with the RP method, passing in the parameter num_topics to obtain the model M2;
Step 3.4: process V4 through the model M2 to obtain the packaged corpus C2;
Step 3.5: build an index library over C2 to obtain the index library I2;
Step 3.6: save the model M2 and the index library I2.
Step 4: process the corpus to be processed with TF-IDF, then apply LSA and RP to obtain the final recommendation set. The specific method is as follows:
Step 4.1: define D2 as the encyclopedia entry test set, D2 = {id2, title2, paragraph2, image2, url2, tag2}, where id2, title2, paragraph2, image2, url2 and tag2 represent the number, title, paragraph, picture link, web page link and entry tag, respectively;
Step 4.2: taking title2 as input, apply the split method to tag2 to obtain T2 = {w_j1, w_j2, …, w_jn}, where w_jD is the D-th entry tag set of the encyclopedia entry dataset, D ∈ [1, n];
Step 4.3: apply the Doc2Bow method to T2 to obtain the entry tag vector set V5 = {v_m1, v_m2, …, v_mn}, where v_mE is the E-th entry tag vector of V5, E ∈ [1, n];
Step 4.4: apply the TF-IDF method to V5 to obtain the entry tag weight vector set V6 = {v_o1, v_o2, …, v_on}, where v_oF is the F-th entry tag weight vector of V6, F ∈ [1, n];
Step 4.5: define the variable k = 1 as a loop variable for traversing V6;
Step 4.6: define the sets R1, R2 and R3, R1 = {sim_i1, sim_i2, …, sim_in}, R2 = {sim_j1, sim_j2, …, sim_jn}, and R3 is the empty set; sim_iG and sim_jG denote the G-th similarity sets in R1 and R2, respectively, with initial value null, G ∈ [1, n];
Step 4.7: import the LSA model M3 and the random projection model M4, and import the LSA index library I3 and the random projection index library I4;
Step 4.8: if k <= n, go to step 4.9; otherwise go to step 4.14;
Step 4.9: package v_ok with the LSA method to obtain vec1_k, and package v_ok with the random projection method to obtain vec2_k;
Step 4.10: retrieve the index library I3 with vec1_k, compute the cosine similarity between vec1_k and the elements of I3, and store the result in sim_ik; retrieve the index library I4 with vec2_k, compute the cosine similarity between vec2_k and the elements of I4, and store the result in sim_jk;
Step 4.11: add sim_ik and sim_jk element-wise and take the average to obtain sim_lk;
Step 4.12: insert sim_lk into R3;
Step 4.13: k = k + 1, go to step 4.8;
Step 4.14: take the 8 elements with the highest similarity from each set in R3 to form a set and store it in the result set R4; each element of R4 is a recommendation set.
The following is a table of variables according to the present invention:
table 1 global variable table
Table 2 step 1 variable table
Variable | Definition |
---|---|
D1 | Encyclopedia entry dataset |
id1 | Entry dataset number |
title1 | Entry dataset title |
paragraph1 | Entry dataset paragraph |
image1 | Entry dataset picture link |
url1 | Entry dataset web page link |
tag1 | Entry tag |
T1 | Collection of entry tag sets |
w_iA | The A-th entry tag set in the collection of entry tag sets |
Dict1 | Dictionary |
v_iA | The A-th entry tag vector of V1 |
v_jA | The A-th entry tag weight vector of V2 |
TABLE 3 step 2 variable table
Variable | Definition |
---|---|
V3 | Tag weight vector set |
v_kB | The B-th weight vector of V3 |
Dict2 | Dictionary |
id2word | Parameter 1 of the LSA method (the dictionary) |
num_topics | Parameter 2 of the LSA method |
C1 | Corpus obtained by LSA packaging of V3 |
Table 4 step 3 variable table
Variable | Definition |
---|---|
V4 | Tag weight vector set |
v_lC | The C-th weight vector of V4 |
num_topics | Parameter of the RP method |
C2 | Corpus obtained by RP packaging of V4 |
Table 5 Step 4 variable table
Variable | Definition |
---|---|
D2 | Encyclopedia entry test set |
T2 | Collection of entry tag sets |
V5 | Entry tag vector set of the test set |
V6 | Entry tag weight vector set of the test set |
k | Loop variable for traversing V6 |
R1, R2 | LSA and RP similarity sets |
R3 | Averaged similarity set |
R4 | Result set of recommendation sets |
vec1_k | LSA-packaged vector of v_ok |
vec2_k | RP-packaged vector of v_ok |
To illustrate the effectiveness of the method, 46,935 history entries were processed; the entry tags were projected into topic spaces with latent semantic analysis and random projection, and the results were calculated with cosine similarity. The method effectively overcomes the limitations of traditional text similarity calculation and optimizes it through the combination of the two algorithms.
The invention can be combined with a computer system to complete the text similarity calculation and output a recommendation set.
The invention creatively combines the latent semantic analysis method with the random projection method to project the text and obtain an optimal recommendation result.
The foregoing is merely an embodiment of the present invention and is not intended to limit it. All equivalents and alternatives falling within the spirit of the invention are intended to be included within its scope. Anything not described in detail in the invention should be considered prior art known to those skilled in the art.
Claims (2)
1. The text similarity calculation method based on latent semantic analysis and random projection is characterized by comprising the following steps:
(1) vectorize the tags to obtain the entry tag vector set V1, and apply the TF-IDF algorithm to obtain the entry tag weight vector set V2;
(2) apply the LSA algorithm to V2 to obtain the LSA model M1 and the index library I1;
(3) apply the random projection algorithm to V2 to obtain the RP model M2 and the index library I2;
(4) process the corpus to be processed with TF-IDF, then apply LSA and RP processing to obtain the final recommendation set;
the specific steps of obtaining the LSA model M1 and the index library I1 by applying the LSA algorithm to V2 in step (2) are as follows:
(2.1) load the entry tag weight vector set V2 from local, V2 = {v_k1, v_k2, …, v_kn}, where v_kB is the B-th weight vector of the entry tag weight vector set V2, B ∈ [1, n];
(2.2) load the dictionary Dict2 from local;
(2.3) define id2word = Dict2 and the topic number num_topics = 300;
(2.4) train V2 with the LSA method, passing in the parameters id2word and num_topics to obtain the model M1;
(2.5) process V2 through the model M1 to obtain the packaged corpus C1;
(2.6) build an index library over C1 to obtain the index library I1;
(2.7) save the model M1 and the index library I1;
the specific steps of obtaining the RP model M2 and the index library I2 by applying the random projection algorithm to V2 in step (3) are as follows:
(3.1) load the entry tag weight vector set V2 from local, V2 = {v_l1, v_l2, …, v_ln}, where v_lC is the C-th weight vector of the entry tag weight vector set V2, C ∈ [1, n];
(3.2) define the topic number num_topics = 500;
(3.3) train V2 with the RP method, passing in the parameter num_topics to obtain the model M2;
(3.4) process V2 through the model M2 to obtain the packaged corpus C2;
(3.5) build an index library over C2 to obtain the index library I2;
(3.6) save the model M2 and the index library I2;
the specific steps for obtaining the final recommendation set in step (4) are as follows:
(4.1) define D2 as the encyclopedia entry test set, D2 = {id2, title2, paragraph2, image2, url2, tag2}, where id2, title2, paragraph2, image2, url2 and tag2 represent the number, title, paragraph, picture link, web page link and entry tag, respectively;
(4.2) taking title2 as input, apply the split method to tag2 to obtain T2 = {w_j1, w_j2, …, w_jn}, where w_jD is the D-th entry tag set of the encyclopedia entry dataset, D ∈ [1, n];
(4.3) apply the Doc2Bow method to T2 to obtain the entry tag vector set V1 = {v_m1, v_m2, …, v_mn}, where v_mE is the E-th entry tag vector of the entry tag vector set V1, E ∈ [1, n];
(4.4) apply the TF-IDF method to V1 to obtain the entry tag weight vector set V2 = {v_o1, v_o2, …, v_on}, where v_oF is the F-th entry tag weight vector of the entry tag weight vector set V2, F ∈ [1, n];
(4.5) define the variable k = 1 as a loop variable for traversing V2;
(4.6) define the sets R1, R2 and R3, R1 = {sim_i1, sim_i2, …, sim_in}, R2 = {sim_j1, sim_j2, …, sim_jn}, and R3 is the empty set; sim_iG and sim_jG denote the G-th similarity sets in R1 and R2, respectively, with initial value null, G ∈ [1, n];
(4.7) import the LSA model M1 and the random projection model M2, and import the LSA index library I1 and the random projection index library I2;
(4.8) if k <= n, go to step (4.9); otherwise go to step (4.14);
(4.9) package v_ok with the LSA method to obtain vec1_k, and package v_ok with the random projection method to obtain vec2_k;
(4.10) retrieve the index library I1 with vec1_k, compute the cosine similarity between vec1_k and the elements of I1, and store the result in sim_ik; retrieve the index library I2 with vec2_k, compute the cosine similarity between vec2_k and the elements of I2, and store the result in sim_jk;
(4.11) add sim_ik and sim_jk element-wise and take the average to obtain sim_lk;
(4.12) insert sim_lk into R3;
(4.13) k = k + 1, go to step (4.8);
(4.14) take the 8 elements with the highest similarity from each set in R3 to form a set and store it in the result set R4; each element of R4 is a recommendation set.
2. The text similarity calculation method based on latent semantic analysis and random projection according to claim 1, wherein the specific steps of obtaining the entry tag weight vector set V2 in step (1) are as follows:
(1.1) define D1 as the encyclopedia entry dataset, D1 = {id1, title1, paragraph1, image1, url1, tag1}, where id1, title1, paragraph1, image1, url1 and tag1 represent the number, title, paragraph, picture link, web page link and entry tag, respectively;
(1.2) apply the split method to tag1 to obtain T1 = {w_i1, w_i2, …, w_in}, where w_iA is the A-th entry tag set of the encyclopedia entry dataset, A ∈ [1, n];
(1.3) apply the Dictionary method to T1 to obtain the dictionary Dict1;
(1.4) save the dictionary Dict1 locally;
(1.5) apply the Doc2Bow method to T1 to obtain the entry tag vector set V1 = {v_i1, v_i2, …, v_in}, where v_iA is the A-th entry tag vector of the entry tag vector set V1, A ∈ [1, n];
(1.6) apply the TF-IDF method to V1 to obtain the entry tag weight vector set V2 = {v_j1, v_j2, …, v_jn}, where v_jA is the A-th entry tag weight vector of the entry tag weight vector set V2, A ∈ [1, n];
(1.7) save the entry tag weight vector set V2 locally.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910598004.7A CN110399458B (en) | 2019-07-04 | 2019-07-04 | Text similarity calculation method based on latent semantic analysis and random projection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110399458A CN110399458A (en) | 2019-11-01 |
CN110399458B true CN110399458B (en) | 2023-05-26 |
Family
ID=68323669
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910598004.7A Active CN110399458B (en) | 2019-07-04 | 2019-07-04 | Text similarity calculation method based on latent semantic analysis and random projection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110399458B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111581378B (en) * | 2020-04-28 | 2024-04-26 | 中国工商银行股份有限公司 | Method and device for establishing user consumption label system based on transaction data |
CN112884053B (en) * | 2021-02-28 | 2022-04-15 | 江苏匠算天诚信息科技有限公司 | Website classification method, system, equipment and medium based on image-text mixed characteristics |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107273474A (en) * | 2017-06-08 | 2017-10-20 | 成都数联铭品科技有限公司 | Autoabstract abstracting method and system based on latent semantic analysis |
CN109165382A (en) * | 2018-08-03 | 2019-01-08 | 南京工业大学 | A kind of similar defect report recommended method that weighted words vector sum latent semantic analysis combines |
Non-Patent Citations (2)
Title |
---|
Dimensionality Reduction by Random Projection and Latent Semantic Indexing; Jessica Lin et al.; CiteSeer; 2003-12-31; pp. 1-10 *
Program code similarity detection based on latent semantic analysis; Wang Jin; Technology Innovation and Application; 2012-12-31; p. 60 *
Also Published As
Publication number | Publication date |
---|---|
CN110399458A (en) | 2019-11-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111914558B (en) | Course knowledge relation extraction method and system based on sentence bag attention remote supervision | |
CN109885692B (en) | Knowledge data storage method, apparatus, computer device and storage medium | |
CN107133213B (en) | Method and system for automatically extracting text abstract based on algorithm | |
CN106844658B (en) | Automatic construction method and system of Chinese text knowledge graph | |
CN106446148A (en) | Cluster-based text duplicate checking method | |
CN106126619A (en) | A kind of video retrieval method based on video content and system | |
CN105469096A (en) | Feature bag image retrieval method based on Hash binary code | |
CN113569050B (en) | Method and device for automatically constructing government affair field knowledge map based on deep learning | |
US9146988B2 (en) | Hierarchal clustering method for large XML data | |
CN113962293B (en) | LightGBM classification and representation learning-based name disambiguation method and system | |
Zhang et al. | Continuous word embeddings for detecting local text reuses at the semantic level | |
CN110399458B (en) | Text similarity calculation method based on latent semantic analysis and random projection | |
Odeh et al. | Arabic text categorization algorithm using vector evaluation method | |
CN112257386B (en) | Method for generating scene space relation information layout in text-to-scene conversion | |
Dawar et al. | Comparing topic modeling and named entity recognition techniques for the semantic indexing of a landscape architecture textbook | |
Ghosh | Sentiment analysis of IMDb movie reviews: A comparative study on performance of hyperparameter-tuned classification algorithms | |
Dhillon et al. | Semi-supervised multi-task learning of structured prediction models for web information extraction | |
CN105426490A (en) | Tree structure based indexing method | |
Pu et al. | A vision-based approach for deep web form extraction | |
CN114116953A (en) | Efficient semantic expansion retrieval method and device based on word vectors and storage medium | |
KR101240330B1 (en) | System and method for mutidimensional document classification | |
CN112836014A (en) | Multi-field interdisciplinary-oriented expert selection method | |
Wang et al. | Summarizing the differences from microblogs | |
Bakr et al. | Efficient incremental phrase-based document clustering | |
Kutty et al. | XML Documents Clustering Using Tensor Space Model--A Preliminary Study |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| EE01 | Entry into force of recordation of patent licensing contract | Application publication date: 20191101; Assignee: Fanyun software (Nanjing) Co.,Ltd.; Assignor: HUAIYIN INSTITUTE OF TECHNOLOGY; Contract record no.: X2023980052895; Denomination of invention: A Text Similarity Calculation Method Based on Latent Semantic Analysis and Random Projection; Granted publication date: 20230526; License type: Common License; Record date: 20231219 |