CN110399458B - Text similarity calculation method based on latent semantic analysis and random projection - Google Patents

Text similarity calculation method based on latent semantic analysis and random projection

Info

Publication number
CN110399458B
CN110399458B (application CN201910598004.7A)
Authority
CN
China
Prior art keywords
sim
weight vector
entry
tag
obtaining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910598004.7A
Other languages
Chinese (zh)
Other versions
CN110399458A (en)
Inventor
朱全银
吴思凯
王啸
赵建洋
宗慧
冯万利
周泓
丁瑾
陈伯伦
曹苏群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaiyin Institute of Technology
Original Assignee
Huaiyin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaiyin Institute of Technology filed Critical Huaiyin Institute of Technology
Priority to CN201910598004.7A priority Critical patent/CN110399458B/en
Publication of CN110399458A publication Critical patent/CN110399458A/en
Application granted granted Critical
Publication of CN110399458B publication Critical patent/CN110399458B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text similarity calculation method based on latent semantic analysis and random projection, suitable for general unsupervised text clustering problems. The method first converts the tag text to be processed into a bag-of-words model and weights it with the TF-IDF algorithm to obtain a set of weight vectors. It then processes the weight vector set with the LSA algorithm to obtain an LSA index library, and with a random projection algorithm to obtain an RP index library. Finally, the corpus to be calculated is processed with TF-IDF, transformed by the LSA and RP algorithms respectively, and compared with the contents of the index libraries to obtain the text similarity. The invention can calculate an effective similarity over text content and recommend related content through texts with higher similarity.

Description

Text similarity calculation method based on latent semantic analysis and random projection
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a text similarity calculation method based on latent semantic analysis and random projection.
Background
In conventional text recommendation algorithms, researchers typically perform similarity calculation with a common vector space model or similar techniques. By exploring the semantic associations among texts, the present method combines latent semantic analysis and random projection to provide a reliable text similarity calculation method for related systems.
Latent semantic analysis:
Latent semantic analysis is an algebraic model of information retrieval and a computational theory and method for knowledge acquisition and representation. It uses statistical methods to analyze large text collections, thereby extracting the latent semantic structure among words, and uses this structure to represent words and texts, so as to remove the correlations among words and simplify the text vectors, achieving dimension reduction.
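For reference, the dimension reduction in standard LSA is a truncated singular value decomposition of the m × n term-document weight matrix X (this formula is standard LSA background supplied here for clarity, not notation taken from the original patent text):

X ≈ U_k Σ_k V_k^T, with k ≪ min(m, n),

where U_k and V_k hold the first k left and right singular vectors, Σ_k is the diagonal matrix of the k largest singular values, and texts are subsequently compared in the k-dimensional latent space.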
Random projection:
Random projection is a simple and effective dimension reduction method. Unlike conventional feature extraction methods, the projection is constructed completely independently of the original sample data set, yet it does not significantly distort the data: the data after dimension reduction still retains the important feature information contained in the original high-dimensional data. Moreover, the transformation matrix is obtained without matrix decomposition, which greatly improves the ability to process data in real time.
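As a minimal illustration of this property, the sketch below projects high-dimensional vectors down with a Gaussian random matrix and checks that pairwise distances survive, as the Johnson-Lindenstrauss lemma predicts. It assumes scikit-learn's GaussianRandomProjection as one common realization; the library choice, dimensions, and data are illustrative and not part of the patent:

```python
# Illustrative only: random projection approximately preserves pairwise
# distances without computing any matrix decomposition.
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10000))        # 100 "texts" in a 10000-dim space

rp = GaussianRandomProjection(n_components=500, random_state=0)
X_low = rp.fit_transform(X)                  # reduce to 500 dimensions

d_high = np.linalg.norm(X[0] - X[1])         # distance in the original space
d_low = np.linalg.norm(X_low[0] - X_low[1])  # distance after projection
print(d_high, d_low)                         # the values agree up to small distortion
```

Because the projection matrix is simply drawn at random rather than computed from the data, generating it is far cheaper than an SVD, which is the source of the real-time advantage mentioned above.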
For the traditional text similarity comparison problem, existing papers mainly compare similarity through simple shared words, but this approach captures the semantic topic associations among documents poorly.
Disclosure of Invention
The invention aims to: aiming at the above problems, the invention provides a text similarity calculation method based on latent semantic analysis and random projection, which overcomes the limitations of traditional calculation methods, reduces the dimension of the vector space through multiple dimension reduction modes, and effectively improves the accuracy and reliability of the results.
The technical scheme is as follows: the invention provides a text similarity calculation method based on latent semantic analysis and random projection, which comprises the following steps:
(1) Vectorizing tag to obtain an entry tag vector set V1, and obtaining a tag weight vector set V2 by using a TF-IDF algorithm;
(2) applying the LSA algorithm to V2 to obtain an LSA model M1 and an index library I1;
(3) applying a random projection algorithm to V2 to obtain an RP model M2 and an index library I2;
(4) processing the corpus to be processed with TF-IDF, then performing LSA and RP processing to obtain a final recommendation set.
Further, the specific steps for obtaining the tag weight vector set V2 in the step (1) are as follows:
(1.1) defining D1 as an encyclopedia entry dataset, D1 = {id1, title1, paragraph1, image1, url1, tag1}, where id1, title1, paragraph1, image1, url1, tag1 represent the number, title, paragraph, picture link, web page link, and entry tag, respectively;
(1.2) obtaining T1 = {w_i1, w_i2, …, w_in} by applying the split method to tag1, where w_iA is the A-th entry tag set of the encyclopedia entry dataset, with A ∈ [1, n];
(1.3) obtaining dictionary Dict1 by applying the Dictionary method to T1;
(1.4) saving dictionary Dict1 locally;
(1.5) obtaining the entry tag vector set V1 = {v_i1, v_i2, …, v_in} by applying the Doc2Bow method to T1, where v_iA is the A-th entry tag vector of V1, with A ∈ [1, n];
(1.6) obtaining the entry tag weight vector set V2 = {v_j1, v_j2, …, v_jn} by applying the TF-IDF method to V1, where v_jA is the A-th entry tag weight vector of V2, with A ∈ [1, n];
(1.7) saving the tag weight vector set V2 locally.
Further, the specific steps of obtaining the LSA model M1 and the index library I1 by using the LSA algorithm for V2 in the step (2) are as follows:
(2.1) loading the tag weight vector set V3 from local storage, V3 = {v_k1, v_k2, …, v_kn}, where v_kB is the B-th weight vector of V3, with B ∈ [1, n];
(2.2) loading dictionary Dict2 from local storage;
(2.3) defining id2word = Dict2 and topic number num_topics = 300;
(2.4) training on V3 with the LSA method, passing in the parameters id2word and num_topics to obtain model M1;
(2.5) processing V3 through model M1 to obtain the wrapped corpus C1;
(2.6) building an index library on C1 to obtain index library I1;
(2.7) saving model M1 and index library I1.
Further, the specific steps of obtaining the RP model M2 and the index library I2 by using the random projection algorithm for V2 in the step (3) are as follows:
(3.1) loading the tag weight vector set V4 from local storage, V4 = {v_l1, v_l2, …, v_ln}, where v_lC is the C-th weight vector of V4, with C ∈ [1, n];
(3.2) defining topic number num_topics = 500;
(3.3) training on V4 with the RP method, passing in the parameter num_topics to obtain model M2;
(3.4) processing V4 through model M2 to obtain the wrapped corpus C2;
(3.5) building an index library on C2 to obtain index library I2;
(3.6) saving model M2 and index library I2.
Further, the specific steps for obtaining the final recommended set in the step (4) are as follows:
(4.1) defining D2 as an encyclopedia entry test set, D2 = {id2, title2, paragraph2, image2, url2, tag2}, where id2, title2, paragraph2, image2, url2, tag2 represent the number, title, paragraph, picture link, web page link, and entry tag, respectively;
(4.2) using title2 as input, obtaining T2 = {w_j1, w_j2, …, w_jn} by applying the split method to tag2, where w_jD is the D-th entry tag set of the encyclopedia entry test set, with D ∈ [1, n];
(4.3) obtaining the entry tag vector set V5 = {v_m1, v_m2, …, v_mn} by applying the Doc2Bow method to T2, where v_mE is the E-th entry tag vector of V5, with E ∈ [1, n];
(4.4) obtaining the entry tag weight vector set V6 = {v_o1, v_o2, …, v_on} by applying the TF-IDF method to V5, where v_oF is the F-th entry tag weight vector of V6, with F ∈ [1, n];
(4.5) defining the variable k = 1 as a loop variable for traversing V6;
(4.6) defining sets R1, R2, and R3, with R1 = {sim_i1, sim_i2, …, sim_in}, R2 = {sim_j1, sim_j2, …, sim_jn}, and R3 an empty set, where sim_iG and sim_jG respectively denote the G-th similarity set in R1 and R2, each with initial value null, with G ∈ [1, n];
(4.7) importing LSA model M3 and random projection model M4, and importing LSA index library I3 and random projection index library I4;
(4.8) if k <= n, going to step (4.9); otherwise going to step (4.14);
(4.9) wrapping v_ok with the LSA method to obtain vec1_k, and wrapping v_ok with the random projection method to obtain vec2_k;
(4.10) retrieving from index library I3 with vec1_k, using cosine similarity to compute the similarity between each element in I3 and vec1_k, and storing the result in sim_ik; retrieving from index library I4 with vec2_k, using cosine similarity to compute the similarity between each element in I4 and vec2_k, and storing the result in sim_jk;
(4.11) adding the corresponding elements of sim_ik and sim_jk and taking the average to obtain sim_lk;
(4.12) inserting sim_lk into R3;
(4.13) setting k = k + 1 and going to step (4.8);
(4.14) taking the 8 elements with the highest similarity in each set of R3 to form a set, and storing it into result set R4, where each element of R4 is a recommendation set.
The invention adopts the technical scheme and has the following beneficial effects:
according to the method, the text similarity calculation is carried out on the entry obtained from the hundred degrees encyclopedia by utilizing potential semantic analysis and random projection, the limitation of the traditional calculation method is changed, the vector space is reduced in dimension by utilizing various dimension reduction modes, and the accuracy and the reliability of the result are effectively improved.
Drawings
FIG. 1 is a general flow chart of the present invention;
FIG. 2 is a flowchart showing the TF-IDF algorithm in FIG. 1;
FIG. 3 is a specific flow chart of the LSA algorithm of FIG. 1;
FIG. 4 is a flowchart showing the random projection algorithm of FIG. 1;
fig. 5 is a flowchart showing the similarity recommendation function in fig. 1.
Detailed Description
The present invention is further illustrated below in conjunction with specific embodiments. It should be understood that these embodiments are intended only to illustrate the invention and not to limit its scope, and that after reading the invention, various equivalent modifications made by those skilled in the art fall within the scope defined by the claims appended hereto.
As shown in fig. 1 to 5, the text similarity calculation method based on latent semantic analysis and random projection according to the present invention includes the following steps:
Step 1: vectorize tag to obtain the entry tag vector set V1, and apply the TF-IDF algorithm to it to obtain the tag weight vector set V2. The specific method is as follows:
Step 1.1: define D1 as an encyclopedia entry dataset, D1 = {id1, title1, paragraph1, image1, url1, tag1}, where id1, title1, paragraph1, image1, url1, tag1 represent the number, title, paragraph, picture link, web page link, and entry tag, respectively;
Step 1.2: obtain T1 = {w_i1, w_i2, …, w_in} by applying the split method to tag1, where w_iA is the A-th entry tag set of the encyclopedia entry dataset, with A ∈ [1, n];
Step 1.3: obtain dictionary Dict1 by applying the Dictionary method to T1;
Step 1.4: save dictionary Dict1 locally;
Step 1.5: obtain the entry tag vector set V1 = {v_i1, v_i2, …, v_in} by applying the Doc2Bow method to T1, where v_iA is the A-th entry tag vector of V1, with A ∈ [1, n];
Step 1.6: obtain the entry tag weight vector set V2 = {v_j1, v_j2, …, v_jn} by applying the TF-IDF method to V1, where v_jA is the A-th entry tag weight vector of V2, with A ∈ [1, n];
Step 1.7: save the tag weight vector set V2 locally.
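One possible realization of steps 1.1 to 1.7 is sketched below with the gensim library, whose Dictionary, doc2bow, and TfidfModel calls correspond to the method names used above; the sample tags, the comma delimiter, and the file names are illustrative assumptions rather than details from the patent:

```python
from gensim import corpora, models

# Illustrative tag1 strings; the real dataset D1 comes from encyclopedia entries.
tags = ["history,Tang dynasty,emperor", "history,Song dynasty,economy"]
T1 = [t.split(",") for t in tags]              # step 1.2: split tag1 into tag sets

Dict1 = corpora.Dictionary(T1)                 # step 1.3: build the dictionary
Dict1.save("dict1.dict")                       # step 1.4: save it locally

V1 = [Dict1.doc2bow(doc) for doc in T1]        # step 1.5: entry tag vectors
tfidf = models.TfidfModel(V1)                  # step 1.6: fit TF-IDF weights
V2 = [tfidf[v] for v in V1]                    # ...and weight each vector

corpora.MmCorpus.serialize("v2.mm", V2)        # step 1.7: save V2 locally
```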
Step 2: apply the LSA algorithm to V2 to obtain the LSA model M1 and index library I1. The specific method is as follows:
Step 2.1: load the tag weight vector set V3 from local storage, V3 = {v_k1, v_k2, …, v_kn}, where v_kB is the B-th weight vector of V3, with B ∈ [1, n];
Step 2.2: load dictionary Dict2 from local storage;
Step 2.3: define id2word = Dict2 and topic number num_topics = 300;
Step 2.4: train on V3 with the LSA method, passing in the parameters id2word and num_topics to obtain model M1;
Step 2.5: process V3 through model M1 to obtain the wrapped corpus C1;
Step 2.6: build an index library on C1 to obtain index library I1;
Step 2.7: save model M1 and index library I1.
Step 3: apply a random projection algorithm to V2 to obtain the RP model M2 and index library I2. The specific method is as follows:
Step 3.1: load the tag weight vector set V4 from local storage, V4 = {v_l1, v_l2, …, v_ln}, where v_lC is the C-th weight vector of V4, with C ∈ [1, n];
Step 3.2: define topic number num_topics = 500;
Step 3.3: train on V4 with the RP method, passing in the parameter num_topics to obtain model M2;
Step 3.4: process V4 through model M2 to obtain the wrapped corpus C2;
Step 3.5: build an index library on C2 to obtain index library I2;
Step 3.6: save model M2 and index library I2.
Step 4: process the corpus to be processed with TF-IDF, then perform LSA and RP processing to obtain the final recommendation set. The specific method is as follows:
Step 4.1: define D2 as an encyclopedia entry test set, D2 = {id2, title2, paragraph2, image2, url2, tag2}, where id2, title2, paragraph2, image2, url2, tag2 represent the number, title, paragraph, picture link, web page link, and entry tag, respectively;
Step 4.2: using title2 as input, obtain T2 = {w_j1, w_j2, …, w_jn} by applying the split method to tag2, where w_jD is the D-th entry tag set of the encyclopedia entry test set, with D ∈ [1, n];
Step 4.3: obtain the entry tag vector set V5 = {v_m1, v_m2, …, v_mn} by applying the Doc2Bow method to T2, where v_mE is the E-th entry tag vector of V5, with E ∈ [1, n];
Step 4.4: obtain the entry tag weight vector set V6 = {v_o1, v_o2, …, v_on} by applying the TF-IDF method to V5, where v_oF is the F-th entry tag weight vector of V6, with F ∈ [1, n];
Step 4.5: define variable k = 1 as a loop variable for traversing V6;
Step 4.6: define sets R1, R2, and R3, with R1 = {sim_i1, sim_i2, …, sim_in}, R2 = {sim_j1, sim_j2, …, sim_jn}, and R3 an empty set, where sim_iG and sim_jG respectively denote the G-th similarity set in R1 and R2, each with initial value null, with G ∈ [1, n];
Step 4.7: import LSA model M3 and random projection model M4, and import LSA index library I3 and random projection index library I4;
Step 4.8: if k <= n, go to step 4.9; otherwise go to step 4.14;
Step 4.9: wrap v_ok with the LSA method to obtain vec1_k, and wrap v_ok with the random projection method to obtain vec2_k;
Step 4.10: retrieve from index library I3 with vec1_k, using cosine similarity to compute the similarity between each element in I3 and vec1_k, and store the result in sim_ik; retrieve from index library I4 with vec2_k, using cosine similarity to compute the similarity between each element in I4 and vec2_k, and store the result in sim_jk;
Step 4.11: add the corresponding elements of sim_ik and sim_jk and take the average to obtain sim_lk;
Step 4.12: insert sim_lk into R3;
Step 4.13: set k = k + 1 and go to step 4.8;
Step 4.14: take the 8 elements with the highest similarity in each set of R3 to form a set and store it into result set R4; each element of R4 is the recommendation set.
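Finally, a sketch of the scoring loop of steps 4.7 to 4.14 under the same gensim assumptions. A gensim index scores a query against every indexed document at once, so the per-element cosine computations of step 4.10 collapse into a single array operation; the query tags and file names are illustrative:

```python
import numpy as np
from gensim import corpora, models, similarities

Dict1 = corpora.Dictionary.load("dict1.dict")
tfidf = models.TfidfModel(dictionary=Dict1)            # TF-IDF weights for steps 4.3-4.4
M3 = models.LsiModel.load("lsa.model")                 # step 4.7: import the models
M4 = models.RpModel.load("rp.model")
I3 = similarities.MatrixSimilarity.load("lsa.index")   # ...and the index libraries
I4 = similarities.MatrixSimilarity.load("rp.index")

tag2 = "history,Ming dynasty,navigation"               # an assumed test entry's tags
v_ok = tfidf[Dict1.doc2bow(tag2.split(","))]           # steps 4.2-4.4: weight vector

vec1_k = M3[v_ok]                                      # step 4.9: LSA wrapping
vec2_k = M4[v_ok]                                      # step 4.9: RP wrapping
sim_ik = np.asarray(I3[vec1_k])                        # step 4.10: cosine similarities
sim_jk = np.asarray(I4[vec2_k])

sim_lk = (sim_ik + sim_jk) / 2.0                       # step 4.11: element-wise average
top8 = np.argsort(sim_lk)[::-1][:8]                    # step 4.14: 8 highest scores
print(top8, sim_lk[top8])                              # indices form the recommendation set
```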
The following are the variable tables of the present invention:
Table 1 (global variable table) appears only as an image in the original publication and is not reproduced here.
Table 2: Step 1 variable table

Variable    Definition
D1          Encyclopedia entry dataset
id1         Entry dataset number
title1      Entry dataset title
paragraph1  Entry dataset paragraph
image1      Entry dataset picture link
url1        Entry dataset web page link
tag1        Entry tag
T1          Collection of entry tag sets
w_iA        The A-th entry tag set in the collection of entry tag sets
Dict1       Dictionary
v_iA        The A-th entry tag vector of V1
v_jA        The A-th entry tag weight vector of V2
Table 3: Step 2 variable table

Variable    Definition
V3          Tag weight vector set
v_kB        The B-th weight vector of V3
Dict2       Dictionary
id2word     Parameter 1 of the LSA method
num_topics  Parameter 2 of the LSA method
C1          Corpus of V3 wrapped by LSA
Table 4: Step 3 variable table

Variable    Definition
V4          Tag weight vector set
v_lC        The C-th weight vector of V4
num_topics  Parameter 1 of the RP method
C2          Corpus of V4 wrapped by RP
Table 5 (step 4 variable table) appears only as an image in the original publication and is not reproduced here.
To illustrate the effectiveness of the method, 46,935 history entries were processed: the entry tags were projected into lower-dimensional spaces with latent semantic analysis and random projection, and the results were computed with cosine similarity. The method effectively overcomes the limitations of traditional text similarity calculation and optimizes the calculation through the combination of the two algorithms.
The invention can be combined with a computer system so as to complete text similarity calculation and output a recommendation set.
The invention creatively provides a method for combining a latent semantic analysis method with a random projection method to project a text so as to obtain an optimal recommendation result.
The foregoing is merely exemplary of the present invention and is not intended to limit it. All equivalents and alternatives falling within the spirit of the invention are intended to be included within its scope. Anything not described in detail in the invention should be considered as prior art known to those skilled in the art.

Claims (2)

1. The text similarity calculation method based on latent semantic analysis and random projection is characterized by comprising the following steps of:
(1) Vectorizing tag to obtain a vocabulary entry tag vector set V1, and obtaining a vocabulary entry tag weight vector set V2 by using a TF-IDF algorithm;
(2) Using an LSA algorithm to V2 to obtain an LSA model M1 and an index library I1;
(3) Using a random projection algorithm to V2 to obtain an RP model M2 and an index library I2;
(4) The method comprises the steps of using TF-IDF to process corpus to be processed, and performing LSA and RP processing to obtain a final recommendation set;
the specific steps of obtaining the LSA model M1 and the index library I1 by using the LSA algorithm for V2 in the step (2) are as follows:
(2.1) loading the entry tag weight vector set V2 from local storage, V2 = {v_k1, v_k2, …, v_kn}, where v_kB is the B-th weight vector of V2, with B ∈ [1, n];
(2.2) loading dictionary Dict2 from local storage;
(2.3) defining id2word = Dict2 and topic number num_topics = 300;
(2.4) training on V2 with the LSA method, passing in the parameters id2word and num_topics to obtain model M1;
(2.5) processing V2 through model M1 to obtain the wrapped corpus C1;
(2.6) building an index library on C1 to obtain index library I1;
(2.7) saving model M1 and index library I1;
the specific steps of obtaining the RP model M2 and the index library I2 by using a random projection algorithm to the V2 in the step (3) are as follows:
(3.1) loading the entry tag weight vector set V2 from local storage, V2 = {v_l1, v_l2, …, v_ln}, where v_lC is the C-th weight vector of V2, with C ∈ [1, n];
(3.2) defining topic number num_topics = 500;
(3.3) training on V2 with the RP method, passing in the parameter num_topics to obtain model M2;
(3.4) processing V2 through model M2 to obtain the wrapped corpus C2;
(3.5) building an index library on C2 to obtain index library I2;
(3.6) saving model M2 and index library I2;
the specific steps for obtaining the final recommended set in the step (4) are as follows:
(4.1) defining D2 as an encyclopedia entry test set, D2 = {id2, title2, paragraph2, image2, url2, tag2}, where id2, title2, paragraph2, image2, url2, tag2 represent the number, title, paragraph, picture link, web page link, and entry tag, respectively;
(4.2) using title2 as input, obtaining T2 = {w_j1, w_j2, …, w_jn} by applying the split method to tag2, where w_jD is the D-th entry tag set of the encyclopedia entry test set, with D ∈ [1, n];
(4.3) obtaining the entry tag vector set V1 = {v_m1, v_m2, …, v_mn} by applying the Doc2Bow method to T2, where v_mE is the E-th entry tag vector of V1, with E ∈ [1, n];
(4.4) obtaining the entry tag weight vector set V2 = {v_o1, v_o2, …, v_on} by applying the TF-IDF method to V1, where v_oF is the F-th entry tag weight vector of V2, with F ∈ [1, n];
(4.5) defining the variable k = 1 as a loop variable for traversing V2;
(4.6) defining sets R1, R2, and R3, with R1 = {sim_i1, sim_i2, …, sim_in}, R2 = {sim_j1, sim_j2, …, sim_jn}, and R3 an empty set, where sim_iG and sim_jG respectively denote the G-th similarity set in R1 and R2, each with initial value null, with G ∈ [1, n];
(4.7) importing LSA model M1 and random projection model M2, and importing LSA index library I1 and random projection index library I2;
(4.8) if k <= n, going to step (4.9); otherwise going to step (4.14);
(4.9) wrapping v_ok with the LSA method to obtain vec1_k, and wrapping v_ok with the random projection method to obtain vec2_k;
(4.10) retrieving from index library I1 with vec1_k, using cosine similarity to compute the similarity between each element in I1 and vec1_k, and storing the result in sim_ik; retrieving from index library I2 with vec2_k, using cosine similarity to compute the similarity between each element in I2 and vec2_k, and storing the result in sim_jk;
(4.11) adding the corresponding elements of sim_ik and sim_jk and taking the average to obtain sim_lk;
(4.12) inserting sim_lk into R3;
(4.13) setting k = k + 1 and going to step (4.8);
(4.14) taking the 8 elements with the highest similarity in each set of R3 to form a set, and storing it into result set R4, where each element of R4 is a recommendation set.
2. The text similarity calculation method based on latent semantic analysis and random projection according to claim 1, wherein the specific steps of obtaining the term tag weight vector set V2 in the step (1) are as follows:
(1.1) defining D1 as an encyclopedia entry dataset, D1 = {id1, title1, paragraph1, image1, url1, tag1}, where id1, title1, paragraph1, image1, url1, tag1 represent the number, title, paragraph, picture link, web page link, and entry tag, respectively;
(1.2) obtaining T1 = {w_i1, w_i2, …, w_in} by applying the split method to tag1, where w_iA is the A-th entry tag set of the encyclopedia entry dataset, with A ∈ [1, n];
(1.3) obtaining dictionary Dict1 by applying the Dictionary method to T1;
(1.4) saving dictionary Dict1 locally;
(1.5) obtaining the entry tag vector set V1 = {v_i1, v_i2, …, v_in} by applying the Doc2Bow method to T1, where v_iA is the A-th entry tag vector of V1, with A ∈ [1, n];
(1.6) obtaining the entry tag weight vector set V2 = {v_j1, v_j2, …, v_jn} by applying the TF-IDF method to V1, where v_jA is the A-th entry tag weight vector of V2, with A ∈ [1, n];
(1.7) saving the entry tag weight vector set V2 locally.
CN201910598004.7A 2019-07-04 2019-07-04 Text similarity calculation method based on latent semantic analysis and random projection Active CN110399458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910598004.7A CN110399458B (en) 2019-07-04 2019-07-04 Text similarity calculation method based on latent semantic analysis and random projection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910598004.7A CN110399458B (en) 2019-07-04 2019-07-04 Text similarity calculation method based on latent semantic analysis and random projection

Publications (2)

Publication Number Publication Date
CN110399458A CN110399458A (en) 2019-11-01
CN110399458B true CN110399458B (en) 2023-05-26

Family

ID=68323669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910598004.7A Active CN110399458B (en) 2019-07-04 2019-07-04 Text similarity calculation method based on latent semantic analysis and random projection

Country Status (1)

Country Link
CN (1) CN110399458B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111581378B (en) * 2020-04-28 2024-04-26 中国工商银行股份有限公司 Method and device for establishing user consumption label system based on transaction data
CN112884053B (en) * 2021-02-28 2022-04-15 江苏匠算天诚信息科技有限公司 Website classification method, system, equipment and medium based on image-text mixed characteristics

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273474A (en) * 2017-06-08 2017-10-20 成都数联铭品科技有限公司 Autoabstract abstracting method and system based on latent semantic analysis
CN109165382A (en) * 2018-08-03 2019-01-08 南京工业大学 A kind of similar defect report recommended method that weighted words vector sum latent semantic analysis combines

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273474A (en) * 2017-06-08 2017-10-20 成都数联铭品科技有限公司 Autoabstract abstracting method and system based on latent semantic analysis
CN109165382A (en) * 2018-08-03 2019-01-08 南京工业大学 A kind of similar defect report recommended method that weighted words vector sum latent semantic analysis combines

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Dimensionality Reduction by Random Projection and Latent Semantic Indexing; Jessica Lin et al.; CiteSeer; 2003-12-31; pp. 1-10 *
Program Code Similarity Detection Based on Latent Semantic Analysis (基于潜在语义分析的程序代码相似度检测); Wang Jin (汪瑾); Technology Innovation and Application (科技创新与应用); 2012-12-31; p. 60 *

Also Published As

Publication number Publication date
CN110399458A (en) 2019-11-01

Similar Documents

Publication Publication Date Title
CN111914558B (en) Course knowledge relation extraction method and system based on sentence bag attention remote supervision
CN109885692B (en) Knowledge data storage method, apparatus, computer device and storage medium
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN106844658B (en) Automatic construction method and system of Chinese text knowledge graph
CN106446148A (en) Cluster-based text duplicate checking method
CN106126619A (en) A kind of video retrieval method based on video content and system
CN105469096A (en) Feature bag image retrieval method based on Hash binary code
CN113569050B (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
US9146988B2 (en) Hierarchal clustering method for large XML data
CN113962293B (en) LightGBM classification and representation learning-based name disambiguation method and system
Zhang et al. Continuous word embeddings for detecting local text reuses at the semantic level
CN110399458B (en) Text similarity calculation method based on latent semantic analysis and random projection
Odeh et al. Arabic text categorization algorithm using vector evaluation method
CN112257386B (en) Method for generating scene space relation information layout in text-to-scene conversion
Dawar et al. Comparing topic modeling and named entity recognition techniques for the semantic indexing of a landscape architecture textbook
Ghosh Sentiment analysis of IMDb movie reviews: A comparative study on performance of hyperparameter-tuned classification algorithms
Dhillon et al. Semi-supervised multi-task learning of structured prediction models for web information extraction
CN105426490A (en) Tree structure based indexing method
Pu et al. A vision-based approach for deep web form extraction
CN114116953A (en) Efficient semantic expansion retrieval method and device based on word vectors and storage medium
KR101240330B1 (en) System and method for mutidimensional document classification
CN112836014A (en) Multi-field interdisciplinary-oriented expert selection method
Wang et al. Summarizing the differences from microblogs
Bakr et al. Efficient incremental phrase-based document clustering
Kutty et al. XML Documents Clustering Using Tensor Space Model--A Preliminary Study

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20191101

Assignee: Fanyun software (Nanjing) Co.,Ltd.

Assignor: HUAIYIN INSTITUTE OF TECHNOLOGY

Contract record no.: X2023980052895

Denomination of invention: A Text Similarity Calculation Method Based on Latent Semantic Analysis and Random Projection

Granted publication date: 20230526

License type: Common License

Record date: 20231219
