CN110399458A - Text similarity calculation method based on latent semantic analysis and random projection - Google Patents

Text similarity calculation method based on latent semantic analysis and random projection

Info

Publication number
CN110399458A
CN110399458A (application CN201910598004.7A)
Authority
CN
China
Prior art keywords
entry
collection
sim
weight vectors
index database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910598004.7A
Other languages
Chinese (zh)
Other versions
CN110399458B (en)
Inventor
朱全银
吴思凯
王啸
赵建洋
宗慧
冯万利
周泓
丁瑾
陈伯伦
曹苏群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaiyin Institute of Technology
Original Assignee
Huaiyin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaiyin Institute of Technology filed Critical Huaiyin Institute of Technology
Priority to CN201910598004.7A priority Critical patent/CN110399458B/en
Publication of CN110399458A publication Critical patent/CN110399458A/en
Application granted granted Critical
Publication of CN110399458B publication Critical patent/CN110399458B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text similarity calculation method based on latent semantic analysis (LSA) and random projection (RP), suitable for general unsupervised text clustering problems. The method first converts the labeled text to be processed into a bag-of-words representation and weights it with the TF-IDF algorithm to obtain a set of weight vectors. The weight vector set is then processed with the LSA algorithm to build an LSA index database, and with the random projection algorithm to build an RP index database. Finally, the corpus to be evaluated is given the same TF-IDF treatment, processed with both the LSA and RP algorithms, and compared against the contents of the two index databases to obtain text similarities. The invention computes effective similarities over text content and recommends related content from the most similar texts.

Description

Text similarity calculation method based on latent semantic analysis and random projection
Technical field
The invention belongs to the field of natural language processing, and in particular relates to a text similarity calculation method based on latent semantic analysis and random projection.
Background technique
In traditional text recommendation algorithms, researchers typically compute similarity with a common vector space model or similar techniques. By mining the semantic associations between texts and combining latent semantic analysis with random projection, the invention provides a reliable text similarity calculation method for related systems.
The existing research foundation of Zhu Quanyin et al. includes: Wanli Feng. Research of theme statement extraction for chinese literature based on lexical chain. International Journal of Multimedia and Ubiquitous Engineering, Vol.11, No.6 (2016), pp.379-388; Wanli Feng, Ying Li, Shangbing Gao, Yunyang Yan, Jianxun Xue. A novel flame edge detection algorithm via a novel active contour model. International Journal of Hybrid Information Technology, Vol.9, No.9 (2016), pp.275-282; Liu Jinling, Feng Wanli. A pattern-matching method based on feature dependence relationships [J]. Microelectronics and Computer, 2011, 28(12): 167-170; Liu Jinling, Feng Wanli, Zhang Yahong. Initializing cluster centers and reconstructing the scaling function for text clustering [J]. Computer Application Research, 2011, 28(11): 4115-4117; Liu Jinling, Feng Wanli, Zhang Yahong. A rescaling-based Chinese short-message text clustering method [J]. Computer Engineering and Applications, 2012, 48(21): 146-150; Zhu Quanyin, Pan Lu, Liu Wenru, et al. A web science and technology news classification and extraction algorithm [J]. Journal of Huaiyin Institute of Technology, 2015, 24(5): 18-24; Li Xiang, Zhu Quanyin. Collaborative filtering recommendation with a shared joint clustering and scoring matrix [J]. Computer Science and Exploration, 2014, 8(6): 751-759; Quanyin Zhu, Sunqun Cao. A Novel Classifier-independent Feature Selection Algorithm for Imbalanced Datasets. 2009, p:77-82; Quanyin Zhu, Yunyang Yan, Jin Ding, Jin Qian. The Case Study for Price Extracting of Mobile Phone Sell Online. 2011, p:282-285; Quanyin Zhu, Suqun Cao, Pei Zhou, Yunyang Yan, Hong Zhou. Integrated Price Forecast based on Dichotomy Backfilling and Disturbance Factor Algorithm. International Review on Computers and Software, 2011, Vol.6(6): 1089-1093.
Patents applied for, published, or granted by Zhu Quanyin, Feng Wanli, et al. include: Feng Wanli, Shao Heshuai, Zhuang Jun. An intelligent refrigerated-truck state-monitoring wireless network terminal device: CN203616634U [P]. 2014; Zhu Quanyin, Hu Rongjing, He Suqun, Zhou Pei, et al. A commodity price forecasting method based on linear interpolation and adaptive windows. Chinese patent: ZL 2011 1 0423015.5, 2015.07.01; Zhu Quanyin, Cao Suqun, Yan Yunyang, Hu Rongjing, et al. A commodity price forecasting method based on binary data repair and disturbance factors. Chinese patent: ZL 2011 1 0422274.6, 2013.01.02; Li Xiang, Zhu Quanyin, Hu Ronglin, Zhou Hong. A cold-chain logistics stowage intelligent recommendation method based on spectral clustering. Chinese Patent Publication No.: CN105654267A, 2016.06.08; Zhu Quanyin, Xin Cheng, Li Xiang, Xu Kang, et al. A network-behavior habit clustering method based on K-means and LDA bidirectional verification. Chinese Patent Publication No.: CN106202480A, 2016.12.07; Zhu Quanyin, Tang Haibo, Yan Yunyang, Li Xiang, Hu Ronglin, Qu Xuexin, Shao Wujie, Xu Kang, Zhao Yang, Qian Kai, Gao Yang. A deep-learning-based analysis method for users' literature reading interests. Chinese Patent Publication No.: CN108280114A, 2018.07.13; Zhu Quanyin, Hu Ronglin, Feng Wanli, Zhou Hong, et al. A knowledge-graph-based expert combination recommendation method. Chinese Patent Publication No.: CN109062961A, 2018.12.21.
Latent semantic analysis:
Latent semantic analysis is an algebraic model for information retrieval, and a computational theory and method for knowledge acquisition and representation. It analyzes a large text collection with statistical methods to extract the latent semantic structure between words, and then represents words and texts in terms of that latent structure, thereby eliminating correlations between words and reducing the dimensionality of the text vectors.
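The dimensionality reduction at the heart of LSA can be sketched with a truncated singular value decomposition. This is an illustrative toy example, not the patent's implementation: the matrix values and the topic count k are invented for demonstration.

```python
import numpy as np

# Toy term-document matrix (rows: terms, columns: documents); values invented
X = np.array([
    [2., 0., 1., 0.],
    [1., 1., 0., 0.],
    [0., 2., 1., 1.],
    [0., 0., 1., 2.],
])

# Truncated SVD keeps only the k strongest latent semantic dimensions
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
docs_latent = (np.diag(s[:k]) @ Vt[:k, :]).T  # each document as a k-dim vector

print(docs_latent.shape)  # each of the 4 documents is now 2-dimensional
```

Documents that share latent topics end up close together in this reduced space even when they share few literal words, which is what the patent relies on for semantic comparison.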
Accidental projection:
Random projection is a simple and effective dimensionality reduction method. Unlike representation-based feature extraction methods, the projection matrix is generated completely independently of the original sample data, so the projection does not cause severe distortion of the data. The data after dimensionality reduction still retains the important features of the original high-dimensional data, and no matrix decomposition is needed to obtain the transformation matrix, which greatly improves the ability to process data in real time.
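The distance-preserving behaviour described above can be sketched with a Gaussian random projection; the dimensions below are arbitrary and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, d, k = 100, 1000, 50
X = rng.normal(size=(n_docs, d))       # high-dimensional document vectors

# The projection matrix is drawn independently of the data: no decomposition
R = rng.normal(size=(d, k)) / np.sqrt(k)
X_low = X @ R                          # (100, 50) reduced vectors

# Pairwise distances are approximately preserved after projection
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(X_low[0] - X_low[1])
```

Because R is data-independent, the reduction costs a single matrix multiply, which is what gives random projection its real-time processing advantage over decomposition-based methods.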
For the traditional text similarity comparison problem, existing papers mainly compare similarity through simple shared words, but this method performs poorly when comparing documents that are related through semantic topics.
Summary of the invention
Purpose of the invention: in view of the above problems, the present invention provides a text similarity calculation method based on latent semantic analysis and random projection. It removes the limitations of traditional calculation methods and reduces the dimensionality of the vector space in multiple ways, effectively improving the accuracy and reliability of the results.
Technical solution: the present invention proposes a text similarity calculation method based on latent semantic analysis and random projection, comprising the following steps:
(1) vectorize the tags to obtain the entry tag vector set V1, and apply the TF-IDF algorithm to obtain the tag weight vector set V2;
(2) apply the LSA algorithm to V2 to obtain LSA model M1 and index database I1;
(3) apply the random projection algorithm to V2 to obtain RP model M2 and index database I2;
(4) process the corpus to be handled with TF-IDF, then apply LSA and RP processing to obtain the final recommendation set.
Further, the specific steps for obtaining the tag weight vector set V2 in step (1) are as follows:
(1.1) define D1 as the encyclopedia entry data set, D1 = {id1, title1, paragraph1, image1, url1, tag1}, where id1, title1, paragraph1, image1, url1 and tag1 respectively denote the number, title, paragraph, image link, web page link and entry tags;
(1.2) apply the split method to tag1 to obtain T1 = {w_i1, w_i2, …, w_in}, where w_iA is the A-th entry tag set of the encyclopedia entry data set, with A ∈ [1, n];
(1.3) apply the Dictionary method to T1 to obtain the dictionary Dict1;
(1.4) save the dictionary Dict1 locally;
(1.5) apply the Doc2Bow method to T1 to obtain the entry tag vector set V1 = {v_i1, v_i2, …, v_in}, where v_iA is the A-th entry tag vector of V1, with A ∈ [1, n];
(1.6) apply the TF-IDF method to V1 to obtain the entry tag weight vector set V2 = {v_j1, v_j2, …, v_jn}, where v_jA is the A-th entry tag weight vector of V2, with A ∈ [1, n];
(1.7) save the tag weight vector set V2 locally.
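Steps (1.2)-(1.6) can be sketched in plain Python. The tag strings below are hypothetical stand-ins for tag1, and this is an unsmoothed TF-IDF variant for illustration, not necessarily the exact weighting formula the patent's toolkit uses.

```python
import math

tags = ["history china dynasty", "history war", "china geography"]  # stand-in tag1
T1 = [t.split() for t in tags]                                      # step (1.2)

# Step (1.3): dictionary mapping each tag word to an integer id
Dict1 = {w: i for i, w in enumerate(sorted({w for doc in T1 for w in doc}))}

# Step (1.5): sparse bag-of-words vectors, the Doc2Bow equivalent (V1)
V1 = [{Dict1[w]: doc.count(w) for w in set(doc)} for doc in T1]

# Step (1.6): TF-IDF re-weighting to obtain the weight vector set V2
n = len(T1)
df = {i: sum(1 for bow in V1 if i in bow) for i in Dict1.values()}
V2 = [{i: tf * math.log(n / df[i]) for i, tf in bow.items()} for bow in V1]
```

Words shared by many entries (high document frequency) receive low weights, so the weight vectors emphasise the tags that actually distinguish one entry from another.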
Further, the specific steps for obtaining LSA model M1 and index database I1 from V2 in step (2) are as follows:
(2.1) load the tag weight vector set V3 from local storage, V3 = {v_k1, v_k2, …, v_kn}, where v_kB is the B-th weight vector of V3, with B ∈ [1, n];
(2.2) load the dictionary Dict2 from local storage;
(2.3) define id2word = Dict2 and the number of topics num_topics = 300;
(2.4) train V3 with the LSA method, passing in the parameters id2word and num_topics, to obtain model M1;
(2.5) process V3 with model M1 to obtain the transformed corpus C1;
(2.6) build an index over C1 to obtain index database I1;
(2.7) save model M1 and index database I1.
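Assuming the weight vectors are rows of a dense matrix, steps (2.4)-(2.6) can be sketched as an SVD-based projection plus a cosine-similarity index. The function name is invented, and the tiny topic count stands in for the patent's num_topics = 300 on real data.

```python
import numpy as np

def build_lsa_index(X, num_topics):
    """Project TF-IDF row vectors X into num_topics latent dimensions (the
    model M1) and L2-normalise them so that dot products later give cosine
    similarity (the index database I1)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    topics = Vt[:num_topics, :]                    # model M1: topic directions
    index = X @ topics.T                           # transformed corpus C1
    index /= np.linalg.norm(index, axis=1, keepdims=True)
    return topics, index

X = np.array([[1., 0., 1., 0.], [0., 1., 1., 0.], [1., 1., 0., 1.]])
M1, I1 = build_lsa_index(X, num_topics=2)
```

Normalising the index rows up front means a later query only needs one matrix-vector product to score every indexed entry.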
Further, the specific steps for obtaining RP model M2 and index database I2 from V2 in step (3) are as follows:
(3.1) load the tag weight vector set V4 from local storage, V4 = {v_l1, v_l2, …, v_ln}, where v_lC is the C-th weight vector of V4, with C ∈ [1, n];
(3.2) define the number of topics num_topics = 500;
(3.3) train V4 with the RP method, passing in the parameter num_topics, to obtain model M2;
(3.4) process V4 with model M2 to obtain the transformed corpus C2;
(3.5) build an index over C2 to obtain index database I2;
(3.6) save model M2 and index database I2.
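The analogous steps (3.3)-(3.5) replace the SVD with a data-independent random matrix. Again a hedged sketch with an invented function name and a small dimension standing in for the patent's num_topics = 500.

```python
import numpy as np

def build_rp_index(X, num_topics, seed=0):
    """Random-projection analogue of the LSA indexing: the projection matrix
    R (model M2) is drawn independently of the data X, and the projected,
    L2-normalised rows form the index database I2."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], num_topics)) / np.sqrt(num_topics)
    index = X @ R                                  # transformed corpus C2
    index /= np.linalg.norm(index, axis=1, keepdims=True)
    return R, index

X = np.array([[1., 0., 1., 0.], [0., 1., 1., 0.], [1., 1., 0., 1.]])
M2, I2 = build_rp_index(X, num_topics=2)
```

Fixing the random seed makes M2 reproducible, which matters here because queries must later be projected with exactly the same matrix that built the index.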
Further, the specific steps for obtaining the final recommendation set in step (4) are as follows:
(4.1) define D2 as the encyclopedia entry test set, D2 = {id2, title2, paragraph2, image2, url2, tag2}, where id2, title2, paragraph2, image2, url2 and tag2 respectively denote the number, title, paragraph, image link, web page link and entry tags;
(4.2) with title2 as input, apply the split method to tag2 to obtain T2 = {w_j1, w_j2, …, w_jn}, where w_jD is the D-th entry tag set of the encyclopedia entry data set, with D ∈ [1, n];
(4.3) apply the Doc2Bow method to T2 to obtain the entry tag vector set V5 = {v_m1, v_m2, …, v_mn}, where v_mE is the E-th entry tag vector of V5, with E ∈ [1, n];
(4.4) apply the TF-IDF method to V5 to obtain the entry tag weight vector set V6 = {v_o1, v_o2, …, v_on}, where v_oF is the F-th entry tag weight vector of V6, with F ∈ [1, n];
(4.5) define the variable k = 1 as a loop variable for traversing V6;
(4.6) define the sets R1, R2 and R3, with R1 = {sim_i1, sim_i2, …, sim_in}, R2 = {sim_j1, sim_j2, …, sim_jn}, and R3 the empty set; sim_iG and sim_jG denote the G-th similarity set in R1 and R2 respectively, both initially empty, with G ∈ [1, n];
(4.7) import LSA model M3 and random projection model M4, and import LSA index database I3 and random projection index database I4;
(4.8) if k ≤ n, go to step (4.9); otherwise go to step (4.14);
(4.9) transform v_ok with the LSA method to obtain vec1_k, and transform v_ok with the random projection method to obtain vec2_k;
(4.10) query index database I3 with vec1_k, compute the cosine similarities between vec1_k and the elements of I3, and store the resulting similarity set in sim_ik; query index database I4 with vec2_k, compute the cosine similarities between vec2_k and the elements of I4, and store the resulting similarity set in sim_jk;
(4.11) add the corresponding elements of sim_ik and sim_jk and average them to obtain sim_lk;
(4.12) insert sim_lk into R3;
(4.13) set k = k + 1 and go to step (4.8);
(4.14) from each set in R3, take the 8 elements with the highest similarity to form a set and store it in the result set R4; each element of R4 is a recommendation set.
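For a single query, the loop of steps (4.8)-(4.14) reduces to: project the query with both models, take cosine similarities against both indexes, average them, and keep the top entries. A sketch under the assumption that the index rows are already L2-normalised; the patent keeps the top 8, a smaller top_k is used in the toy call.

```python
import numpy as np

def recommend(vec_lsa, vec_rp, I3, I4, top_k=8):
    """Steps (4.9)-(4.14) for one query: sim_ik and sim_jk are the cosine
    similarities of the two projections against the LSA and RP index
    databases; sim_lk is their element-wise average, and the indices of the
    top_k most similar entries are returned."""
    sim_i = I3 @ (vec_lsa / np.linalg.norm(vec_lsa))
    sim_j = I4 @ (vec_rp / np.linalg.norm(vec_rp))
    sim_l = (sim_i + sim_j) / 2.0                  # step (4.11)
    return list(np.argsort(-sim_l)[:top_k])        # step (4.14)

# Tiny invented indexes with unit rows; entry 0 should match the query best
I3 = np.array([[1., 0.], [0., 1.]])
I4 = np.array([[1., 0.], [0., 1.]])
ranked = recommend(np.array([2., 0.]), np.array([3., 1.]), I3, I4, top_k=2)
```

Averaging the two similarity vectors is the point of the design: an entry must score well under both the SVD-based and the random-projection view to rank highly, which damps the noise of either projection alone.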
By adopting the above technical scheme, the present invention has the following beneficial effects:
The present invention performs text similarity calculation on entries obtained from Baidu Baike using latent semantic analysis and random projection. The method removes the limitations of traditional calculation methods and reduces the dimensionality of the vector space in multiple ways, effectively improving the accuracy and reliability of the results.
Detailed description of the invention
Fig. 1 is the overall flow chart of the invention;
Fig. 2 is the detailed flow chart of the TF-IDF algorithm in Fig. 1;
Fig. 3 is the detailed flow chart of the LSA algorithm in Fig. 1;
Fig. 4 is the detailed flow chart of the random projection algorithm in Fig. 1;
Fig. 5 is the detailed flow chart of the similarity recommendation function in Fig. 1.
Specific embodiment
The present invention is further elucidated below in combination with specific embodiments. It should be understood that these embodiments merely illustrate the present invention and do not limit its scope; after reading the present invention, modifications by those skilled in the art to its various equivalent forms fall within the scope defined by the appended claims of this application.
As shown in Fig. 1 to Fig. 5, a text similarity calculation method based on latent semantic analysis and random projection according to the present invention comprises the following steps:
Step 1: vectorize the tags to obtain the entry tag vector set V1, and apply the TF-IDF algorithm to obtain the tag weight vector set V2. The specific method is:
Step 1.1: define D1 as the encyclopedia entry data set, D1 = {id1, title1, paragraph1, image1, url1, tag1}, where id1, title1, paragraph1, image1, url1 and tag1 respectively denote the number, title, paragraph, image link, web page link and entry tags;
Step 1.2: apply the split method to tag1 to obtain T1 = {w_i1, w_i2, …, w_in}, where w_iA is the A-th entry tag set of the encyclopedia entry data set, with A ∈ [1, n];
Step 1.3: apply the Dictionary method to T1 to obtain the dictionary Dict1;
Step 1.4: save the dictionary Dict1 locally;
Step 1.5: apply the Doc2Bow method to T1 to obtain the entry tag vector set V1 = {v_i1, v_i2, …, v_in}, where v_iA is the A-th entry tag vector of V1, with A ∈ [1, n];
Step 1.6: apply the TF-IDF method to V1 to obtain the entry tag weight vector set V2 = {v_j1, v_j2, …, v_jn}, where v_jA is the A-th entry tag weight vector of V2, with A ∈ [1, n];
Step 1.7: save the tag weight vector set V2 locally.
Step 2: apply the LSA algorithm to V2 to obtain LSA model M1 and index database I1. The specific method is:
Step 2.1: load the tag weight vector set V3 from local storage, V3 = {v_k1, v_k2, …, v_kn}, where v_kB is the B-th weight vector of V3, with B ∈ [1, n];
Step 2.2: load the dictionary Dict2 from local storage;
Step 2.3: define id2word = Dict2 and the number of topics num_topics = 300;
Step 2.4: train V3 with the LSA method, passing in the parameters id2word and num_topics, to obtain model M1;
Step 2.5: process V3 with model M1 to obtain the transformed corpus C1;
Step 2.6: build an index over C1 to obtain index database I1;
Step 2.7: save model M1 and index database I1.
Step 3: apply the random projection algorithm to V2 to obtain RP model M2 and index database I2. The specific method is:
Step 3.1: load the tag weight vector set V4 from local storage, V4 = {v_l1, v_l2, …, v_ln}, where v_lC is the C-th weight vector of V4, with C ∈ [1, n];
Step 3.2: define the number of topics num_topics = 500;
Step 3.3: train V4 with the RP method, passing in the parameter num_topics, to obtain model M2;
Step 3.4: process V4 with model M2 to obtain the transformed corpus C2;
Step 3.5: build an index over C2 to obtain index database I2;
Step 3.6: save model M2 and index database I2.
Step 4: process the corpus to be handled with TF-IDF, then apply LSA and RP processing to obtain the final recommendation set. The specific method is:
Step 4.1: define D2 as the encyclopedia entry test set, D2 = {id2, title2, paragraph2, image2, url2, tag2}, where id2, title2, paragraph2, image2, url2 and tag2 respectively denote the number, title, paragraph, image link, web page link and entry tags;
Step 4.2: with title2 as input, apply the split method to tag2 to obtain T2 = {w_j1, w_j2, …, w_jn}, where w_jD is the D-th entry tag set of the encyclopedia entry data set, with D ∈ [1, n];
Step 4.3: apply the Doc2Bow method to T2 to obtain the entry tag vector set V5 = {v_m1, v_m2, …, v_mn}, where v_mE is the E-th entry tag vector of V5, with E ∈ [1, n];
Step 4.4: apply the TF-IDF method to V5 to obtain the entry tag weight vector set V6 = {v_o1, v_o2, …, v_on}, where v_oF is the F-th entry tag weight vector of V6, with F ∈ [1, n];
Step 4.5: define the variable k = 1 as a loop variable for traversing V6;
Step 4.6: define the sets R1, R2 and R3, with R1 = {sim_i1, sim_i2, …, sim_in}, R2 = {sim_j1, sim_j2, …, sim_jn}, and R3 the empty set; sim_iG and sim_jG denote the G-th similarity set in R1 and R2 respectively, both initially empty, with G ∈ [1, n];
Step 4.7: import LSA model M3 and random projection model M4, and import LSA index database I3 and random projection index database I4;
Step 4.8: if k ≤ n, go to step 4.9; otherwise go to step 4.14;
Step 4.9: transform v_ok with the LSA method to obtain vec1_k, and transform v_ok with the random projection method to obtain vec2_k;
Step 4.10: query index database I3 with vec1_k, compute the cosine similarities between vec1_k and the elements of I3, and store the resulting similarity set in sim_ik; query index database I4 with vec2_k, compute the cosine similarities between vec2_k and the elements of I4, and store the resulting similarity set in sim_jk;
Step 4.11: add the corresponding elements of sim_ik and sim_jk and average them to obtain sim_lk;
Step 4.12: insert sim_lk into R3;
Step 4.13: set k = k + 1 and go to step 4.8;
Step 4.14: from each set in R3, take the 8 elements with the highest similarity to form a set and store it in the result set R4; each element of R4 is a recommendation set.
The following are the variable tables of the present invention:

Table 1: Global variables

Variable      Definition
tag           Entry tags
V1            Entry tag vector set
V2            Tag weight vector set
M1            LSA model
I1            LSA index database
M2            RP model
I2            RP index database

Table 2: Step 1 variables

Variable      Definition
D1            Encyclopedia entry data set
id1           Entry data set number
title1        Entry data set title
paragraph1    Entry data set paragraph
image1        Entry data set image link
url1          Entry data set web page link
tag1          Entry tags
T1            Set of entry tag sets
w_iA          The A-th entry tag set in the set of entry tag sets
Dict1         Dictionary
v_iA          The A-th entry tag vector of V1
v_jA          The A-th entry tag weight vector of V2

Table 3: Step 2 variables

Variable      Definition
num_topics    Parameter of the LSA method
C1            Corpus obtained by applying LSA to V3

Table 4: Step 3 variables

Variable      Definition
V4            Tag weight vector set
v_lC          The C-th weight vector of V4
num_topics    Parameter of the RP method
C2            Corpus obtained by applying RP to V4

Table 5: Step 4 variables
To illustrate the validity of the method proposed by the present invention, 46,935 history entries were processed: the entry tags were projected into reduced spaces using latent semantic analysis and random projection, and the results were computed with cosine similarity. The method effectively overcomes the limitations of traditional text similarity calculation and optimizes text similarity calculation by combining the two algorithms.
The present invention can be combined with a computer system to complete the text similarity calculation and then output the recommendation set.
The invention proposes a method that combines latent semantic analysis with random projection to project text and obtain the best recommendation results.
The above description is only an embodiment of the present invention and is not intended to restrict the invention. All equivalent replacements made within the principle of the present invention shall be included in its protection scope. Content not elaborated in the present invention belongs to the prior art well known to those skilled in the art.

Claims (5)

1. A text similarity calculation method based on latent semantic analysis and random projection, characterized by comprising the following steps:
(1) vectorize the tags to obtain the entry tag vector set V1, and apply the TF-IDF algorithm to obtain the tag weight vector set V2;
(2) apply the LSA algorithm to V2 to obtain LSA model M1 and index database I1;
(3) apply the random projection algorithm to V2 to obtain RP model M2 and index database I2;
(4) process the corpus to be handled with TF-IDF, then apply LSA and RP processing to obtain the final recommendation set.
2. The text similarity calculation method based on latent semantic analysis and random projection according to claim 1, characterized in that the specific steps for obtaining the tag weight vector set V2 in step (1) are as follows:
(1.1) define D1 as the encyclopedia entry data set, D1 = {id1, title1, paragraph1, image1, url1, tag1}, where id1, title1, paragraph1, image1, url1 and tag1 respectively denote the number, title, paragraph, image link, web page link and entry tags;
(1.2) apply the split method to tag1 to obtain T1 = {w_i1, w_i2, …, w_in}, where w_iA is the A-th entry tag set of the encyclopedia entry data set, with A ∈ [1, n];
(1.3) apply the Dictionary method to T1 to obtain the dictionary Dict1;
(1.4) save the dictionary Dict1 locally;
(1.5) apply the Doc2Bow method to T1 to obtain the entry tag vector set V1 = {v_i1, v_i2, …, v_in}, where v_iA is the A-th entry tag vector of V1, with A ∈ [1, n];
(1.6) apply the TF-IDF method to V1 to obtain the entry tag weight vector set V2 = {v_j1, v_j2, …, v_jn}, where v_jA is the A-th entry tag weight vector of V2, with A ∈ [1, n];
(1.7) save the tag weight vector set V2 locally.
3. The text similarity calculation method based on latent semantic analysis and random projection according to claim 1, characterized in that the specific steps for obtaining LSA model M1 and index database I1 from V2 in step (2) are as follows:
(2.1) load the tag weight vector set V3 from local storage, V3 = {v_k1, v_k2, …, v_kn}, where v_kB is the B-th weight vector of V3, with B ∈ [1, n];
(2.2) load the dictionary Dict2 from local storage;
(2.3) define id2word = Dict2 and the number of topics num_topics = 300;
(2.4) train V3 with the LSA method, passing in the parameters id2word and num_topics, to obtain model M1;
(2.5) process V3 with model M1 to obtain the transformed corpus C1;
(2.6) build an index over C1 to obtain index database I1;
(2.7) save model M1 and index database I1.
4. The text similarity calculation method based on latent semantic analysis and random projection according to claim 1, characterized in that the specific steps for obtaining RP model M2 and index database I2 from V2 in step (3) are as follows:
(3.1) load the tag weight vector set V4 from local storage, V4 = {v_l1, v_l2, …, v_ln}, where v_lC is the C-th weight vector of V4, with C ∈ [1, n];
(3.2) define the number of topics num_topics = 500;
(3.3) train V4 with the RP method, passing in the parameter num_topics, to obtain model M2;
(3.4) process V4 with model M2 to obtain the transformed corpus C2;
(3.5) build an index over C2 to obtain index database I2;
(3.6) save model M2 and index database I2.
5. The text similarity calculation method based on latent semantic analysis and random projection according to claim 1, characterized in that the specific steps for obtaining the final recommendation set in step (4) are as follows:
(4.1) define D2 as the encyclopedia entry test set, D2 = {id2, title2, paragraph2, image2, url2, tag2}, where id2, title2, paragraph2, image2, url2 and tag2 respectively denote the number, title, paragraph, image link, web page link and entry tags;
(4.2) with title2 as input, apply the split method to tag2 to obtain T2 = {w_j1, w_j2, …, w_jn}, where w_jD is the D-th entry tag set of the encyclopedia entry data set, with D ∈ [1, n];
(4.3) apply the Doc2Bow method to T2 to obtain the entry tag vector set V5 = {v_m1, v_m2, …, v_mn}, where v_mE is the E-th entry tag vector of V5, with E ∈ [1, n];
(4.4) apply the TF-IDF method to V5 to obtain the entry tag weight vector set V6 = {v_o1, v_o2, …, v_on}, where v_oF is the F-th entry tag weight vector of V6, with F ∈ [1, n];
(4.5) define the variable k = 1 as a loop variable for traversing V6;
(4.6) define the sets R1, R2 and R3, with R1 = {sim_i1, sim_i2, …, sim_in}, R2 = {sim_j1, sim_j2, …, sim_jn}, and R3 the empty set; sim_iG and sim_jG denote the G-th similarity set in R1 and R2 respectively, both initially empty, with G ∈ [1, n];
(4.7) import LSA model M3 and random projection model M4, and import LSA index database I3 and random projection index database I4;
(4.8) if k ≤ n, go to step (4.9); otherwise go to step (4.14);
(4.9) transform v_ok with the LSA method to obtain vec1_k, and transform v_ok with the random projection method to obtain vec2_k;
(4.10) query index database I3 with vec1_k, compute the cosine similarities between vec1_k and the elements of I3, and store the resulting similarity set in sim_ik; query index database I4 with vec2_k, compute the cosine similarities between vec2_k and the elements of I4, and store the resulting similarity set in sim_jk;
(4.11) add the corresponding elements of sim_ik and sim_jk and average them to obtain sim_lk;
(4.12) insert sim_lk into R3;
(4.13) set k = k + 1 and go to step (4.8);
(4.14) from each set in R3, take the 8 elements with the highest similarity to form a set and store it in the result set R4; each element of R4 is a recommendation set.
CN201910598004.7A 2019-07-04 2019-07-04 Text similarity calculation method based on latent semantic analysis and random projection Active CN110399458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910598004.7A CN110399458B (en) 2019-07-04 2019-07-04 Text similarity calculation method based on latent semantic analysis and random projection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910598004.7A CN110399458B (en) 2019-07-04 2019-07-04 Text similarity calculation method based on latent semantic analysis and random projection

Publications (2)

Publication Number Publication Date
CN110399458A true CN110399458A (en) 2019-11-01
CN110399458B CN110399458B (en) 2023-05-26

Family

ID=68323669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910598004.7A Active CN110399458B (en) 2019-07-04 2019-07-04 Text similarity calculation method based on latent semantic analysis and random projection

Country Status (1)

Country Link
CN (1) CN110399458B (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273474A (en) * 2017-06-08 2017-10-20 成都数联铭品科技有限公司 Autoabstract abstracting method and system based on latent semantic analysis
CN109165382A (en) * 2018-08-03 2019-01-08 南京工业大学 A kind of similar defect report recommended method that weighted words vector sum latent semantic analysis combines

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JESSICA LIN et al.: "Dimensionality Reduction by Random Projection and Latent Semantic Indexing", CiteSeer *
WANG JIN: "Program Code Similarity Detection Based on Latent Semantic Analysis", Technology Innovation and Application *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111581378A (en) * 2020-04-28 2020-08-25 中国工商银行股份有限公司 Method and device for establishing user consumption label system based on transaction data
CN111581378B (en) * 2020-04-28 2024-04-26 中国工商银行股份有限公司 Method and device for establishing user consumption label system based on transaction data
CN112884053A (en) * 2021-02-28 2021-06-01 江苏匠算天诚信息科技有限公司 Website classification method, system, equipment and medium based on image-text mixed characteristics
CN112884053B (en) * 2021-02-28 2022-04-15 江苏匠算天诚信息科技有限公司 Website classification method, system, equipment and medium based on image-text mixed characteristics

Also Published As

Publication number Publication date
CN110399458B (en) 2023-05-26

Similar Documents

Publication Publication Date Title
CN109815308B (en) Method and device for determining intention recognition model and method and device for searching intention recognition
CN107705066B (en) Information input method and electronic equipment during commodity warehousing
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN104636402B Method and system for classifying, searching, and pushing business objects
CN106446148A (en) Cluster-based text duplicate checking method
CN106156272A Information retrieval method based on multi-source semantic analysis
Sarawagi et al. Open-domain quantity queries on web tables: annotation, response, and consensus models
CN103049433A (en) Automatic question answering method, automatic question answering system and method for constructing question answering case base
CN101206674A (en) Enhancement type related search system and method using commercial articles as medium
Zhang et al. Continuous word embeddings for detecting local text reuses at the semantic level
CN109840325A Text semantic similarity measurement method based on mutual information
CN110399458A Text similarity calculation method based on latent semantic analysis and random projection
CN107656920A Method for recommending skilled personnel based on patents
CN104679784A (en) O2B intelligent searching method and system
CN115115049A (en) Neural network model training method, apparatus, device, medium, and program product
Meng et al. Concept-concept association information integration and multi-model collaboration for multimedia semantic concept detection
Senthilkumar et al. A Survey On Feature Selection Method For Product Review
CN111581378A (en) Method and device for establishing user consumption label system based on transaction data
Sato et al. Text classification and transfer learning based on character-level deep convolutional neural networks
CN114153965A (en) Content and map combined public opinion event recommendation method, system and terminal
Yu et al. Computer Image Content Retrieval considering K‐Means Clustering Algorithm
Pang A personalized recommendation algorithm for semantic classification of new book recommendation services for university libraries
Prathyusha et al. Normalization Methods for Multiple Sources of Data
CN114022233A (en) Novel commodity recommendation method
Dastgheib et al. Persian text classification enhancement by latent semantic space

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20191101

Assignee: Fanyun software (Nanjing) Co.,Ltd.

Assignor: HUAIYIN INSTITUTE OF TECHNOLOGY

Contract record no.: X2023980052895

Denomination of invention: A Text Similarity Calculation Method Based on Latent Semantic Analysis and Random Projection

Granted publication date: 20230526

License type: Common License

Record date: 20231219