CN110399458A - Text similarity calculation method based on latent semantic analysis and random projection - Google Patents
Text similarity calculation method based on latent semantic analysis and random projection
- Publication number
- CN110399458A CN110399458A CN201910598004.7A CN201910598004A CN110399458A CN 110399458 A CN110399458 A CN 110399458A CN 201910598004 A CN201910598004 A CN 201910598004A CN 110399458 A CN110399458 A CN 110399458A
- Authority
- CN
- China
- Prior art keywords
- entry
- collection
- sim
- weight vectors
- index database
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text similarity calculation method based on latent semantic analysis and random projection, suitable for general unsupervised text clustering problems. The invention first converts the label text to be processed into a bag-of-words model and weights it with the TF-IDF algorithm to obtain a weight vector set. The weight vector set is then processed with the LSA algorithm to obtain an LSA index database, and with the random projection algorithm to obtain an RP index database. Finally, the corpus to be evaluated is given the same TF-IDF treatment, processed by the LSA and RP algorithms respectively, and compared against the contents of the two index databases to obtain the text similarity. The invention can perform effective similarity calculation on text content and recommend related content based on the texts with the highest similarity.
Description
Technical field
The invention belongs to the field of natural language processing, and in particular relates to a text similarity calculation method based on latent semantic analysis and random projection.
Background art
In traditional text recommendation algorithms, researchers typically compute similarity with a common vector space model or the like. By mining the semantic associations between texts, the invention combines latent semantic analysis and random projection to provide related systems with a reliable text similarity calculation method.
The existing research foundation of Zhu Quanyin et al. includes:
- Wanli Feng. Research of theme statement extraction for chinese literature based on lexical chain. International Journal of Multimedia and Ubiquitous Engineering, Vol.11, No.6 (2016), pp.379-388;
- Wanli Feng, Ying Li, Shangbing Gao, Yunyang Yan, Jianxun Xue. A novel flame edge detection algorithm via a novel active contour model. International Journal of Hybrid Information Technology, Vol.9, No.9 (2016), pp.275-282;
- Liu Jinling, Feng Wanli. A pattern matching method based on feature dependence relationships [J]. Microelectronics and Computer, 2011, 28(12): 167-170;
- Liu Jinling, Feng Wanli, Zhang Yahong. Initializing cluster centers and reconstructing scaling functions for text clustering [J]. Application Research of Computers, 2011, 28(11): 4115-4117;
- Liu Jinling, Feng Wanli, Zhang Yahong. A Chinese short message text clustering method based on rescaling [J]. Computer Engineering and Applications, 2012, 48(21): 146-150;
- Zhu Quanyin, Pan Lu, Liu Wenru, et al. A web science and technology news classification and extraction algorithm [J]. Journal of Huaiyin Institute of Technology, 2015, 24(5): 18-24;
- Li Xiang, Zhu Quanyin. Collaborative filtering recommendation with joint clustering and rating-matrix sharing [J]. Journal of Frontiers of Computer Science and Technology, 2014, 8(6): 751-759;
- Quanyin Zhu, Sunqun Cao. A Novel Classifier-independent Feature Selection Algorithm for Imbalanced Datasets. 2009, pp.77-82;
- Quanyin Zhu, Yunyang Yan, Jin Ding, Jin Qian. The Case Study for Price Extracting of Mobile Phone Sell Online. 2011, pp.282-285;
- Quanyin Zhu, Suqun Cao, Pei Zhou, Yunyang Yan, Hong Zhou. Integrated Price Forecast based on Dichotomy Backfilling and Disturbance Factor Algorithm. International Review on Computers and Software, 2011, Vol.6(6): 1089-1093.
Patents applied for, published, or granted by Zhu Quanyin, Feng Wanli, et al. include:
- Feng Wanli, Shao Heshuai, Zhuang Jun. An intelligent refrigerated-truck state monitoring wireless network terminal device: CN203616634U [P]. 2014;
- Zhu Quanyin, Hu Rongjing, He Suqun, Zhou Pei, et al. A commodity price forecasting method based on linear interpolation and adaptive windows. Chinese patent: ZL 2011 1 0423015.5, 2015.07.01;
- Zhu Quanyin, Cao Suqun, Yan Yunyang, Hu Rong, et al. A commodity price forecasting method based on binary data repair and disturbance factors. Chinese patent: ZL 2011 1 0422274.6, 2013.01.02;
- Li Xiang, Zhu Quanyin, Hu Ronglin, Zhou Hong. A cold-chain logistics stowage intelligent recommendation method based on spectral clustering. Chinese Patent Publication No.: CN105654267A, 2016.06.08;
- Zhu Quanyin, Xin Cheng, Li Xiang, Xu Kang, et al. A network behavior habit clustering method based on K-means and LDA bidirectional verification. Chinese Patent Publication No.: CN106202480A, 2016.12.07;
- Zhu Quanyin, Tang Haibo, Yan Yunyang, Li Xiang, Hu Ronglin, Qu Xuexin, Shao Wujie, Xu Kang, Zhao Yang, Qian Kai, Gao Yang. A user document reading interest analysis method based on deep learning. Chinese Patent Publication No.: CN108280114A, 2018.07.13;
- Zhu Quanyin, Hu Ronglin, Feng Wanli, Zhou Hong, et al. A knowledge-graph-based expert portfolio recommendation method. Chinese Patent Publication No.: CN109062961A, 2018.12.21.
Latent semantic analysis:
Latent semantic analysis (LSA) is an algebraic model for information retrieval, and a computational theory and method for knowledge acquisition and representation. It analyzes a large text collection with statistical methods to extract the latent semantic structure between words, and represents words and texts in terms of that latent structure. This eliminates correlations between words and simplifies the text vectors, achieving dimensionality reduction.
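The dimensionality reduction underlying LSA can be sketched with a truncated singular value decomposition. The following is a minimal numpy illustration (the toy matrix and rank are hypothetical, and this is not the patent's actual implementation):

```python
import numpy as np

# Toy term-document weight matrix (rows = terms, columns = documents).
X = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
])

def lsa_project(X, k):
    """Project documents into a k-dimensional latent semantic space
    by keeping only the k largest singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Columns of the result are the documents' latent-space coordinates.
    return np.diag(s[:k]) @ Vt[:k, :]

docs_2d = lsa_project(X, k=2)
print(docs_2d.shape)  # each of the 4 documents is now a 2-dimensional vector
```

The truncation discards small singular directions, which is what removes word-level noise and correlation before similarities are computed.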
Random projection:
Random projection is a simple and effective dimensionality reduction method that differs from representation-based feature extraction methods: the projection matrix is generated completely independently of the raw sample data, so it does not cause serious distortion of the data. The data after dimensionality reduction still retain the important feature information of the original high-dimensional data, and no matrix decomposition is needed to obtain the transformation matrix, which greatly improves the ability to process data in real time.
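A minimal sketch of the random projection idea described above, assuming a Gaussian random matrix (the dimensions and seed are illustrative; the patent does not specify the projection distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_projection(X, k):
    """Reduce the d-dimensional rows of X to k dimensions with a Gaussian
    random matrix. Unlike SVD-based reduction, no decomposition or
    training pass over the data is required."""
    d = X.shape[1]
    R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(d, k))
    return X @ R

X = rng.normal(size=(100, 1000))   # 100 samples in 1000 dimensions
Y = random_projection(X, k=50)     # same 100 samples in 50 dimensions
print(Y.shape)
```

By the Johnson-Lindenstrauss lemma, pairwise distances are approximately preserved with high probability, which is why such a projection can feed a similarity index without a costly decomposition.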
For the traditional text similarity comparison problem, existing papers mainly compare similarity through simple shared words, but this approach performs poorly when comparing documents that are related through semantic topics.
Summary of the invention
Object of the invention: in view of the above problems, the present invention provides a text similarity calculation method based on latent semantic analysis and random projection. It removes the limitations of traditional calculation methods by applying several forms of dimensionality reduction to the vector space, effectively improving the accuracy and reliability of the results.
Technical solution: the present invention proposes a text similarity calculation method based on latent semantic analysis and random projection, comprising the following steps:
(1) vectorize the labels to obtain the entry label vector set V1, and apply the TF-IDF algorithm to it to obtain the label weight vector set V2;
(2) apply the LSA algorithm to V2 to obtain the LSA model M1 and index database I1;
(3) apply the random projection algorithm to V2 to obtain the RP model M2 and index database I2;
(4) process the corpus to be evaluated with TF-IDF, then with LSA and RP, and obtain the final recommendation set.
Further, the specific steps for obtaining the label weight vector set V2 in step (1) are as follows:
(1.1) define D1 as the encyclopedia entry data set, D1 = {id1, title1, paragraph1, image1, url1, tag1}, where id1, title1, paragraph1, image1, url1 and tag1 respectively denote the number, title, paragraph, image link, web page link and entry label;
(1.2) apply the split method to tag1 to obtain T1 = {w_i1, w_i2, …, w_in}, where w_iA is the A-th entry tag set of the encyclopedia entry data set, A ∈ [1, n];
(1.3) apply the Dictionary method to T1 to obtain the dictionary Dict1;
(1.4) save the dictionary Dict1 locally;
(1.5) apply the Doc2Bow method to T1 to obtain the entry label vector set V1 = {v_i1, v_i2, …, v_in}, where v_iA is the A-th entry label vector of V1, A ∈ [1, n];
(1.6) apply the TF-IDF method to V1 to obtain the entry label weight vector set V2 = {v_j1, v_j2, …, v_jn}, where v_jA is the A-th entry label weight vector of V2, A ∈ [1, n];
(1.7) save the label weight vector set V2 locally.
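Steps (1.2)-(1.6) can be sketched in plain Python. This is an illustrative stand-in, not the patent's implementation: the sample tags are invented, and the tf * log(N/df) idf variant is an assumption (libraries differ in smoothing and normalisation):

```python
import math
from collections import Counter

def build_dictionary(tag_sets):
    """Assign an integer id to every distinct tag (the role of step 1.3)."""
    vocab = sorted({t for tags in tag_sets for t in tags})
    return {t: i for i, t in enumerate(vocab)}

def doc2bow(tags, dictionary):
    """Bag-of-words vector: sorted (tag_id, count) pairs (the role of step 1.5)."""
    counts = Counter(tags)
    return sorted((dictionary[t], c) for t, c in counts.items())

def tfidf(bow, doc_freq, n_docs):
    """Weight each (id, count) pair as tf * log(N / df) (the role of step 1.6).
    The exact idf formula here is an assumption."""
    return [(i, c * math.log(n_docs / doc_freq[i])) for i, c in bow]

tag_sets = [["history", "war"], ["history", "art"], ["art", "music"]]  # invented sample
dictionary = build_dictionary(tag_sets)
bows = [doc2bow(tags, dictionary) for tags in tag_sets]            # plays the role of V1
doc_freq = Counter(i for bow in bows for i, _ in bow)
V2 = [tfidf(bow, doc_freq, len(tag_sets)) for bow in bows]         # plays the role of V2
print(V2[0])
```

Tags shared by many entries ("history", "art") receive low weights, while rare tags ("war", "music") dominate the weight vectors, which is exactly what the later similarity comparison relies on.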
Further, the specific steps for applying the LSA algorithm to V2 in step (2) to obtain the LSA model M1 and index database I1 are as follows:
(2.1) load the label weight vector set V3 from local storage, V3 = {v_k1, v_k2, …, v_kn}, where v_kB is the B-th weight vector of the entry label weight vector set V3, B ∈ [1, n];
(2.2) load the dictionary Dict2 from local storage;
(2.3) define id2word = Dict2 and the number of topics num_topics = 300;
(2.4) train V3 with the LSA method, passing in the parameters id2word and num_topics, to obtain the model M1;
(2.5) process V3 through the model M1 to obtain the packaged corpus C1;
(2.6) build an index over C1 to obtain the index database I1;
(2.7) save the model M1 and index database I1.
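The index databases of steps (2.6) and (3.5) exist so that a query vector can be compared against every stored vector by cosine similarity. A minimal stand-in (the vectors are invented; a production system would presumably use a library similarity index) pre-normalises the rows so each query is a single matrix product:

```python
import numpy as np

def build_index(vectors):
    """Store L2-normalised rows, so one matrix product later yields the
    cosine similarity against every indexed vector (role of I1/I2)."""
    V = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return V / np.where(norms == 0.0, 1.0, norms)

def query(index, v):
    """Cosine similarity of query vector v against every indexed vector."""
    v = np.asarray(v, dtype=float)
    n = np.linalg.norm(v)
    return index @ (v / n if n else v)

index = build_index([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # invented sample
sims = query(index, [1.0, 0.0])
print(sims)
```

The first indexed vector is identical to the query (similarity 1), the second is orthogonal (similarity 0), and the third lies at 45 degrees (similarity about 0.707).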
Further, the specific steps for applying the random projection algorithm to V2 in step (3) to obtain the RP model M2 and index database I2 are as follows:
(3.1) load the label weight vector set V4 from local storage, V4 = {v_l1, v_l2, …, v_ln}, where v_lC is the C-th weight vector of the entry label weight vector set V4, C ∈ [1, n];
(3.2) define the number of topics num_topics = 500;
(3.3) train V4 with the RP method, passing in the parameter num_topics, to obtain the model M2;
(3.4) process V4 through the model M2 to obtain the packaged corpus C2;
(3.5) build an index over C2 to obtain the index database I2;
(3.6) save the model M2 and index database I2.
Further, the specific steps for obtaining the final recommendation set in step (4) are as follows:
(4.1) define D2 as the encyclopedia entry test set, D2 = {id2, title2, paragraph2, image2, url2, tag2}, where id2, title2, paragraph2, image2, url2 and tag2 respectively denote the number, title, paragraph, image link, web page link and entry label;
(4.2) with title2 as input, apply the split method to tag2 to obtain T2 = {w_j1, w_j2, …, w_jn}, where w_jD is the D-th entry tag set of the encyclopedia entry test set, D ∈ [1, n];
(4.3) apply the Doc2Bow method to T2 to obtain the entry label vector set V5 = {v_m1, v_m2, …, v_mn}, where v_mE is the E-th entry label vector of V5, E ∈ [1, n];
(4.4) apply the TF-IDF method to V5 to obtain the entry label weight vector set V6 = {v_o1, v_o2, …, v_on}, where v_oF is the F-th entry label weight vector of V6, F ∈ [1, n];
(4.5) define the variable k = 1 as the loop variable for traversing V6;
(4.6) define the sets R1, R2 and R3, R1 = {sim_i1, sim_i2, …, sim_in}, R2 = {sim_j1, sim_j2, …, sim_jn}, and R3 is the empty set; sim_iG and sim_jG denote the G-th similarity set in R1 and R2 respectively and are initially empty, G ∈ [1, n];
(4.7) import the LSA model M3 and the random projection model M4, and import the LSA index database I3 and the random projection index database I4;
(4.8) if k ≤ n, go to step (4.9); otherwise go to step (4.14);
(4.9) package v_ok with the LSA method to obtain vec1_k, and package v_ok with the random projection method to obtain vec2_k;
(4.10) query the index database I3 with vec1_k, compute the cosine similarity between vec1_k and each element of I3, and store the similarity set in sim_ik; query the index database I4 with vec2_k, compute the cosine similarity between vec2_k and each element of I4, and store the similarity set in sim_jk;
(4.11) add the corresponding elements of sim_ik and sim_jk and average them to obtain sim_lk;
(4.12) insert sim_lk into R3;
(4.13) set k = k + 1 and go to step (4.8);
(4.14) take the 8 elements with the highest similarity from each set in R3 and store them in the result set R4; each element of R4 is a recommendation set.
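Steps (4.11)-(4.14) — averaging the LSA and RP similarity sets and keeping the most similar entries — can be sketched as follows (the similarity values are invented, and top_n is set to 2 here rather than the patent's 8, purely to keep the example small):

```python
def fuse_and_recommend(sim_lsa, sim_rp, top_n=8):
    """Average the two similarity lists element-wise (step 4.11) and
    return the indices of the top_n most similar entries (step 4.14)."""
    fused = [(a + b) / 2 for a, b in zip(sim_lsa, sim_rp)]
    order = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return order[:top_n]

sim_lsa = [0.9, 0.1, 0.5, 0.7]   # invented cosine similarities from the LSA index
sim_rp  = [0.8, 0.3, 0.6, 0.2]   # invented cosine similarities from the RP index
print(fuse_and_recommend(sim_lsa, sim_rp, top_n=2))  # → [0, 2]
```

Note that entry 3 scores well under LSA alone (0.7) but poorly under RP (0.2), so the averaging demotes it; combining the two projections is exactly how the method guards against the weaknesses of either reduction on its own.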
By adopting the above technical scheme, the present invention has the following beneficial effects:
The present invention uses latent semantic analysis and random projection to calculate text similarity on entries obtained from Baidu Baike. The method removes the limitations of traditional calculation methods by applying several forms of dimensionality reduction to the vector space, effectively improving the accuracy and reliability of the results.
Detailed description of the invention
Fig. 1 is the overall flowchart of the invention;
Fig. 2 is the detailed flowchart of the TF-IDF algorithm in Fig. 1;
Fig. 3 is the detailed flowchart of the LSA algorithm in Fig. 1;
Fig. 4 is the detailed flowchart of the random projection algorithm in Fig. 1;
Fig. 5 is the detailed flowchart of the similarity recommendation function in Fig. 1.
Specific embodiment
The present invention is further elucidated below in conjunction with specific embodiments. It should be understood that these embodiments are only intended to illustrate the present invention, not to limit its scope. After reading the present invention, modifications by those skilled in the art to the various equivalent forms of the invention all fall within the scope defined by the appended claims.
As shown in Figs. 1-5, a text similarity calculation method based on latent semantic analysis and random projection according to the present invention comprises the following steps:
Step 1: vectorize the labels to obtain the entry label vector set V1, and apply the TF-IDF algorithm to it to obtain the label weight vector set V2. The specific method is:
Step 1.1: define D1 as the encyclopedia entry data set, D1 = {id1, title1, paragraph1, image1, url1, tag1}, where id1, title1, paragraph1, image1, url1 and tag1 respectively denote the number, title, paragraph, image link, web page link and entry label;
Step 1.2: apply the split method to tag1 to obtain T1 = {w_i1, w_i2, …, w_in}, where w_iA is the A-th entry tag set of the encyclopedia entry data set, A ∈ [1, n];
Step 1.3: apply the Dictionary method to T1 to obtain the dictionary Dict1;
Step 1.4: save the dictionary Dict1 locally;
Step 1.5: apply the Doc2Bow method to T1 to obtain the entry label vector set V1 = {v_i1, v_i2, …, v_in}, where v_iA is the A-th entry label vector of V1, A ∈ [1, n];
Step 1.6: apply the TF-IDF method to V1 to obtain the entry label weight vector set V2 = {v_j1, v_j2, …, v_jn}, where v_jA is the A-th entry label weight vector of V2, A ∈ [1, n];
Step 1.7: save the label weight vector set V2 locally.
Step 2: apply the LSA algorithm to V2 to obtain the LSA model M1 and index database I1. The specific method is:
Step 2.1: load the label weight vector set V3 from local storage, V3 = {v_k1, v_k2, …, v_kn}, where v_kB is the B-th weight vector of the entry label weight vector set V3, B ∈ [1, n];
Step 2.2: load the dictionary Dict2 from local storage;
Step 2.3: define id2word = Dict2 and the number of topics num_topics = 300;
Step 2.4: train V3 with the LSA method, passing in the parameters id2word and num_topics, to obtain the model M1;
Step 2.5: process V3 through the model M1 to obtain the packaged corpus C1;
Step 2.6: build an index over C1 to obtain the index database I1;
Step 2.7: save the model M1 and index database I1.
Step 3: apply the random projection algorithm to V2 to obtain the RP model M2 and index database I2. The specific method is:
Step 3.1: load the label weight vector set V4 from local storage, V4 = {v_l1, v_l2, …, v_ln}, where v_lC is the C-th weight vector of the entry label weight vector set V4, C ∈ [1, n];
Step 3.2: define the number of topics num_topics = 500;
Step 3.3: train V4 with the RP method, passing in the parameter num_topics, to obtain the model M2;
Step 3.4: process V4 through the model M2 to obtain the packaged corpus C2;
Step 3.5: build an index over C2 to obtain the index database I2;
Step 3.6: save the model M2 and index database I2.
Step 4: process the corpus to be evaluated with TF-IDF, then with LSA and RP, and obtain the final recommendation set. The specific method is:
Step 4.1: define D2 as the encyclopedia entry test set, D2 = {id2, title2, paragraph2, image2, url2, tag2}, where id2, title2, paragraph2, image2, url2 and tag2 respectively denote the number, title, paragraph, image link, web page link and entry label;
Step 4.2: with title2 as input, apply the split method to tag2 to obtain T2 = {w_j1, w_j2, …, w_jn}, where w_jD is the D-th entry tag set of the encyclopedia entry test set, D ∈ [1, n];
Step 4.3: apply the Doc2Bow method to T2 to obtain the entry label vector set V5 = {v_m1, v_m2, …, v_mn}, where v_mE is the E-th entry label vector of V5, E ∈ [1, n];
Step 4.4: apply the TF-IDF method to V5 to obtain the entry label weight vector set V6 = {v_o1, v_o2, …, v_on}, where v_oF is the F-th entry label weight vector of V6, F ∈ [1, n];
Step 4.5: define the variable k = 1 as the loop variable for traversing V6;
Step 4.6: define the sets R1, R2 and R3, R1 = {sim_i1, sim_i2, …, sim_in}, R2 = {sim_j1, sim_j2, …, sim_jn}, and R3 is the empty set; sim_iG and sim_jG denote the G-th similarity set in R1 and R2 respectively and are initially empty, G ∈ [1, n];
Step 4.7: import the LSA model M3 and the random projection model M4, and import the LSA index database I3 and the random projection index database I4;
Step 4.8: if k ≤ n, go to step 4.9; otherwise go to step 4.14;
Step 4.9: package v_ok with the LSA method to obtain vec1_k, and package v_ok with the random projection method to obtain vec2_k;
Step 4.10: query the index database I3 with vec1_k, compute the cosine similarity between vec1_k and each element of I3, and store the similarity set in sim_ik; query the index database I4 with vec2_k, compute the cosine similarity between vec2_k and each element of I4, and store the similarity set in sim_jk;
Step 4.11: add the corresponding elements of sim_ik and sim_jk and average them to obtain sim_lk;
Step 4.12: insert sim_lk into R3;
Step 4.13: set k = k + 1 and go to step 4.8;
Step 4.14: take the 8 elements with the highest similarity from each set in R3 and store them in the result set R4; each element of R4 is a recommendation set.
The variable tables of the present invention are as follows:

Table 1. Global variables

Variable | Definition |
---|---|
tag | Entry label |
V1 | Entry label vector set |
V2 | Label weight vector set |
M1 | LSA model |
I1 | LSA index database |
M2 | RP model |
I2 | RP index database |

Table 2. Step 1 variables

Variable | Definition |
---|---|
D1 | Encyclopedia entry data set |
id1 | Entry data set number |
title1 | Entry data set title |
paragraph1 | Entry data set paragraph |
image1 | Entry data set image link |
url1 | Entry data set web page link |
tag1 | Entry label |
T1 | The set of entry tag sets |
w_iA | The A-th entry tag set in T1 |
Dict1 | Dictionary |
v_iA | The A-th entry label vector of V1 |
v_jA | The A-th entry label weight vector of V2 |

Table 3. Step 2 variables

Variable | Definition |
---|---|
num_topics | Topic-number parameter of the LSA method |
C1 | Corpus obtained by LSA packaging of V3 |

Table 4. Step 3 variables

Variable | Definition |
---|---|
V4 | Label weight vector set |
v_lC | The C-th weight vector of V4 |
num_topics | Topic-number parameter of the RP method |
C2 | Corpus obtained by RP packaging of V4 |

Table 5. Step 4 variables
To illustrate the effectiveness of the proposed method, data processing was performed on 46,935 history entries: the entry labels were projected using latent semantic analysis and random projection, and the results were calculated using cosine similarity. The method effectively overcomes the limitations of traditional text similarity calculation and optimizes the text similarity calculation by combining the two algorithms.
The present invention can be combined with a computer system to complete the text similarity calculation and output the recommendation set.
The invention proposes a method that combines latent semantic analysis with random projection to project the text and obtain the best recommendation results.
The above description is only an embodiment of the present invention and is not intended to restrict the invention. All equivalent replacements made within the principle of the present invention shall be included in its protection scope. Content not elaborated in the present invention belongs to the prior art well known to those skilled in the art.
Claims (5)
1. A text similarity calculation method based on latent semantic analysis and random projection, characterized by comprising the following steps:
(1) vectorize the labels to obtain the entry label vector set V1, and apply the TF-IDF algorithm to it to obtain the label weight vector set V2;
(2) apply the LSA algorithm to V2 to obtain the LSA model M1 and index database I1;
(3) apply the random projection algorithm to V2 to obtain the RP model M2 and index database I2;
(4) process the corpus to be evaluated with TF-IDF, then with LSA and RP, and obtain the final recommendation set.
2. The text similarity calculation method based on latent semantic analysis and random projection according to claim 1, characterized in that the specific steps for obtaining the label weight vector set V2 in step (1) are as follows:
(1.1) define D1 as the encyclopedia entry data set, D1 = {id1, title1, paragraph1, image1, url1, tag1}, where id1, title1, paragraph1, image1, url1 and tag1 respectively denote the number, title, paragraph, image link, web page link and entry label;
(1.2) apply the split method to tag1 to obtain T1 = {w_i1, w_i2, …, w_in}, where w_iA is the A-th entry tag set of the encyclopedia entry data set, A ∈ [1, n];
(1.3) apply the Dictionary method to T1 to obtain the dictionary Dict1;
(1.4) save the dictionary Dict1 locally;
(1.5) apply the Doc2Bow method to T1 to obtain the entry label vector set V1 = {v_i1, v_i2, …, v_in}, where v_iA is the A-th entry label vector of V1, A ∈ [1, n];
(1.6) apply the TF-IDF method to V1 to obtain the entry label weight vector set V2 = {v_j1, v_j2, …, v_jn}, where v_jA is the A-th entry label weight vector of V2, A ∈ [1, n];
(1.7) save the label weight vector set V2 locally.
3. The text similarity calculation method based on latent semantic analysis and random projection according to claim 1, characterized in that the specific steps for applying the LSA algorithm to V2 in step (2) to obtain the LSA model M1 and index database I1 are as follows:
(2.1) load the label weight vector set V3 from local storage, V3 = {v_k1, v_k2, …, v_kn}, where v_kB is the B-th weight vector of the entry label weight vector set V3, B ∈ [1, n];
(2.2) load the dictionary Dict2 from local storage;
(2.3) define id2word = Dict2 and the number of topics num_topics = 300;
(2.4) train V3 with the LSA method, passing in the parameters id2word and num_topics, to obtain the model M1;
(2.5) process V3 through the model M1 to obtain the packaged corpus C1;
(2.6) build an index over C1 to obtain the index database I1;
(2.7) save the model M1 and index database I1.
4. The text similarity calculation method based on latent semantic analysis and random projection according to claim 1, characterized in that the specific steps for applying the random projection algorithm to V2 in step (3) to obtain the RP model M2 and index database I2 are as follows:
(3.1) load the label weight vector set V4 from local storage, V4 = {v_l1, v_l2, …, v_ln}, where v_lC is the C-th weight vector of the entry label weight vector set V4, C ∈ [1, n];
(3.2) define the number of topics num_topics = 500;
(3.3) train V4 with the RP method, passing in the parameter num_topics, to obtain the model M2;
(3.4) process V4 through the model M2 to obtain the packaged corpus C2;
(3.5) build an index over C2 to obtain the index database I2;
(3.6) save the model M2 and index database I2.
5. a kind of Text similarity computing method based on latent semantic analysis and accidental projection according to claim 1,
It is characterized in that the final recommendation set in step (4) is obtained as follows:
(4.1) Define D2 as the encyclopedia entry test set, D2 = {id2, title2, paragraph2, image2, url2, tag2}, where id2, title2, paragraph2, image2, url2 and tag2 denote the number, title, paragraph, image link, web page link and entry tags, respectively;
(4.2) With title2 as input, apply the split method to tag2 to obtain T2 = {w_j1, w_j2, …, w_jn}, where w_jD is the D-th entry tag set of the encyclopedia entry data set, D ∈ [1, n];
(4.3) Apply the Doc2Bow method to T2 to obtain the entry tag vector set V5 = {v_m1, v_m2, …, v_mn}, where v_mE is the E-th entry tag vector of V5, E ∈ [1, n];
(4.4) Apply the TF-IDF method to V5 to obtain the entry tag weight vector set V6 = {v_o1, v_o2, …, v_on}, where v_oF is the F-th entry tag weight vector of V6, F ∈ [1, n];
(4.5) Define the cyclic variable k = 1 for traversing V6;
(4.6) Define sets R1, R2 and R3, with R1 = {sim_i1, sim_i2, …, sim_in}, R2 = {sim_j1, sim_j2, …, sim_jn} and R3 the empty set, where sim_iG and sim_jG denote the G-th similarity sets in R1 and R2 and are initially empty, G ∈ [1, n];
(4.7) Import the LSA model M3 and the random projection model M4, and import the LSA index database I3 and the random projection index database I4;
(4.8) If k ≤ n, go to step (4.9); otherwise go to step (4.14);
(4.9) Package v_ok with the LSA method to obtain vec1_k, and package v_ok with the random projection method to obtain vec2_k;
(4.10) Query index database I3 with vec1_k and compute by cosine similarity the similarity set between vec1_k and the elements of I3, storing it in sim_ik; query index database I4 with vec2_k and compute by cosine similarity the similarity set between vec2_k and the elements of I4, storing it in sim_jk;
(4.11) Add the corresponding elements of sim_ik and sim_jk and average them to obtain sim_lk;
(4.12) Insert sim_lk into R3;
(4.13) Set k = k + 1 and go to step (4.8);
(4.14) Take the 8 highest-similarity elements of each set in R3 to form a set stored in result set R4; each element of R4 is a recommendation set.
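Steps (4.2)–(4.14) above can be sketched as follows. This is a minimal illustration, not the patented implementation: the toy tag corpus, the 2-dimensional projection spaces, and the top-k cutoff (2 here instead of the patent's 8) are assumptions, and plain NumPy stands in for the trained LSA model M3, random projection model M4, and index databases I3/I4.

```python
# Minimal sketch of the dual LSA + random-projection similarity pipeline.
# Corpus, dimensions and cutoff are illustrative assumptions.
import numpy as np

def doc2bow_tfidf(docs):
    """(4.3)-(4.4): bag-of-words counts weighted by TF-IDF."""
    vocab = sorted({w for d in docs for w in d})
    idx = {w: i for i, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for r, d in enumerate(docs):
        for w in d:
            tf[r, idx[w]] += 1
    df = (tf > 0).sum(axis=0)
    return tf * np.log(len(docs) / df)

def cosine_sims(q, index):
    """(4.10): cosine similarity of query q against every row of index."""
    den = np.linalg.norm(index, axis=1) * (np.linalg.norm(q) + 1e-12) + 1e-12
    return (index @ q) / den

docs = [t.split() for t in [
    "machine learning algorithm",
    "natural language text processing",
    "text similarity latent semantic analysis",
    "random projection dimensionality reduction",
]]
V6 = doc2bow_tfidf(docs)               # entry tag weight vector set

# LSA "model": truncated SVD of the weight matrix (rank 2 here)
U, S, Vt = np.linalg.svd(V6, full_matrices=False)
lsa = lambda v: v @ Vt[:2].T           # project into 2-d latent space

# Random-projection "model": a fixed Gaussian projection matrix
rng = np.random.default_rng(0)
P = rng.normal(size=(V6.shape[1], 2))
rp = lambda v: v @ P

I3 = np.array([lsa(v) for v in V6])    # LSA index database
I4 = np.array([rp(v) for v in V6])     # random-projection index database

# (4.8)-(4.13): query both indexes, average the two similarity sets
R3 = [(cosine_sims(lsa(v), I3) + cosine_sims(rp(v), I4)) / 2 for v in V6]

# (4.14): top-k most similar entries per query form the recommendation set
R4 = [np.argsort(-s)[:2].tolist() for s in R3]
print(R4)
```

Averaging the two similarity sets is the fusion step of (4.11): LSA captures latent topical structure while random projection preserves pairwise distances cheaply, so the mean of the two cosine scores hedges against the weaknesses of either projection alone.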
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910598004.7A CN110399458B (en) | 2019-07-04 | 2019-07-04 | Text similarity calculation method based on latent semantic analysis and random projection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110399458A true CN110399458A (en) | 2019-11-01 |
CN110399458B CN110399458B (en) | 2023-05-26 |
Family
ID=68323669
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910598004.7A Active CN110399458B (en) | 2019-07-04 | 2019-07-04 | Text similarity calculation method based on latent semantic analysis and random projection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110399458B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107273474A (en) * | 2017-06-08 | 2017-10-20 | 成都数联铭品科技有限公司 | Automatic summarization method and system based on latent semantic analysis |
CN109165382A (en) * | 2018-08-03 | 2019-01-08 | 南京工业大学 | Similar defect report recommendation method combining weighted word vectors and latent semantic analysis |
Non-Patent Citations (2)
Title |
---|
Jessica Lin et al.: "Dimensionality Reduction by Random Projection and Latent Semantic Indexing", CiteSeer *
Wang Jin: "Program code similarity detection based on latent semantic analysis", Technology Innovation and Application *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111581378A (en) * | 2020-04-28 | 2020-08-25 | 中国工商银行股份有限公司 | Method and device for establishing user consumption label system based on transaction data |
CN111581378B (en) * | 2020-04-28 | 2024-04-26 | 中国工商银行股份有限公司 | Method and device for establishing user consumption label system based on transaction data |
CN112884053A (en) * | 2021-02-28 | 2021-06-01 | 江苏匠算天诚信息科技有限公司 | Website classification method, system, equipment and medium based on image-text mixed characteristics |
CN112884053B (en) * | 2021-02-28 | 2022-04-15 | 江苏匠算天诚信息科技有限公司 | Website classification method, system, equipment and medium based on image-text mixed characteristics |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109815308B (en) | Method and device for determining intention recognition model and method and device for searching intention recognition | |
CN107705066B (en) | Information input method and electronic equipment during commodity warehousing | |
CN107133213B (en) | Method and system for automatically extracting text abstract based on algorithm | |
CN104636402B (en) | Business object classification, search and pushing method and system | |
CN106446148A (en) | Cluster-based text duplicate checking method | |
CN106156272A (en) | Information retrieval method based on multi-source semantic analysis | |
Sarawagi et al. | Open-domain quantity queries on web tables: annotation, response, and consensus models | |
CN103049433A (en) | Automatic question answering method, automatic question answering system and method for constructing question answering case base | |
CN101206674A (en) | Enhancement type related search system and method using commercial articles as medium | |
Zhang et al. | Continuous word embeddings for detecting local text reuses at the semantic level | |
CN109840325A (en) | Text semantic similarity measurement method based on mutual information | |
CN110399458A (en) | Text similarity calculation method based on latent semantic analysis and random projection | |
CN107656920A (en) | Patent-based skilled personnel recommendation method | |
CN104679784A (en) | O2B intelligent searching method and system | |
CN115115049A (en) | Neural network model training method, apparatus, device, medium, and program product | |
Meng et al. | Concept-concept association information integration and multi-model collaboration for multimedia semantic concept detection | |
Senthilkumar et al. | A Survey On Feature Selection Method For Product Review | |
CN111581378A (en) | Method and device for establishing user consumption label system based on transaction data | |
Sato et al. | Text classification and transfer learning based on character-level deep convolutional neural networks | |
CN114153965A (en) | Content and map combined public opinion event recommendation method, system and terminal | |
Yu et al. | Computer Image Content Retrieval considering K‐Means Clustering Algorithm | |
Pang | A personalized recommendation algorithm for semantic classification of new book recommendation services for university libraries | |
Prathyusha et al. | Normalization Methods for Multiple Sources of Data | |
CN114022233A (en) | Novel commodity recommendation method | |
Dastgheib et al. | Persian text classification enhancement by latent semantic space |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
EE01 | Entry into force of recordation of patent licensing contract | ||
Application publication date: 2019-11-01
Assignee: Fanyun software (Nanjing) Co.,Ltd.
Assignor: HUAIYIN INSTITUTE OF TECHNOLOGY
Contract record no.: X2023980052895
Denomination of invention: A text similarity calculation method based on latent semantic analysis and random projection
Granted publication date: 2023-05-26
License type: Common License
Record date: 2023-12-19