CN110399458A - Text similarity calculation method based on latent semantic analysis and random projection - Google Patents
Text similarity calculation method based on latent semantic analysis and random projection
- Publication number
- CN110399458A CN110399458A CN201910598004.7A CN201910598004A CN110399458A CN 110399458 A CN110399458 A CN 110399458A CN 201910598004 A CN201910598004 A CN 201910598004A CN 110399458 A CN110399458 A CN 110399458A
- Authority
- CN
- China
- Prior art keywords
- entry
- collection
- sim
- weight vectors
- index database
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text similarity calculation method based on latent semantic analysis and random projection, suitable for general unsupervised text clustering problems. The invention first converts the label text to be processed into a bag-of-words model and weights it with the TF-IDF algorithm to obtain a weight vector set. The weight vector set is then processed with the LSA algorithm to obtain an LSA index database, and with the random projection algorithm to obtain an RP index database. Finally, the corpus to be evaluated is given the same TF-IDF treatment, processed by the LSA and RP algorithms respectively, and compared against the contents of the two index databases to obtain the text similarity. The invention can perform effective similarity calculation on text content and recommend related content based on the texts with the highest similarity.
Description
Technical field
The invention belongs to the field of natural language processing, and in particular relates to a text similarity calculation method based on latent semantic analysis and random projection.
Background art
In traditional text recommendation algorithms, researchers typically compute similarity with a common vector space model or the like. By mining the semantic associations between texts, the invention combines latent semantic analysis and random projection to provide related systems with a reliable text similarity calculation method.
The existing research foundation of Zhu Quanyin et al. includes:
- Wanli Feng. Research of theme statement extraction for chinese literature based on lexical chain. International Journal of Multimedia and Ubiquitous Engineering, Vol.11, No.6 (2016), pp.379-388;
- Wanli Feng, Ying Li, Shangbing Gao, Yunyang Yan, Jianxun Xue. A novel flame edge detection algorithm via a novel active contour model. International Journal of Hybrid Information Technology, Vol.9, No.9 (2016), pp.275-282;
- Liu Jinling, Feng Wanli. A pattern matching method based on feature dependence relationships [J]. Microelectronics and Computer, 2011, 28(12): 167-170;
- Liu Jinling, Feng Wanli, Zhang Yahong. Initializing cluster centers and reconstructing scaling functions for text clustering [J]. Application Research of Computers, 2011, 28(11): 4115-4117;
- Liu Jinling, Feng Wanli, Zhang Yahong. A Chinese short message text clustering method based on rescaling [J]. Computer Engineering and Applications, 2012, 48(21): 146-150;
- Zhu Quanyin, Pan Lu, Liu Wenru, et al. A web science and technology news classification and extraction algorithm [J]. Journal of Huaiyin Institute of Technology, 2015, 24(5): 18-24;
- Li Xiang, Zhu Quanyin. Collaborative filtering recommendation with joint clustering and rating-matrix sharing [J]. Journal of Frontiers of Computer Science and Technology, 2014, 8(6): 751-759;
- Quanyin Zhu, Sunqun Cao. A Novel Classifier-independent Feature Selection Algorithm for Imbalanced Datasets. 2009, pp.77-82;
- Quanyin Zhu, Yunyang Yan, Jin Ding, Jin Qian. The Case Study for Price Extracting of Mobile Phone Sell Online. 2011, pp.282-285;
- Quanyin Zhu, Suqun Cao, Pei Zhou, Yunyang Yan, Hong Zhou. Integrated Price Forecast based on Dichotomy Backfilling and Disturbance Factor Algorithm. International Review on Computers and Software, 2011, Vol.6(6): 1089-1093.
Patents applied for, published, or granted by Zhu Quanyin, Feng Wanli, et al. include:
- Feng Wanli, Shao Heshuai, Zhuang Jun. An intelligent refrigerated-truck state monitoring wireless network terminal device: CN203616634U [P]. 2014;
- Zhu Quanyin, Hu Rongjing, He Suqun, Zhou Pei, et al. A commodity price forecasting method based on linear interpolation and adaptive windows. Chinese patent: ZL 2011 1 0423015.5, 2015.07.01;
- Zhu Quanyin, Cao Suqun, Yan Yunyang, Hu Rong, et al. A commodity price forecasting method based on binary data repair and disturbance factors. Chinese patent: ZL 2011 1 0422274.6, 2013.01.02;
- Li Xiang, Zhu Quanyin, Hu Ronglin, Zhou Hong. A cold-chain logistics stowage intelligent recommendation method based on spectral clustering. Chinese Patent Publication No.: CN105654267A, 2016.06.08;
- Zhu Quanyin, Xin Cheng, Li Xiang, Xu Kang, et al. A network behavior habit clustering method based on K-means and LDA bidirectional verification. Chinese Patent Publication No.: CN106202480A, 2016.12.07;
- Zhu Quanyin, Tang Haibo, Yan Yunyang, Li Xiang, Hu Ronglin, Qu Xuexin, Shao Wujie, Xu Kang, Zhao Yang, Qian Kai, Gao Yang. A user document reading interest analysis method based on deep learning. Chinese Patent Publication No.: CN108280114A, 2018.07.13;
- Zhu Quanyin, Hu Ronglin, Feng Wanli, Zhou Hong, et al. A knowledge-graph-based expert portfolio recommendation method. Chinese Patent Publication No.: CN109062961A, 2018.12.21.
Latent semantic analysis:
Latent semantic analysis (LSA) is an algebraic model for information retrieval, and a computational theory and method for knowledge acquisition and representation. It analyzes a large text collection with statistical methods to extract the latent semantic structure between words, and represents words and texts in terms of that latent structure. This eliminates correlations between words and simplifies the text vectors, achieving dimensionality reduction.
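The dimensionality reduction underlying LSA can be sketched with a truncated singular value decomposition. The following is a minimal numpy illustration (the toy matrix and rank are hypothetical, and this is not the patent's actual implementation):

```python
import numpy as np

# Toy term-document weight matrix (rows = terms, columns = documents).
X = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
])

def lsa_project(X, k):
    """Project documents into a k-dimensional latent semantic space
    by keeping only the k largest singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Columns of the result are the documents' latent-space coordinates.
    return np.diag(s[:k]) @ Vt[:k, :]

docs_2d = lsa_project(X, k=2)
print(docs_2d.shape)  # each of the 4 documents is now a 2-dimensional vector
```

The truncation discards small singular directions, which is what removes word-level noise and correlation before similarities are computed.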
Random projection:
Random projection is a simple and effective dimensionality reduction method that differs from representation-based feature extraction methods: the projection matrix is generated completely independently of the raw sample data, so it does not cause serious distortion of the data. The data after dimensionality reduction still retain the important feature information of the original high-dimensional data, and no matrix decomposition is needed to obtain the transformation matrix, which greatly improves the ability to process data in real time.
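A minimal sketch of the random projection idea described above, assuming a Gaussian random matrix (the dimensions and seed are illustrative; the patent does not specify the projection distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_projection(X, k):
    """Reduce the d-dimensional rows of X to k dimensions with a Gaussian
    random matrix. Unlike SVD-based reduction, no decomposition or
    training pass over the data is required."""
    d = X.shape[1]
    R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(d, k))
    return X @ R

X = rng.normal(size=(100, 1000))   # 100 samples in 1000 dimensions
Y = random_projection(X, k=50)     # same 100 samples in 50 dimensions
print(Y.shape)
```

By the Johnson-Lindenstrauss lemma, pairwise distances are approximately preserved with high probability, which is why such a projection can feed a similarity index without a costly decomposition.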
For the traditional text similarity comparison problem, existing papers mainly compare similarity through simple shared words, but this approach performs poorly when comparing documents that are related through semantic topics.
Summary of the invention
Object of the invention: in view of the above problems, the present invention provides a text similarity calculation method based on latent semantic analysis and random projection. It removes the limitations of traditional calculation methods by applying several forms of dimensionality reduction to the vector space, effectively improving the accuracy and reliability of the results.
Technical solution: the present invention proposes a text similarity calculation method based on latent semantic analysis and random projection, comprising the following steps:
(1) vectorize the labels to obtain the entry label vector set V1, and apply the TF-IDF algorithm to it to obtain the label weight vector set V2;
(2) apply the LSA algorithm to V2 to obtain the LSA model M1 and index database I1;
(3) apply the random projection algorithm to V2 to obtain the RP model M2 and index database I2;
(4) process the corpus to be evaluated with TF-IDF, then with LSA and RP, and obtain the final recommendation set.
Further, the specific steps for obtaining the label weight vector set V2 in step (1) are as follows:
(1.1) define D1 as the encyclopedia entry data set, D1 = {id1, title1, paragraph1, image1, url1, tag1}, where id1, title1, paragraph1, image1, url1 and tag1 respectively denote the number, title, paragraph, image link, web page link and entry label;
(1.2) apply the split method to tag1 to obtain T1 = {w_i1, w_i2, …, w_in}, where w_iA is the A-th entry tag set of the encyclopedia entry data set, A ∈ [1, n];
(1.3) apply the Dictionary method to T1 to obtain the dictionary Dict1;
(1.4) save the dictionary Dict1 locally;
(1.5) apply the Doc2Bow method to T1 to obtain the entry label vector set V1 = {v_i1, v_i2, …, v_in}, where v_iA is the A-th entry label vector of V1, A ∈ [1, n];
(1.6) apply the TF-IDF method to V1 to obtain the entry label weight vector set V2 = {v_j1, v_j2, …, v_jn}, where v_jA is the A-th entry label weight vector of V2, A ∈ [1, n];
(1.7) save the label weight vector set V2 locally.
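Steps (1.2)-(1.6) can be sketched in plain Python. This is an illustrative stand-in, not the patent's implementation: the sample tags are invented, and the tf * log(N/df) idf variant is an assumption (libraries differ in smoothing and normalisation):

```python
import math
from collections import Counter

def build_dictionary(tag_sets):
    """Assign an integer id to every distinct tag (the role of step 1.3)."""
    vocab = sorted({t for tags in tag_sets for t in tags})
    return {t: i for i, t in enumerate(vocab)}

def doc2bow(tags, dictionary):
    """Bag-of-words vector: sorted (tag_id, count) pairs (the role of step 1.5)."""
    counts = Counter(tags)
    return sorted((dictionary[t], c) for t, c in counts.items())

def tfidf(bow, doc_freq, n_docs):
    """Weight each (id, count) pair as tf * log(N / df) (the role of step 1.6).
    The exact idf formula here is an assumption."""
    return [(i, c * math.log(n_docs / doc_freq[i])) for i, c in bow]

tag_sets = [["history", "war"], ["history", "art"], ["art", "music"]]  # invented sample
dictionary = build_dictionary(tag_sets)
bows = [doc2bow(tags, dictionary) for tags in tag_sets]            # plays the role of V1
doc_freq = Counter(i for bow in bows for i, _ in bow)
V2 = [tfidf(bow, doc_freq, len(tag_sets)) for bow in bows]         # plays the role of V2
print(V2[0])
```

Tags shared by many entries ("history", "art") receive low weights, while rare tags ("war", "music") dominate the weight vectors, which is exactly what the later similarity comparison relies on.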
Further, the specific steps for applying the LSA algorithm to V2 in step (2) to obtain the LSA model M1 and index database I1 are as follows:
(2.1) load the label weight vector set V3 from local storage, V3 = {v_k1, v_k2, …, v_kn}, where v_kB is the B-th weight vector of the entry label weight vector set V3, B ∈ [1, n];
(2.2) load the dictionary Dict2 from local storage;
(2.3) define id2word = Dict2 and the number of topics num_topics = 300;
(2.4) train V3 with the LSA method, passing in the parameters id2word and num_topics, to obtain the model M1;
(2.5) process V3 through the model M1 to obtain the packaged corpus C1;
(2.6) build an index over C1 to obtain the index database I1;
(2.7) save the model M1 and index database I1.
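The index databases of steps (2.6) and (3.5) exist so that a query vector can be compared against every stored vector by cosine similarity. A minimal stand-in (the vectors are invented; a production system would presumably use a library similarity index) pre-normalises the rows so each query is a single matrix product:

```python
import numpy as np

def build_index(vectors):
    """Store L2-normalised rows, so one matrix product later yields the
    cosine similarity against every indexed vector (role of I1/I2)."""
    V = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return V / np.where(norms == 0.0, 1.0, norms)

def query(index, v):
    """Cosine similarity of query vector v against every indexed vector."""
    v = np.asarray(v, dtype=float)
    n = np.linalg.norm(v)
    return index @ (v / n if n else v)

index = build_index([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # invented sample
sims = query(index, [1.0, 0.0])
print(sims)
```

The first indexed vector is identical to the query (similarity 1), the second is orthogonal (similarity 0), and the third lies at 45 degrees (similarity about 0.707).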
Further, the specific steps for applying the random projection algorithm to V2 in step (3) to obtain the RP model M2 and index database I2 are as follows:
(3.1) load the label weight vector set V4 from local storage, V4 = {v_l1, v_l2, …, v_ln}, where v_lC is the C-th weight vector of the entry label weight vector set V4, C ∈ [1, n];
(3.2) define the number of topics num_topics = 500;
(3.3) train V4 with the RP method, passing in the parameter num_topics, to obtain the model M2;
(3.4) process V4 through the model M2 to obtain the packaged corpus C2;
(3.5) build an index over C2 to obtain the index database I2;
(3.6) save the model M2 and index database I2.
Further, the specific steps for obtaining the final recommendation set in step (4) are as follows:
(4.1) define D2 as the encyclopedia entry test set, D2 = {id2, title2, paragraph2, image2, url2, tag2}, where id2, title2, paragraph2, image2, url2 and tag2 respectively denote the number, title, paragraph, image link, web page link and entry label;
(4.2) with title2 as input, apply the split method to tag2 to obtain T2 = {w_j1, w_j2, …, w_jn}, where w_jD is the D-th entry tag set of the encyclopedia entry test set, D ∈ [1, n];
(4.3) apply the Doc2Bow method to T2 to obtain the entry label vector set V5 = {v_m1, v_m2, …, v_mn}, where v_mE is the E-th entry label vector of V5, E ∈ [1, n];
(4.4) apply the TF-IDF method to V5 to obtain the entry label weight vector set V6 = {v_o1, v_o2, …, v_on}, where v_oF is the F-th entry label weight vector of V6, F ∈ [1, n];
(4.5) define the variable k = 1 as the loop variable for traversing V6;
(4.6) define the sets R1, R2 and R3, R1 = {sim_i1, sim_i2, …, sim_in}, R2 = {sim_j1, sim_j2, …, sim_jn}, and R3 is the empty set; sim_iG and sim_jG denote the G-th similarity set in R1 and R2 respectively and are initially empty, G ∈ [1, n];
(4.7) import the LSA model M3 and the random projection model M4, and import the LSA index database I3 and the random projection index database I4;
(4.8) if k ≤ n, go to step (4.9); otherwise go to step (4.14);
(4.9) package v_ok with the LSA method to obtain vec1_k, and package v_ok with the random projection method to obtain vec2_k;
(4.10) query the index database I3 with vec1_k, compute the cosine similarity between vec1_k and each element of I3, and store the similarity set in sim_ik; query the index database I4 with vec2_k, compute the cosine similarity between vec2_k and each element of I4, and store the similarity set in sim_jk;
(4.11) add the corresponding elements of sim_ik and sim_jk and average them to obtain sim_lk;
(4.12) insert sim_lk into R3;
(4.13) set k = k + 1 and go to step (4.8);
(4.14) take the 8 elements with the highest similarity from each set in R3 and store them in the result set R4; each element of R4 is a recommendation set.
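Steps (4.11)-(4.14) — averaging the LSA and RP similarity sets and keeping the most similar entries — can be sketched as follows (the similarity values are invented, and top_n is set to 2 here rather than the patent's 8, purely to keep the example small):

```python
def fuse_and_recommend(sim_lsa, sim_rp, top_n=8):
    """Average the two similarity lists element-wise (step 4.11) and
    return the indices of the top_n most similar entries (step 4.14)."""
    fused = [(a + b) / 2 for a, b in zip(sim_lsa, sim_rp)]
    order = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return order[:top_n]

sim_lsa = [0.9, 0.1, 0.5, 0.7]   # invented cosine similarities from the LSA index
sim_rp  = [0.8, 0.3, 0.6, 0.2]   # invented cosine similarities from the RP index
print(fuse_and_recommend(sim_lsa, sim_rp, top_n=2))  # → [0, 2]
```

Note that entry 3 scores well under LSA alone (0.7) but poorly under RP (0.2), so the averaging demotes it; combining the two projections is exactly how the method guards against the weaknesses of either reduction on its own.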
By adopting the above technical scheme, the present invention has the following beneficial effects:
The present invention uses latent semantic analysis and random projection to calculate text similarity on entries obtained from Baidu Baike. The method removes the limitations of traditional calculation methods by applying several forms of dimensionality reduction to the vector space, effectively improving the accuracy and reliability of the results.
Detailed description of the invention
Fig. 1 is the overall flowchart of the invention;
Fig. 2 is the detailed flowchart of the TF-IDF algorithm in Fig. 1;
Fig. 3 is the detailed flowchart of the LSA algorithm in Fig. 1;
Fig. 4 is the detailed flowchart of the random projection algorithm in Fig. 1;
Fig. 5 is the detailed flowchart of the similarity recommendation function in Fig. 1.
Specific embodiment
The present invention is further elucidated below in conjunction with specific embodiments. It should be understood that these embodiments are only intended to illustrate the present invention, not to limit its scope. After reading the present invention, modifications by those skilled in the art to the various equivalent forms of the invention all fall within the scope defined by the appended claims.
As shown in Figs. 1-5, a text similarity calculation method based on latent semantic analysis and random projection according to the present invention comprises the following steps:
Step 1: vectorize the labels to obtain the entry label vector set V1, and apply the TF-IDF algorithm to it to obtain the label weight vector set V2. The specific method is:
Step 1.1: define D1 as the encyclopedia entry data set, D1 = {id1, title1, paragraph1, image1, url1, tag1}, where id1, title1, paragraph1, image1, url1 and tag1 respectively denote the number, title, paragraph, image link, web page link and entry label;
Step 1.2: apply the split method to tag1 to obtain T1 = {w_i1, w_i2, …, w_in}, where w_iA is the A-th entry tag set of the encyclopedia entry data set, A ∈ [1, n];
Step 1.3: apply the Dictionary method to T1 to obtain the dictionary Dict1;
Step 1.4: save the dictionary Dict1 locally;
Step 1.5: apply the Doc2Bow method to T1 to obtain the entry label vector set V1 = {v_i1, v_i2, …, v_in}, where v_iA is the A-th entry label vector of V1, A ∈ [1, n];
Step 1.6: apply the TF-IDF method to V1 to obtain the entry label weight vector set V2 = {v_j1, v_j2, …, v_jn}, where v_jA is the A-th entry label weight vector of V2, A ∈ [1, n];
Step 1.7: save the label weight vector set V2 locally.
Step 2: apply the LSA algorithm to V2 to obtain the LSA model M1 and index database I1. The specific method is:
Step 2.1: load the label weight vector set V3 from local storage, V3 = {v_k1, v_k2, …, v_kn}, where v_kB is the B-th weight vector of the entry label weight vector set V3, B ∈ [1, n];
Step 2.2: load the dictionary Dict2 from local storage;
Step 2.3: define id2word = Dict2 and the number of topics num_topics = 300;
Step 2.4: train V3 with the LSA method, passing in the parameters id2word and num_topics, to obtain the model M1;
Step 2.5: process V3 through the model M1 to obtain the packaged corpus C1;
Step 2.6: build an index over C1 to obtain the index database I1;
Step 2.7: save the model M1 and index database I1.
Step 3: apply the random projection algorithm to V2 to obtain the RP model M2 and index database I2. The specific method is:
Step 3.1: load the label weight vector set V4 from local storage, V4 = {v_l1, v_l2, …, v_ln}, where v_lC is the C-th weight vector of the entry label weight vector set V4, C ∈ [1, n];
Step 3.2: define the number of topics num_topics = 500;
Step 3.3: train V4 with the RP method, passing in the parameter num_topics, to obtain the model M2;
Step 3.4: process V4 through the model M2 to obtain the packaged corpus C2;
Step 3.5: build an index over C2 to obtain the index database I2;
Step 3.6: save the model M2 and index database I2.
Step 4: process the corpus to be evaluated with TF-IDF, then with LSA and RP, and obtain the final recommendation set. The specific method is:
Step 4.1: define D2 as the encyclopedia entry test set, D2 = {id2, title2, paragraph2, image2, url2, tag2}, where id2, title2, paragraph2, image2, url2 and tag2 respectively denote the number, title, paragraph, image link, web page link and entry label;
Step 4.2: with title2 as input, apply the split method to tag2 to obtain T2 = {w_j1, w_j2, …, w_jn}, where w_jD is the D-th entry tag set of the encyclopedia entry test set, D ∈ [1, n];
Step 4.3: apply the Doc2Bow method to T2 to obtain the entry label vector set V5 = {v_m1, v_m2, …, v_mn}, where v_mE is the E-th entry label vector of V5, E ∈ [1, n];
Step 4.4: apply the TF-IDF method to V5 to obtain the entry label weight vector set V6 = {v_o1, v_o2, …, v_on}, where v_oF is the F-th entry label weight vector of V6, F ∈ [1, n];
Step 4.5: define the variable k = 1 as the loop variable for traversing V6;
Step 4.6: define the sets R1, R2 and R3, R1 = {sim_i1, sim_i2, …, sim_in}, R2 = {sim_j1, sim_j2, …, sim_jn}, and R3 is the empty set; sim_iG and sim_jG denote the G-th similarity set in R1 and R2 respectively and are initially empty, G ∈ [1, n];
Step 4.7: import the LSA model M3 and the random projection model M4, and import the LSA index database I3 and the random projection index database I4;
Step 4.8: if k ≤ n, go to step 4.9; otherwise go to step 4.14;
Step 4.9: package v_ok with the LSA method to obtain vec1_k, and package v_ok with the random projection method to obtain vec2_k;
Step 4.10: query the index database I3 with vec1_k, compute the cosine similarity between vec1_k and each element of I3, and store the similarity set in sim_ik; query the index database I4 with vec2_k, compute the cosine similarity between vec2_k and each element of I4, and store the similarity set in sim_jk;
Step 4.11: add the corresponding elements of sim_ik and sim_jk and average them to obtain sim_lk;
Step 4.12: insert sim_lk into R3;
Step 4.13: set k = k + 1 and go to step 4.8;
Step 4.14: take the 8 elements with the highest similarity from each set in R3 and store them in the result set R4; each element of R4 is a recommendation set.
The variable tables of the present invention are as follows:

Table 1. Global variables

Variable | Definition |
---|---|
tag | Entry label |
V1 | Entry label vector set |
V2 | Label weight vector set |
M1 | LSA model |
I1 | LSA index database |
M2 | RP model |
I2 | RP index database |

Table 2. Step 1 variables

Variable | Definition |
---|---|
D1 | Encyclopedia entry data set |
id1 | Entry data set number |
title1 | Entry data set title |
paragraph1 | Entry data set paragraph |
image1 | Entry data set image link |
url1 | Entry data set web page link |
tag1 | Entry label |
T1 | The set of entry tag sets |
w_iA | The A-th entry tag set in T1 |
Dict1 | Dictionary |
v_iA | The A-th entry label vector of V1 |
v_jA | The A-th entry label weight vector of V2 |

Table 3. Step 2 variables

Variable | Definition |
---|---|
num_topics | Topic-number parameter of the LSA method |
C1 | Corpus obtained by LSA packaging of V3 |

Table 4. Step 3 variables

Variable | Definition |
---|---|
V4 | Label weight vector set |
v_lC | The C-th weight vector of V4 |
num_topics | Topic-number parameter of the RP method |
C2 | Corpus obtained by RP packaging of V4 |

Table 5. Step 4 variables
To illustrate the effectiveness of the proposed method, data processing was performed on 46,935 history entries: the entry labels were projected using latent semantic analysis and random projection, and the results were calculated using cosine similarity. The method effectively overcomes the limitations of traditional text similarity calculation and optimizes the text similarity calculation by combining the two algorithms.
The present invention can be combined with a computer system to complete the text similarity calculation and output the recommendation set.
The invention proposes a method that combines latent semantic analysis with random projection to project the text and obtain the best recommendation results.
The above description is only an embodiment of the present invention and is not intended to restrict the invention. All equivalent replacements made within the principle of the present invention shall be included in its protection scope. Content not elaborated in the present invention belongs to the prior art well known to those skilled in the art.
Claims (5)
1. A text similarity calculation method based on latent semantic analysis and random projection, characterized by comprising the following steps:
(1) vectorize the labels to obtain the entry label vector set V1, and apply the TF-IDF algorithm to it to obtain the label weight vector set V2;
(2) apply the LSA algorithm to V2 to obtain the LSA model M1 and index database I1;
(3) apply the random projection algorithm to V2 to obtain the RP model M2 and index database I2;
(4) process the corpus to be evaluated with TF-IDF, then with LSA and RP, and obtain the final recommendation set.
2. The text similarity calculation method based on latent semantic analysis and random projection according to claim 1, characterized in that the specific steps for obtaining the label weight vector set V2 in step (1) are as follows:
(1.1) define D1 as the encyclopedia entry data set, D1 = {id1, title1, paragraph1, image1, url1, tag1}, where id1, title1, paragraph1, image1, url1 and tag1 respectively denote the number, title, paragraph, image link, web page link and entry label;
(1.2) apply the split method to tag1 to obtain T1 = {w_i1, w_i2, …, w_in}, where w_iA is the A-th entry tag set of the encyclopedia entry data set, A ∈ [1, n];
(1.3) apply the Dictionary method to T1 to obtain the dictionary Dict1;
(1.4) save the dictionary Dict1 locally;
(1.5) apply the Doc2Bow method to T1 to obtain the entry label vector set V1 = {v_i1, v_i2, …, v_in}, where v_iA is the A-th entry label vector of V1, A ∈ [1, n];
(1.6) apply the TF-IDF method to V1 to obtain the entry label weight vector set V2 = {v_j1, v_j2, …, v_jn}, where v_jA is the A-th entry label weight vector of V2, A ∈ [1, n];
(1.7) save the label weight vector set V2 locally.
3. The text similarity calculation method based on latent semantic analysis and random projection according to claim 1, characterized in that the specific steps for applying the LSA algorithm to V2 in step (2) to obtain the LSA model M1 and index database I1 are as follows:
(2.1) load the label weight vector set V3 from local storage, V3 = {v_k1, v_k2, …, v_kn}, where v_kB is the B-th weight vector of the entry label weight vector set V3, B ∈ [1, n];
(2.2) load the dictionary Dict2 from local storage;
(2.3) define id2word = Dict2 and the number of topics num_topics = 300;
(2.4) train V3 with the LSA method, passing in the parameters id2word and num_topics, to obtain the model M1;
(2.5) process V3 through the model M1 to obtain the packaged corpus C1;
(2.6) build an index over C1 to obtain the index database I1;
(2.7) save the model M1 and index database I1.
4. The text similarity calculation method based on latent semantic analysis and random projection according to claim 1, characterized in that the specific steps for applying the random projection algorithm to V2 in step (3) to obtain the RP model M2 and index database I2 are as follows:
(3.1) load the label weight vector set V4 from local storage, V4 = {v_l1, v_l2, …, v_ln}, where v_lC is the C-th weight vector of the entry label weight vector set V4, C ∈ [1, n];
(3.2) define the number of topics num_topics = 500;
(3.3) train V4 with the RP method, passing in the parameter num_topics, to obtain the model M2;
(3.4) process V4 through the model M2 to obtain the packaged corpus C2;
(3.5) build an index over C2 to obtain the index database I2;
(3.6) save the model M2 and index database I2.
5. a kind of Text similarity computing method based on latent semantic analysis and accidental projection according to claim 1,
It is characterized in that the final recommendation set in step (4) is obtained as follows:
(4.1) Define D2 as the encyclopedia entry test set, D2 = {id2, title2, paragraph2, image2, url2, tag2}, where id2, title2, paragraph2, image2, url2 and tag2 denote the number, title, paragraph, image link, web page link and entry tags, respectively;
(4.2) With title2 as input, apply the split method to tag2 to obtain T2 = {w_j1, w_j2, …, w_jn}, where w_jD is the D-th entry tag set of the encyclopedia entry data set, D ∈ [1, n];
(4.3) Apply the Doc2Bow method to T2 to obtain the entry tag vector set V5 = {v_m1, v_m2, …, v_mn}, where v_mE is the E-th entry tag vector of V5, E ∈ [1, n];
(4.4) Apply the TF-IDF method to V5 to obtain the entry tag weight vector set V6 = {v_o1, v_o2, …, v_on}, where v_oF is the F-th entry tag weight vector of V6, F ∈ [1, n];
(4.5) Define the cyclic variable k = 1 for traversing V6;
(4.6) Define sets R1, R2 and R3, with R1 = {sim_i1, sim_i2, …, sim_in}, R2 = {sim_j1, sim_j2, …, sim_jn} and R3 the empty set, where sim_iG and sim_jG denote the G-th similarity sets in R1 and R2 and are initially empty, G ∈ [1, n];
(4.7) Import the LSA model M3 and the random projection model M4, and import the LSA index database I3 and the random projection index database I4;
(4.8) If k ≤ n, go to step (4.9); otherwise go to step (4.14);
(4.9) Package v_ok with the LSA method to obtain vec1_k, and package v_ok with the random projection method to obtain vec2_k;
(4.10) Query index database I3 with vec1_k and compute by cosine similarity the similarity set between vec1_k and the elements of I3, storing it in sim_ik; query index database I4 with vec2_k and compute by cosine similarity the similarity set between vec2_k and the elements of I4, storing it in sim_jk;
(4.11) Add the corresponding elements of sim_ik and sim_jk and average them to obtain sim_lk;
(4.12) Insert sim_lk into R3;
(4.13) Set k = k + 1 and go to step (4.8);
(4.14) Take the 8 highest-similarity elements of each set in R3 to form a set stored in result set R4; each element of R4 is a recommendation set.
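Steps (4.2)–(4.14) above can be sketched as follows. This is a minimal illustration, not the patented implementation: the toy tag corpus, the 2-dimensional projection spaces, and the top-k cutoff (2 here instead of the patent's 8) are assumptions, and plain NumPy stands in for the trained LSA model M3, random projection model M4, and index databases I3/I4.

```python
# Minimal sketch of the dual LSA + random-projection similarity pipeline.
# Corpus, dimensions and cutoff are illustrative assumptions.
import numpy as np

def doc2bow_tfidf(docs):
    """(4.3)-(4.4): bag-of-words counts weighted by TF-IDF."""
    vocab = sorted({w for d in docs for w in d})
    idx = {w: i for i, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for r, d in enumerate(docs):
        for w in d:
            tf[r, idx[w]] += 1
    df = (tf > 0).sum(axis=0)
    return tf * np.log(len(docs) / df)

def cosine_sims(q, index):
    """(4.10): cosine similarity of query q against every row of index."""
    den = np.linalg.norm(index, axis=1) * (np.linalg.norm(q) + 1e-12) + 1e-12
    return (index @ q) / den

docs = [t.split() for t in [
    "machine learning algorithm",
    "natural language text processing",
    "text similarity latent semantic analysis",
    "random projection dimensionality reduction",
]]
V6 = doc2bow_tfidf(docs)               # entry tag weight vector set

# LSA "model": truncated SVD of the weight matrix (rank 2 here)
U, S, Vt = np.linalg.svd(V6, full_matrices=False)
lsa = lambda v: v @ Vt[:2].T           # project into 2-d latent space

# Random-projection "model": a fixed Gaussian projection matrix
rng = np.random.default_rng(0)
P = rng.normal(size=(V6.shape[1], 2))
rp = lambda v: v @ P

I3 = np.array([lsa(v) for v in V6])    # LSA index database
I4 = np.array([rp(v) for v in V6])     # random-projection index database

# (4.8)-(4.13): query both indexes, average the two similarity sets
R3 = [(cosine_sims(lsa(v), I3) + cosine_sims(rp(v), I4)) / 2 for v in V6]

# (4.14): top-k most similar entries per query form the recommendation set
R4 = [np.argsort(-s)[:2].tolist() for s in R3]
print(R4)
```

Averaging the two similarity sets is the fusion step of (4.11): LSA captures latent topical structure while random projection preserves pairwise distances cheaply, so the mean of the two cosine scores hedges against the weaknesses of either projection alone.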
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910598004.7A CN110399458B (en) | 2019-07-04 | 2019-07-04 | Text similarity calculation method based on latent semantic analysis and random projection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110399458A true CN110399458A (en) | 2019-11-01 |
CN110399458B CN110399458B (en) | 2023-05-26 |
Family
ID=68323669
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910598004.7A Active CN110399458B (en) | 2019-07-04 | 2019-07-04 | Text similarity calculation method based on latent semantic analysis and random projection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110399458B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107273474A (en) * | 2017-06-08 | 2017-10-20 | 成都数联铭品科技有限公司 | Automatic summarization method and system based on latent semantic analysis |
CN109165382A (en) * | 2018-08-03 | 2019-01-08 | 南京工业大学 | Similar defect report recommendation method combining weighted word vectors and latent semantic analysis |
Non-Patent Citations (2)
Title |
---|
Jessica Lin et al.: "Dimensionality Reduction by Random Projection and Latent Semantic Indexing", CiteSeer *
Wang Jin: "Program code similarity detection based on latent semantic analysis", Technology Innovation and Application *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111581378A (en) * | 2020-04-28 | 2020-08-25 | 中国工商银行股份有限公司 | Method and device for establishing user consumption label system based on transaction data |
CN111581378B (en) * | 2020-04-28 | 2024-04-26 | 中国工商银行股份有限公司 | Method and device for establishing user consumption label system based on transaction data |
CN112884053A (en) * | 2021-02-28 | 2021-06-01 | 江苏匠算天诚信息科技有限公司 | Website classification method, system, equipment and medium based on image-text mixed characteristics |
CN112884053B (en) * | 2021-02-28 | 2022-04-15 | 江苏匠算天诚信息科技有限公司 | Website classification method, system, equipment and medium based on image-text mixed characteristics |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109815308B (en) | Method and device for determining intention recognition model and method and device for searching intention recognition | |
CN107705066B (en) | Information input method and electronic equipment during commodity warehousing | |
CN107133213B (en) | Method and system for automatically extracting text abstract based on algorithm | |
CN104636402B (en) | Business object classification, search and pushing method and system | |
CN106446148A (en) | Cluster-based text duplicate checking method | |
CN106156272A (en) | Information retrieval method based on multi-source semantic analysis | |
Sarawagi et al. | Open-domain quantity queries on web tables: annotation, response, and consensus models | |
CN103049433A (en) | Automatic question answering method, automatic question answering system and method for constructing question answering case base | |
CN101206674A (en) | Enhancement type related search system and method using commercial articles as medium | |
Zhang et al. | Continuous word embeddings for detecting local text reuses at the semantic level | |
CN109840325A (en) | Text semantic similarity measurement method based on mutual information | |
CN110399458A (en) | Text similarity calculation method based on latent semantic analysis and random projection | |
CN107656920A (en) | Patent-based skilled personnel recommendation method | |
CN104679784A (en) | O2B intelligent searching method and system | |
CN115115049A (en) | Neural network model training method, apparatus, device, medium, and program product | |
Meng et al. | Concept-concept association information integration and multi-model collaboration for multimedia semantic concept detection | |
Senthilkumar et al. | A Survey On Feature Selection Method For Product Review | |
CN111581378A (en) | Method and device for establishing user consumption label system based on transaction data | |
Sato et al. | Text classification and transfer learning based on character-level deep convolutional neural networks | |
CN114153965A (en) | Content and map combined public opinion event recommendation method, system and terminal | |
Yu et al. | Computer Image Content Retrieval considering K‐Means Clustering Algorithm | |
Pang | A personalized recommendation algorithm for semantic classification of new book recommendation services for university libraries | |
Prathyusha et al. | Normalization Methods for Multiple Sources of Data | |
CN114022233A (en) | Novel commodity recommendation method | |
Dastgheib et al. | Persian text classification enhancement by latent semantic space |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
EE01 | Entry into force of recordation of patent licensing contract | ||
Application publication date: 2019-11-01
Assignee: Fanyun software (Nanjing) Co.,Ltd.
Assignor: HUAIYIN INSTITUTE OF TECHNOLOGY
Contract record no.: X2023980052895
Denomination of invention: A text similarity calculation method based on latent semantic analysis and random projection
Granted publication date: 2023-05-26
License type: Common License
Record date: 2023-12-19