CN106599037A - Recommendation method based on label semantic normalization

Info

Publication number: CN106599037A
Application number: CN201610972494.9A
Authority: CN
Inventors: 叶婷; 曹杰; 姚瑞波; 崔莹; 伍之昂; 申冬琴
Original assignee: Nanjing University of Finance and Economics; Focus Technology Co Ltd
Current assignee: Nanjing University of Finance and Economics; Focus Technology Co Ltd
Priority date: 2016-11-04
Filing date: 2016-11-04
Publication date: 2017-04-26
Anticipated expiration: 2036-11-04
Also published as: CN106599037B

Abstract

The invention relates to a recommendation method based on label semantic normalization. The method comprises: pre-processing user-defined labels to obtain pre-processed labels, and calculating a label semantic similarity; obtaining a label-resource matrix from the pre-processed labels, and calculating a label-resource co-occurrence similarity from it; calculating a linear fusion similarity from the label semantic similarity and the label-resource co-occurrence similarity, performing a clustering operation on the linear fusion similarity to obtain the user's semantically normalized label data, and performing collaborative filtering recommendation using that data. For the large number of redundant or semantically inaccurate labels in a previous labeling system, label normalization makes the semantics of the normalized labels clearer; in a recommendation system, recommendation quality, namely accuracy and efficiency, can be improved and recommendation time can be shortened.

Description

A recommendation method based on label semantic normalization

Technical field

The present invention relates to the field of personalized recommendation methods, and in particular to a personalized recommendation technique that builds a user interest model based on labels.

Background art

The rapid development of the Internet has greatly changed the ways in which people search for information and share knowledge, as well as the way people communicate and interact with one another. As the volume of information keeps growing, users must spend considerable time and effort finding the resources they need in a mass of information, a phenomenon known as information overload. Recommendation systems arose in response. A recommendation system is a technology that, according to a user's preferences and historical data, intelligently filters a small number of resources reflecting those preferences out of massive information resources and recommends them to the user, thereby better addressing information overload. However, recommendation systems have drawbacks of their own, chiefly data sparsity, the cold-start problem, and stability problems, which make it difficult for the system to produce correct recommendations from a user's historical behavior.

At present, designing personalized recommendation systems around the characteristics of labels has taken recommendation system research into a new stage. Because labels in a labeling system are freely chosen and defined by users, they not only describe the feature attributes of a resource but also reflect the user's interests and cognitive preferences. However, as labels are applied more and more widely, their drawbacks have become increasingly apparent: because labels are expressed autonomously, their concepts are fuzzy, and different users understand them differently, so the semantics they express are inaccurate; moreover, users may not be rigorous when entering labels, which produces a large number of noise labels.

Therefore, the present invention provides a recommendation method based on label semantic normalization. It addresses the drawbacks of recommendation systems themselves, namely data sparsity, the cold-start problem, and stability problems, which make it difficult for a recommendation system to produce correct recommendations from a user's historical behavior. It also addresses Internet information overload and the fact that labels in social labeling systems are numerous, fragmented, and not readily associated semantically.

Summary of the invention

The object of the present invention is to address the drawbacks of recommendation systems themselves, namely data sparsity, the cold-start problem, and stability problems, which make it difficult for a recommendation system to produce correct recommendations from a user's historical behavior, and also to address Internet information overload and the fact that labels in social labeling systems are numerous, fragmented, and not readily associated semantically.

To achieve the above object, the present invention provides a recommendation method based on label semantic normalization. The method exploits the semantic relationships between labels, replaces the arbitrary labels defined by users with semantically normalized labels, builds a user interest model, and then generates recommendations for the user by collaborative filtering, thereby improving recommendation quality.

The method comprises the following steps: pre-processing user-defined labels to obtain pre-processed labels; based on the pre-processed labels, obtaining the word vectors of all words in the labels using a Word2Vec-trained model, and calculating the label semantic similarity from the word vectors; obtaining a label-resource matrix from the pre-processed labels, and calculating the label-resource co-occurrence similarity from the label-resource matrix; calculating a linear fusion similarity from the label semantic similarity and the label-resource co-occurrence similarity, performing a clustering operation on the linear fusion similarity to obtain the user's semantically normalized label data, and performing collaborative filtering recommendation using the user's semantically normalized label data.

The step of performing a clustering operation on the linear fusion similarity includes: building a label fusion-degree matrix from the linear fusion similarity, and obtaining a preset number K of label clusters from the label fusion-degree matrix; and checking the clustering convergence conditions on the K label clusters, where new user normalized labels are obtained if any one of the conditions is met.

If none of the clustering convergence conditions is met, the step of building the label fusion-degree matrix from the linear fusion similarity is performed again. The clustering convergence conditions are: no data point is reassigned to a different cluster, or the new cluster centers are identical to the previous cluster centers.

The step of obtaining the user's semantically normalized label data from the linear fusion similarity and performing collaborative filtering recommendation using that data includes: building a user interest model from the user's semantically normalized label data, and calculating the interest degree for each interest item t of the semantically normalized label data using the TagBasedTF-IDF algorithm; and making recommendations to the user according to that interest degree and a collaborative filtering algorithm.

The formula used in the step of calculating the label semantic similarity from the word vectors is as follows:

\mathrm{cossim}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \cdot \|\vec{b}\|}

where \vec{a} and \vec{b} denote the word vectors of labels a and b. In the semantic space model, the cosine distance between two word vectors is taken as their semantic similarity.

The formula used in the step of calculating the label-resource co-occurrence similarity from the label-resource matrix is as follows:

\mathrm{sim}(a, b) = \frac{\sum_{i \in N(a) \cap N(b)} n_{a,i} \, n_{b,i}}{\sqrt{\sum_{i \in N(a)} n_{a,i}^{2}} \, \sqrt{\sum_{i \in N(b)} n_{b,i}^{2}}}

where, for a label a, N(a) is the set of items that carry label a and n_{a,i} is the number of users who have applied label a to item i. The cosine similarity formula is thus used to calculate the resource co-occurrence similarity of label a and label b.

The formula used in the step of calculating the linear fusion similarity from the label semantic similarity and the label-resource co-occurrence similarity is as follows:

\mathrm{sim}(\vec{a}, \vec{b}) = \lambda \, \mathrm{sim}(a, b) + (1 - \lambda) \, \mathrm{cossim}(\vec{a}, \vec{b})

where \lambda is an adjustable weight factor.

The user interest model is a normalized semantic label interest model of the user, built according to the vector space model.

The beneficial effects of the invention are as follows. A large number of redundant or semantically inaccurate labels in a previous labeling system can be normalized, so that the semantics of the normalized labels are clearer. Because labels are assigned arbitrarily by users according to their own understanding and preferences, building the user interest model from normalized labels better reflects user interests with clear semantics, and also achieves data dimensionality reduction. By semantically associating labels and building the user interest model from semantically normalized labels, the invention makes good use of label data that reflects users' intentions and preferences, ultimately improving recommendation quality. When applied to a recommendation system, the invention can improve recommendation quality, that is, accuracy and efficiency, and reduce recommendation time.

Brief description of the drawings

Fig. 1 is a flowchart of a recommendation method based on label semantic normalization provided by an embodiment of the present invention;

Fig. 2 is an overall diagram of a recommendation method based on label semantic normalization provided by an embodiment of the present invention;

Fig. 3 is a hit-rate result chart of an algorithm provided by an embodiment of the present invention.

Detailed description of the embodiments

The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.

To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention, without creative effort, fall within the scope of protection of the invention.

Fig. 1 is a flowchart of a recommendation method based on label semantic normalization provided by an embodiment of the present invention. As shown in Fig. 1: S101: pre-process user-defined labels to obtain pre-processed labels; S102: based on the pre-processed labels, obtain the word vectors of all words in the labels using a Word2Vec-trained model, and calculate the label semantic similarity from the word vectors; S102': obtain a label-resource matrix from the pre-processed labels, and calculate the label-resource co-occurrence similarity from the label-resource matrix; S103: calculate the linear fusion similarity from the label semantic similarity and the label-resource co-occurrence similarity, perform a clustering operation on the linear fusion similarity to obtain the user's semantically normalized label data, and perform collaborative filtering recommendation using the user's semantically normalized label data.

Fig. 2 is an overall diagram of a recommendation method based on label semantic normalization provided by an embodiment of the present invention. As shown in Fig. 2, the method provided by the embodiment can be divided into three steps. In the first step, user-defined labels are pre-processed to obtain pre-processed labels; based on the pre-processed labels, the word vectors of all words in the labels are obtained using a Word2Vec-trained model, and the label semantic similarity is calculated from the word vectors; a label-resource matrix is obtained from the pre-processed labels, and the label-resource co-occurrence similarity is calculated from the label-resource matrix.

In the second step, the linear fusion similarity is calculated from the label semantic similarity and the label-resource co-occurrence similarity, and a clustering operation is performed on the linear fusion similarity.

In the third step, the user's semantically normalized label data is obtained, and collaborative filtering recommendation is performed using that data.
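The pre-processing at the start of the first step can be sketched in Python as below. This is a minimal illustration rather than the patented implementation: the function name `preprocess_labels`, the tiny `STOP_WORDS` set, and the lowercasing step are assumptions introduced here, while de-duplication, stop-word removal, and punctuation removal are the operations named in the description.

```python
import string

# Hypothetical stop-word list for illustration; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in"}


def preprocess_labels(raw_labels):
    """Strip punctuation, drop stop words, and de-duplicate labels (order preserved)."""
    strip_punct = str.maketrans("", "", string.punctuation)
    seen = set()
    cleaned = []
    for label in raw_labels:
        # Lowercasing is an assumption made here so that "Film" and "film" merge.
        words = label.lower().translate(strip_punct).split()
        words = [w for w in words if w not in STOP_WORDS]
        normalized = " ".join(words)
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


labels = preprocess_labels(["Sci-Fi!", "scifi", "The Matrix", "the  matrix.", "of"])
print(labels)
```

Labels that become empty after cleaning (such as a lone stop word) are dropped, which is one plausible way to treat noise labels.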

The first step is preferably as follows. The label data in the user-defined labels is pre-processed; pre-processing includes de-duplication, stop-word removal, punctuation removal, and the like. For example, the pre-processed label data is reduced to 15838 entries. The word vectors of all words in the pre-processed labels are obtained using a Word2Vec-trained model. In this step, the word-vector model can be obtained from English Wikipedia data, so that for any input English label its word-vector representation in the Wikipedia-trained model can be given. For example, part of the word-vector representation of the labels is as follows:

The semantic similarity of the labels processed by the Word2Vec-trained model is calculated using the following formula:

\mathrm{cossim}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \cdot \|\vec{b}\|}

where \vec{a} and \vec{b} denote the word vectors of labels a and b. In the semantic space model, the cosine distance between two word vectors is taken as their semantic similarity.
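The cosine similarity between two label word vectors can be sketched in plain Python as follows. The three-dimensional vectors in `word_vectors` are toy values standing in for trained Word2Vec embeddings, and the function name `cos_sim` and the zero-vector fallback are assumptions made for this illustration.

```python
import math


def cos_sim(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values), not output of a real Word2Vec model.
word_vectors = {
    "film": [0.9, 0.1, 0.0],
    "movie": [0.8, 0.2, 0.1],
    "guitar": [0.0, 0.1, 0.9],
}

print(cos_sim(word_vectors["film"], word_vectors["movie"]))
print(cos_sim(word_vectors["film"], word_vectors["guitar"]))
```

With these toy values the pair "film"/"movie" scores much higher than "film"/"guitar", which is the behavior the semantic similarity is meant to capture.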

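The label data in the user-defined labels is pre-processed, a label-resource matrix is built from the pre-processed labels, and the label-resource co-occurrence similarity is calculated from the label-resource matrix using the following formula:

\mathrm{sim}(a, b) = \frac{\sum_{i \in N(a) \cap N(b)} n_{a,i} \, n_{b,i}}{\sqrt{\sum_{i \in N(a)} n_{a,i}^{2}} \, \sqrt{\sum_{i \in N(b)} n_{b,i}^{2}}}

where N(a) is the set of items that carry label a and n_{a,i} is the number of users who have applied label a to item i.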
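The label-resource co-occurrence similarity can be sketched in Python as follows. The input format, a list of `(user, label, item)` triples, and the names `build_label_resource_matrix` and `cooccurrence_sim` are assumptions for illustration; the matrix entry for a label and an item is the number of distinct users who applied that label to that item, matching n_{a,i}.

```python
import math
from collections import defaultdict


def build_label_resource_matrix(records):
    """Return {label: {item: number of distinct users who applied the label to the item}}."""
    users = defaultdict(set)
    for user, label, item in records:
        users[(label, item)].add(user)
    matrix = defaultdict(dict)
    for (label, item), who in users.items():
        matrix[label][item] = len(who)
    return matrix


def cooccurrence_sim(matrix, a, b):
    """Cosine similarity between the item-count rows of labels a and b."""
    row_a, row_b = matrix.get(a, {}), matrix.get(b, {})
    shared = set(row_a) & set(row_b)  # items in N(a) ∩ N(b)
    numerator = sum(row_a[i] * row_b[i] for i in shared)
    denominator = math.sqrt(sum(n * n for n in row_a.values())) * math.sqrt(
        sum(n * n for n in row_b.values())
    )
    return numerator / denominator if denominator else 0.0


# Toy tagging records (assumed data).
records = [
    ("u1", "film", "i1"),
    ("u2", "film", "i1"),
    ("u1", "movie", "i1"),
    ("u3", "movie", "i2"),
    ("u2", "guitar", "i3"),
]
matrix = build_label_resource_matrix(records)
print(cooccurrence_sim(matrix, "film", "movie"))
print(cooccurrence_sim(matrix, "film", "guitar"))
```

Labels that never appear on the same item get a similarity of 0, so this measure complements the word-vector similarity rather than replacing it.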

The second step is preferably as follows. The label semantic similarity and the label-resource co-occurrence similarity are linearly fused according to the following formula to obtain the linear fusion similarity:

\mathrm{sim}(\vec{a}, \vec{b}) = \lambda \, \mathrm{sim}(a, b) + (1 - \lambda) \, \mathrm{cossim}(\vec{a}, \vec{b})

where \lambda is an adjustable weight factor. The hit rate (hit-rate) of the embodiment varies with the value of \lambda chosen for the label fusion similarity; with the number of neighbors set to 20, 40, 60, and 80 respectively, the hit-rate results measured on the data set are shown in Fig. 3.
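The linear fusion can be sketched in Python as follows. The function name `fused_sim`, the default value of `lam`, and the restriction of the weight to the interval [0, 1] are assumptions made for this illustration; the text only states that the weight factor is adjustable.

```python
def fused_sim(cooc_sim, semantic_sim, lam=0.5):
    """Linear fusion: lam * co-occurrence similarity + (1 - lam) * semantic similarity."""
    if not 0.0 <= lam <= 1.0:
        # Assumed constraint: the weight is treated as a convex-combination coefficient.
        raise ValueError("lam must lie in [0, 1]")
    return lam * cooc_sim + (1.0 - lam) * semantic_sim


# Sweep a few weights for one label pair with assumed similarity values.
for lam in (0.0, 0.25, 0.5, 1.0):
    print(lam, fused_sim(cooc_sim=0.7, semantic_sim=0.9, lam=lam))
```

Setting `lam` to 0 uses only the word-vector similarity and setting it to 1 uses only the co-occurrence similarity, so a sweep of this kind is one way to reproduce the type of comparison reported against the hit rate.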

A label fusion similarity matrix is built from the linear fusion similarity. Using this matrix, the fusion similarity between each label in the user-defined label set T_u and each of the k cluster centers is calculated, and each label is assigned to the cluster of the center with which it has the greatest similarity.

Based on the above result, the convergence conditions are checked. If a convergence condition is met, at least one cluster is obtained, each comprising its cluster center (the normalized label) and the set of user-defined labels belonging to that cluster. Each user-defined label is replaced by the cluster center (normalized label) of its cluster, forming the user's semantically normalized labels; the new user, normalized label, and resource data are denoted {U, T_S, I}. If no convergence condition is met, the step of building the label fusion similarity matrix from the linear fusion similarity is performed again: the center of each of the k clusters is recalculated so that the sum of the semantic similarities between the user-defined labels assigned to the cluster and the new center is maximized, and the cluster centers are updated repeatedly until a convergence condition is met.

The convergence conditions are: no data point is reassigned to a different cluster, or the new cluster centers are identical to the previous cluster centers.
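The clustering loop and the two convergence conditions can be sketched in Python as follows. This is a medoid-style illustration under stated assumptions: the initial centers are supplied by the caller (the text does not say how they are chosen), `sim` is any pairwise fused-similarity function, and the names `cluster_labels`, `normalize_records`, `toy_sim`, and the `max_iter` safeguard are hypothetical.

```python
def cluster_labels(labels, sim, initial_centers, max_iter=100):
    """Cluster labels around K center labels; return (centers, {label: center label})."""
    centers = list(initial_centers)
    assignment = None
    for _ in range(max_iter):
        # Assign every label to the center with the largest fused similarity.
        new_assignment = {
            t: max(range(len(centers)), key=lambda j: sim(t, centers[j]))
            for t in labels
        }
        # New center of a cluster: the member whose summed similarity to all
        # members of that cluster is largest.
        new_centers = []
        for j, old_center in enumerate(centers):
            members = [t for t in labels if new_assignment[t] == j]
            if not members:
                new_centers.append(old_center)  # keep the center of an empty cluster
                continue
            new_centers.append(
                max(members, key=lambda m: sum(sim(m, t) for t in members))
            )
        # Converged if no label changed cluster, or the centers did not change.
        converged = new_assignment == assignment or new_centers == centers
        centers, assignment = new_centers, new_assignment
        if converged:
            break
    return centers, {t: centers[assignment[t]] for t in labels}


def normalize_records(records, label_to_center):
    """Replace each user-defined label in (user, label, item) triples by its cluster center."""
    return [(u, label_to_center.get(t, t), i) for u, t, i in records]


# Toy fused similarities (assumed values); unlisted pairs default to 0.1.
PAIRS = {
    frozenset(("film", "movie")): 0.9,
    frozenset(("film", "cinema")): 0.8,
    frozenset(("movie", "cinema")): 0.6,
    frozenset(("guitar", "piano")): 0.7,
}


def toy_sim(a, b):
    return 1.0 if a == b else PAIRS.get(frozenset((a, b)), 0.1)


all_labels = ["film", "movie", "cinema", "guitar", "piano"]
centers, label_to_center = cluster_labels(all_labels, toy_sim, ["cinema", "piano"])
normalized = normalize_records(
    [("u1", "movie", "i1"), ("u2", "cinema", "i1"), ("u2", "piano", "i2")],
    label_to_center,
)
print(centers)
print(normalized)
```

Using a member label as the center, rather than an averaged vector, keeps every normalized label a real, readable label, which is what the replacement step requires.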

The user label clusters are obtained from the clustering operation, and each user-defined label is replaced by the cluster center of its cluster, that is, by the semantically normalized label. In this step, several representative user-defined labels are selected together with the semantically normalized labels at the centers of their clusters, and the user-defined labels are arranged in descending order of their fusion similarity to the cluster center as follows:

The third step is preferably as follows. Using the user's semantically normalized label data and the TagBasedTF-IDF algorithm, the interest degree of each user u in each label interest item t is calculated after the user's custom labels have been semantically normalized.

The user's normalized semantic label interest model is then built from the interest degrees of the label interest items t using the VSM (vector space model).
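The interest-degree calculation can be sketched in Python as follows. The text names the TagBasedTF-IDF algorithm but does not give its formula, so the weighting used here is an assumption: the number of times user u applied label t, divided by log(1 + number of distinct users who applied t), which damps labels that almost everyone uses. The function name `build_interest_model` and the input triples are also hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_interest_model(normalized_records):
    """Return {user: {label: weight}} from (user, normalized_label, item) triples."""
    usage = defaultdict(Counter)    # usage[user][label] = times the user applied the label
    label_users = defaultdict(set)  # label_users[label] = distinct users of the label
    for user, label, _item in normalized_records:
        usage[user][label] += 1
        label_users[label].add(user)
    model = {}
    for user, counts in usage.items():
        # Assumed TF-IDF-style weight: frequent for this user, damped by global popularity.
        model[user] = {
            label: n / math.log(1 + len(label_users[label]))
            for label, n in counts.items()
        }
    return model


# Toy normalized tagging data (assumed).
interest_model = build_interest_model([
    ("u1", "film", "i1"),
    ("u1", "film", "i2"),
    ("u1", "guitar", "i3"),
    ("u2", "film", "i1"),
])
print(interest_model["u1"])
print(interest_model["u2"])
```

Each inner dictionary is a sparse form of the vector of (label, weight) pairs that the vector space model uses to represent a user.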

Specifically, the user interest model built with the VSM (vector space model) is in fact an n-dimensional vector {(T₁,W₁),(T₂,W₂),...,(T_n,W_n)}, which represents the aspects the user is interested in and the degree of interest in each aspect, where T_i denotes an interest feature item, that is, a semantically normalized label item substituted in the second step, and W_i is the weight of feature item T_i. The similarity between users is then calculated from the resulting user interest models, and the user's K nearest-neighbor users for Top-N recommendation are obtained. Finally, the user's predicted interest degree for unlabeled items is calculated and the recommendation list is obtained. The results are validated using recall, precision, coverage, and popularity. In the collaborative filtering recommendation algorithm, the preset number K of neighbor users has a certain effect on recommendation precision: if K is too small, a sufficient set of items cannot be obtained, and if K is too large, the search cost of the algorithm increases. Different values of K give different precision, and hence different algorithm accuracy; the experimental results are shown in the following table:

From the users' normalized semantic label interest models, the cosine similarity between the vector of the target user and the vectors of the other users is obtained, and the preset number K of users most similar to the target user, that is, the K nearest-neighbor users, are found. From the target user's K nearest-neighbor users, the predicted interest of user u in each unlabeled resource i is derived.
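The neighbor search and interest prediction can be sketched in Python as follows. The user vectors are sparse dictionaries of label weights. The text does not state the prediction formula, so the score used here is an assumption: for each resource the target user has not labeled, the similarities of the neighbors who labeled it are summed. All function names and the toy data are hypothetical.

```python
import math


def sparse_cos(p, q):
    """Cosine similarity of two sparse vectors given as {label: weight} dicts."""
    dot = sum(w * q[t] for t, w in p.items() if t in q)
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def nearest_neighbours(model, target, k):
    """Return the k users most similar to target as (similarity, user) pairs."""
    scored = [
        (sparse_cos(model[target], vec), other)
        for other, vec in model.items()
        if other != target
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by user id
    return scored[:k]


def predict_interest(neighbours, user_items, target):
    """Score each resource the target has not labeled by summed neighbor similarity."""
    seen = user_items.get(target, set())
    scores = {}
    for sim, other in neighbours:
        for item in user_items.get(other, set()) - seen:
            scores[item] = scores.get(item, 0.0) + sim
    return scores


# Toy interest vectors and labeled resources (assumed data).
model = {
    "u1": {"film": 2.0, "guitar": 1.0},
    "u2": {"film": 1.0},
    "u3": {"guitar": 3.0},
}
user_items = {"u1": {"i1"}, "u2": {"i1", "i2"}, "u3": {"i3"}}

neighbours = nearest_neighbours(model, "u1", k=2)
scores = predict_interest(neighbours, user_items, "u1")
print(neighbours)
print(scores)
```

A larger `k` brings more candidate resources into `scores` at the cost of comparing against more users, which mirrors the trade-off on K described in the text.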

The resources are sorted by predicted interest degree in descending order, and the first N resources form the Top-N recommendation set recommend_list={i₁,i₂,...,i_N}, which is output. Collaborative filtering recommendation is thus performed using the user's semantically normalized label data.
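The final ranking step can be sketched in Python as follows. The function name `top_n`, the tie-breaking by resource identifier, and the example scores are assumptions for illustration.

```python
def top_n(scores, n):
    """Return the n resources with the highest predicted interest, best first."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _score in ranked[:n]]


recommend_list = top_n({"i2": 0.89, "i3": 0.45, "i7": 0.61}, n=2)
print(recommend_list)  # ['i2', 'i7']
```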

The specific embodiments described above further explain the objects, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the foregoing is merely a specific embodiment of the invention and is not intended to limit its scope of protection; any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A recommendation method based on label semantic normalization, characterized by comprising the following steps:

pre-processing user-defined labels to obtain pre-processed labels;

based on the pre-processed labels, obtaining the word vectors of all words in the labels using a Word2Vec-trained model, and calculating the label semantic similarity from the word vectors;

obtaining a label-resource matrix from the pre-processed labels, and calculating the label-resource co-occurrence similarity from the label-resource matrix;

calculating a linear fusion similarity from the label semantic similarity and the label-resource co-occurrence similarity, performing a clustering operation on the linear fusion similarity to obtain the user's semantically normalized label data, and performing collaborative filtering recommendation using the user's semantically normalized label data.

2. The method according to claim 1, characterized in that the step of performing a clustering operation on the linear fusion similarity includes:

building a label fusion-degree matrix from the linear fusion similarity, and obtaining a preset number K of label clusters from the label fusion-degree matrix;

checking the clustering convergence conditions on the K label clusters, where new user normalized labels are obtained if any one of the conditions is met.

3. The method according to claim 2, characterized in that the clustering convergence conditions are checked on the K label clusters, and if none of the conditions is met, the step of building the label fusion-degree matrix from the linear fusion similarity is performed again.

4. The method according to claim 2 or 3, characterized in that the clustering convergence conditions include:

no data point is reassigned to a different cluster, or the new cluster centers are identical to the previous cluster centers.

5. The method according to claim 1, characterized in that the step of obtaining the user's semantically normalized label data from the linear fusion similarity and performing collaborative filtering recommendation using the user's semantically normalized label data includes:

building a user interest model from the user's semantically normalized label data, and calculating the interest degree for each interest item t of the user's semantically normalized label data using the TagBasedTF-IDF algorithm;

making recommendations to the user according to the interest degree for each interest item t of the user's semantically normalized label data and a collaborative filtering algorithm.

6. The method according to claim 1, characterized in that the formula used in the step of calculating the label semantic similarity from the word vectors is as follows:

\mathrm{cossim}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \cdot \|\vec{b}\|}

where \vec{a} and \vec{b} denote the word vectors of labels a and b; in the semantic space model, the cosine distance between two word vectors is taken as their semantic similarity.

7. The method according to claim 1, characterized in that the formula used in the step of calculating the label-resource co-occurrence similarity from the label-resource matrix is as follows:

\mathrm{sim}(a, b) = \frac{\sum_{i \in N(a) \cap N(b)} n_{a,i} \, n_{b,i}}{\sqrt{\sum_{i \in N(a)} n_{a,i}^{2}} \, \sqrt{\sum_{i \in N(b)} n_{b,i}^{2}}}

where, for a label a, N(a) is the set of items that carry label a and n_{a,i} is the number of users who have applied label a to item i; the cosine similarity formula is used to calculate the resource co-occurrence similarity of label a and label b.

8. The method according to claim 1, characterized in that the formula used in the step of calculating the linear fusion similarity from the label semantic similarity and the label-resource co-occurrence similarity is as follows:

\mathrm{sim}(\vec{a}, \vec{b}) = \lambda \, \mathrm{sim}(a, b) + (1 - \lambda) \, \mathrm{cossim}(\vec{a}, \vec{b})

where \lambda is an adjustable weight factor.

9. The method according to claim 5, characterized in that the user interest model is a normalized semantic label interest model of the user built according to the vector space model (VSM).