CN102004774A - Personalized user tag modeling and recommendation method based on unified probability model

Info

Publication number: CN102004774A
Application number: CN 201010546780
Authority: CN
Inventors: 唐杰; 张宁
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2010-11-16
Filing date: 2010-11-16
Publication date: 2011-04-06
Anticipated expiration: 2030-11-16
Also published as: CN102004774B

Abstract

The invention discloses a personalized user tag modeling and recommendation method based on a unified probabilistic model, comprising the following steps: S1, collecting statistics on the tagging behavior of users on a social tagging site; S2, formally defining the user tagging problem; S3, establishing a topic model based on user tagging, wherein the topic model is a unified probabilistic model called the UdT model; S4, establishing the framework of a tag recommendation system based on the UdT model, wherein the framework makes recommendations by learning the interests of users and using the semantic information contained in those interests; and S5, verifying the framework of the tag recommendation system. Experimental results show that the method can effectively discover user interests and improve the accuracy of tag recommendation.

Description

Personalized user tag modeling and recommendation method based on a unified probabilistic model

Technical field

The invention belongs to the field of Internet technology and relates to techniques for learning, understanding, and recommending personalized user tags on social tagging websites; specifically, it is a personalized user tag modeling and recommendation method based on a unified probabilistic model.

Background art

Social tagging is a key feature of Web 2.0. It allows users to freely annotate all kinds of resources, such as web pages, academic papers, and multimedia resources. Social tags help users classify, organize, and search for information, and they are valuable for many practical applications, including web search, query expansion, personalized search, and the classification and clustering of web resources. With the emergence and rapid growth of social tagging websites, for example social tagging sites (Flickr, Picasa, YouTube, Plaxo), blogs (Blogger, WordPress, LiveJournal), wikis (Wikipedia, PBWiki), and microblogs (Twitter, Jaiku), tagging systems have become one of the important means of organizing rapidly growing community data.

Recently, tag recommendation has become a major focus of social tagging research. Tag recommendation means recommending the most relevant tags for a resource that a user shares. It serves two purposes. First, for the social tagging website, tag recommendation enlarges the tag set of a resource and thereby enlarges the index used when retrieving resources. Second, for the user, as with other recommender systems, it aims to improve the user experience during tagging and to shorten the time the user spends thinking. Tag recommendation in practice is complex and challenging. First, resource popularity on real social tagging websites follows a power law, which means that the vast majority of resources are tagged only once or twice, so a given resource is quite likely to have been tagged by only one user or by none. In that case collaborative filtering is no longer suitable, and it becomes necessary to explore the relations between web resources and the tags attached to other, similar resources. Second, different users may tag the same resource with different tags, depending on personal habits. A personalized tag recommendation system is therefore needed to improve the user experience and encourage users to tag more resources. Personalized tag recommendation draws on the user's tagging history in order to recommend tags for a specific resource to a specific user.

Current personalized tag recommendation methods fall into two main groups: (1) content-based methods and (2) graph-based methods. Content-based methods usually learn user interests from textual information (web page content, academic papers, tags, and resource descriptions) and can therefore make recommendations for new users and new resources. Compared with content-based methods, graph-based methods usually rely on more assumptions and constraints, for example that every resource and user for which a recommendation is requested already appears in the historical data. In practice this assumption usually cannot be satisfied, because a tag recommendation system must still make reasonable recommendations when it knows nothing about the resource or the user. Comparing the two, content-based methods have the advantage of handling new users and new resources, but their accuracy is lower than that of graph-based methods. Graph-based methods are accurate but apply only to existing users and existing resources and cannot handle new ones.

To make full use of the network structure of a social tagging system, the relations among users, resources, and tags must be modeled. Many studies have modeled social tagging networks. For example, a social tagging system has been described as a tripartite network whose nodes are users, tags, and resources; this tripartite network is decomposed into a bipartite network and a unipartite network in order to learn its latent structure. Some researchers model the social tagging system as a tripartite network by adding a social dimension (the user), extending the ontology model of the traditional bipartite network to three parts. Others have proposed a social tag graph in which tags act as bridges connecting different resources in heterogeneous domains, and have designed a semi-supervised classification algorithm on that graph. All of these methods study social tagging systems on a network graph. Another line of work simulates the social tagging process with a generative model. For example, Wu et al. designed a probabilistic generative model in which the three entities of a social tagging system (tags, resources, users) are mapped into the same conceptual space, represented by a multi-dimensional vector in which each dimension corresponds to a category of knowledge. In addition, hierarchical Bayesian models based on LDA (Latent Dirichlet Allocation) and PLSA (Probabilistic Latent Semantic Analysis) have also been used to model social tagging.

The rise of Web 2.0 has driven progress in tag recommendation. Some methods are based on the history of user tagging. For example, AutoTag is a tag recommendation system designed specifically for blogs by Gilad Mishne. The system first uses information retrieval methods to estimate the similarity between blog posts, finds posts similar to the one for which tags are to be recommended, and then ranks the tags attached to those similar posts by frequency of use to obtain the recommended tags. The system also takes user information into account, and the information retrieval method it uses is comparatively simple. Another tag recommendation method is the FolkRank algorithm, which exploits the graph structure of the social tagging network and is an extension of the well-known PageRank algorithm. Some researchers learn a ranking of tags through tensor factorization and recommend accordingly; others use tensor dimensionality reduction for tag recommendation. These graph-based methods depend on a fairly dense social tagging network. Besides them, some semantics-based methods are also effective, for example the algorithm designed by Wu et al. None of these methods, however, considers the specific interests of the user.

Xu et al. use collaborative tagging information for tag recommendation. Their method tends to recommend tags that many users have attached to the target resource, and it tries to make the recommended tags cover every facet of the resource by minimizing the conceptual overlap among them. Like the method used by the Del.icio.us website, this algorithm cannot handle new resources. Other researchers designed the P-tag algorithm to generate personalized tags for web pages automatically; the generated tags relate not only to the text of the web page but also to the content of the files on the viewer's desktop. Other work addresses tag recommendation on the Flickr website: when a user submits a picture together with some tags, the system automatically shows the user a ranked list of candidate tags, generated from the tags the user has already entered and their co-occurrence with other tags. This approach depends on the user entering some tags manually before the system recommends further ones, so it cannot be applied when a resource has not been tagged by any user. Moreover, because only co-occurrence data is considered, topic drift may occur. A personalized interactive tag recommendation system has also been introduced, again for Flickr, in which the system gives special consideration to the user's own tagging data. Because this algorithm also depends on existing tags, it shares the shortcomings of the previous method.

More and more researchers are turning to user-dependent information, hoping to learn more about users and to understand their latent interests and preferences from their tagging behavior. Some researchers have tried to use a user's earlier tagging information for recommendation: the tags a user has used before reveal much of the user's preferences and interests and are very helpful for recommendation. Some analyze users' web browsing behavior to predict which tags a user will apply to a given picture. Some use hierarchical tag clustering for personalized tag recommendation. Others have studied efficient real-time tag recommendation systems, and still others have designed automatic tagging systems for text search and digital libraries.

Because the problem space is huge, efficiency is as important as accuracy. The methods above use graph partitioning to improve accuracy while reducing algorithmic complexity. In practical applications the data sets are very large and users expect recommendations in real time, so performing personalized recommendation efficiently is a major challenge in this field. The dynamic nature of social tagging is a further research question.

Summary of the invention

(1) Technical problem to be solved

The technical problem to be solved by the invention is how to provide a personalized user tag modeling and recommendation method for the Internet that defines personalized tagging behavior and predicts, from a user's tagging history, the tags that the user will apply to a given resource.

(2) Technical solution

To solve the above technical problem, the invention provides a personalized user tag modeling and recommendation method based on a unified probabilistic model, comprising the following steps:

S1, collecting statistics on the tagging behavior of users on a social tagging website;

S2, formally defining the user tagging problem;

S3, establishing a topic model based on user tagging, which is a unified probabilistic model called the UdT model; a unified probabilistic model is one in which all modeling tasks are described within a single probabilistic model;

S4, establishing the framework of a tag recommendation system based on the UdT model, wherein the framework makes recommendations by learning user interests and using the semantic information contained in those interests;

S5, verifying the framework of the tag recommendation system.

Step S2 specifically comprises the following steps:

S21, formalizing a user's tagging action as a triple comprising three elements: user, tag, and resource;

S22, formally defining the topic distributions in the tagging problem. Specifically, a T-dimensional topic distribution vector θ_{u} ∈ R^{T} is established for each user u ∈ U, whose entries satisfy Σ_{z} θ_{uz} = 1, each element θ_{uz} denoting the probability that user u is interested in topic z; and a T-dimensional topic distribution vector θ ∈ R^{T} is established for each document d ∈ D, which may involve different topics, whose entries satisfy Σ_{z} θ_{z} = 1, each element θ_{z} denoting the probability that document d involves topic z, where R denotes the real numbers;

S23, establishing a topic model based on user interest, in which user interest is described as a mixture of topics, with a different probability of interest for each topic; the model represents the tags t used by the user with a multinomial distribution {p(t|θ_{u})}, and the tags with the highest probability in the distribution {p(t|θ_{u})} represent the topic semantically;

S24, establishing a topic model of documents, which consists of two multinomial distributions: the probability distribution {p(w|θ)} over words w and the probability distribution {p(t|θ)} over tags t, where θ denotes the multinomial distribution over the topics of document d.

Step S3 is specifically:

Two classes of unknown parameters in the UdT model are estimated: (1) the topic distributions θ of the M documents, the topic distributions θ_{u} based on user interest, the Bernoulli distributions λ of the M documents, and the word distributions φ of the T topics; (2) for each tag t_{di}, its associated coin-toss outcome s_{di} and its assigned topic z_{di}, where the coin-toss outcome follows the Bernoulli distribution λ; for each word w_{di} in document d, its associated topic z'_{di}; and for each tag t_{ui} used by user u, its associated topic z_{ui}.

The parameters are estimated as follows: first, (a) the posterior distribution over topics z is estimated and used to estimate the topic distribution θ_{u} in the first generative process; then, (b) the posterior distribution over the coin-toss outcome s and the topic z is estimated and used to obtain the parameters θ, λ, φ, and ψ of the second generative process, where ψ is the word distribution. The first generative process models the topic distribution of user interest; the second generative process models the topic distribution of tagged documents.

In step S4, the UdT model is combined with a language model to establish the framework of the tag recommendation system.

The UdT model is combined with the language model as follows:

First, the scores computed by the two models are normalized, and the two scores are then added according to their respective weights, so that tags appearing in the candidate set of only one model can still be found; or

The tags recommended by the UdT model are ranked first, and a certain number of top-ranked tags are then selected and re-ranked with an information retrieval method.

(3) Beneficial effects

The invention designs a user-dependent topic model (UdT) that simultaneously models user interests and the topic distributions of documents. Compared with existing methods, the UdT model can automatically identify which of the applied tags depend on the user's particular interests and which are determined by the overall topic distribution of the resource. The UdT model is then used to solve the tag recommendation problem, with two different combination strategies that use the UdT model to improve recommendation accuracy. Experimental results show that the proposed method can effectively discover user interests and improve the accuracy of tag recommendation.

Description of the drawings

Fig. 1 is a diagram of the UdT model proposed by the invention;

Fig. 2 shows the framework of the tag recommendation system designed in the method of the invention;

Fig. 3 shows the starting point of the model design of the invention (an example of a social tagging website);

Fig. 4 is a diagram of the ACT model;

Fig. 5 is a precision chart for the Bibsonomy data set, in which LM denotes recommending tags with the language model; ACT denotes recommending tags by combining the language model with the ACT model; UdT1 denotes recommending tags with the UdT model under combination strategy one; and UdT2 denotes recommending tags with the UdT model under combination strategy two;

Fig. 6 is a schematic diagram of tag recommendation performance as a function of the number of topics;

Fig. 7 is a diagram of the LDA model;

Fig. 8 compares the user-dependent topic model with the general topic model, in which UdT denotes the model designed by the invention and UdT- denotes the topic model that does not take user interest into account;

Fig. 9 is a schematic diagram of a case study;

Fig. 10 is a flow chart of the method of an embodiment of the invention.

Embodiments

The invention is further described below with reference to the drawings and specific embodiments.

Through statistical analysis of real data, the invention studies users' tagging behavior and tagging purposes and formally defines the personalized tagging problem of social tagging websites. A tagging action is formalized as a triple, each user's interest is described as a topic distribution, and each tag attached to a resource is modeled either as a general tag or as a tag based on the user's particular interests, both being learned in a single probabilistic generative process. A unified probabilistic model (the User-dependent Tagging Model, UdT model for short) is proposed to describe user-dependent tagging behavior; the model estimates the combination of the general topic distribution and the user-specific topic distribution. A tag recommendation method based on the UdT model is then designed, with two recommendation strategies: a linear sum of scores, and re-ranking with a language model. Finally, the method is compared with baseline algorithms (a basic language model and the Author-Conference-Topic model, ACT model) on real data from the Bibsonomy website. As shown in Fig. 10, the method of the invention comprises the following steps:

Step 1: Collect statistics on the personalized tagging behavior of users on a social tagging website;

Step 2: Formally define the user tagging problem (which may be called the social tagging problem);

Step 2 comprises:

(1) Formalize a user's tagging action as a triple. A social tagging website has the following elements: users, tags, and resources. A user is denoted u ∈ U, where U is the set of all users; a tag is denoted t ∈ T, where T is the set of all tags; a resource is denoted r ∈ R, where R is the set of all resources. Users, tags, and resources form triples. The input data of the tag recommendation problem (the users' annotations) is D = {(u, r, t)}, u ∈ U, r ∈ R, t ∈ T, and the output (the recommended tags) is T(u, r) = arg max_{t} P(t|u, r). A single tagging action can therefore be regarded as a triple (r_{i}, t_{j}, u_{k}), meaning that user u_{k} annotated resource r_{i} with the tag set t_{j}. To learn a user-dependent tagging model, the training data is D = {(r_{1}, t_{1}, u_{1}), ..., (r_{N}, t_{N}, u_{N})}, where t_{i} denotes the set of tags that user u_{i} applied to resource r_{i} and N is the total number of tagging actions. The social tagging problem concerns modeling and analyzing the tags already applied on a social tagging website, whereas the tag recommendation problem concerns recommending tags for new resources on the basis of that analysis.
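The triple formalization can be illustrated with a small Python sketch. The `Annotation` type, the helper `tags_by_user_resource`, and the toy data are assumptions made for illustration only; for simplicity each record holds a single tag, and the tag set t_i of a tagging action is recovered by grouping.

```python
from collections import defaultdict
from typing import NamedTuple


class Annotation(NamedTuple):
    """One (resource, tag, user) record; hypothetical type for illustration."""
    resource: str
    tag: str
    user: str


def tags_by_user_resource(annotations):
    """Group the records into the tag set each user applied to each resource."""
    grouped = defaultdict(set)
    for a in annotations:
        grouped[(a.user, a.resource)].add(a.tag)
    return dict(grouped)


# Hypothetical toy data, not taken from the Bibsonomy experiments.
D = [
    Annotation("paper-1", "lda", "alice"),
    Annotation("paper-1", "topic-model", "alice"),
    Annotation("paper-1", "to-read", "bob"),
    Annotation("paper-2", "tagging", "bob"),
]
tag_sets = tags_by_user_resource(D)
```

Here `tag_sets[("alice", "paper-1")]` plays the role of the tag set t_i for one tagging action, and the two users' different tags on `paper-1` show the personal variation the model is meant to capture.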

In addition, the other symbols used in the invention and their meanings are listed in Table 1 below:

Table 1

(2) Formally define the topic distributions in the social tagging problem. In the social tagging problem, a user usually has many different interests and corresponding topics. Formally, each user u ∈ U has a corresponding T-dimensional topic distribution vector θ_{u} ∈ R^{T} whose entries satisfy Σ_{z} θ_{uz} = 1; each element θ_{uz} denotes the probability that user u is interested in topic z. Similarly, a document may contain information about several different topics, so each document d ∈ D also has a corresponding T-dimensional topic distribution vector θ ∈ R^{T} whose entries satisfy Σ_{z} θ_{z} = 1;

each element θ_{z} denotes the probability that document d involves topic z.

(3) Establish a topic model based on user interest. The topic model of user u's interest is represented by a multinomial distribution {p(t|θ_{u})} over the tags the user has used. In the invention, a user's tagging behavior is regarded as an expression of the user's interests, so those interests can be examined through the tags the user has used. A user's interest is described as a mixture of topics, with a different probability of interest for each topic. The model assumes that the tags a user uses follow the tag distribution p(t|θ_{u}) corresponding to each topic. The tags with the highest probability in the distribution therefore represent the topic semantically.

(4) Establish a topic model of documents. Unlike the topic model of user interest, the topic model of a document consists of two multinomial distributions: the probability distribution {p(w|θ)} over words w and the probability distribution {p(t|θ)} over tags t. The model assumes that the words in a document are sampled from {p(w|θ)} and that the tags attached to the document are sampled from {p(t|θ)}.

Step 3: Propose a user-dependent tagging model (User-dependent Tagging Model, UdT model for short), shown in Fig. 1. The model simultaneously models the topic distributions of documents, tags, and users, and can distinguish personalized tagging behavior from general tagging behavior. The basic idea of the UdT model is to model the tagged documents and the user interests at the same time with two related generative processes.

The first generative process (the right half of Fig. 1) models the topic distribution of user interest. It proceeds as follows: (1) for each topic z, draw a tag distribution φ_{z} and a word distribution φ_{z'} for that topic, where φ_{z} and φ_{z'} follow Dirichlet distributions with priors β and β' respectively; (2) for each user u ∈ U, first draw θ_{u} from a Dirichlet distribution with prior α_{u} as the topic distribution of user u; then, for each tag t_{ui} used by user u, draw a topic z_{ui} from the topic distribution θ_{u} and draw a word w_{ui} from the word distribution of that topic.

The second generative process (the left half of Fig. 1) models the topic distribution of tagged documents. It proceeds as follows: for each document d, first draw θ_{d} from a Dirichlet distribution with prior α as the topic distribution of document d. Next, the value of s decides whether a tag is related to the user's personal interests or to the general topics of the document; s is governed by λ = P(s = 0 | d) ~ Beta(γ_{u}, γ). Then, for each tag t_{di} attached to document d, toss a coin for the tag to obtain s_{di}, where s_{di} follows the Bernoulli distribution with parameter λ, s_{di} ~ Bernoulli(λ). If s = 0, draw a topic z_{ui} from the user-based topic distribution θ_{u} and draw the tag t_{di} from the tag distribution of that topic; otherwise, draw a topic z_{di} from the general topic distribution θ_{d} and draw the tag t_{di} from the tag distribution of that topic. Finally, for each word w_{di} of document d, draw a topic z'_{di} from θ_{d} and then draw the word w_{di} from the word distribution of that topic. In this way the model uses a Bernoulli distribution to decide whether a tag attached to a document is based on the user's personal interests.
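The tag-generating part of the second generative process can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the helper names, the symmetric priors, and the toy sizes are hypothetical, and word generation is omitted for brevity.

```python
import random


def dirichlet(alpha, size, rng):
    """Draw a symmetric Dirichlet(alpha) sample via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def categorical(probs, rng):
    """Draw an index according to the given probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document_tags(theta_u, phi, n_tags, alpha, gamma_u, gamma, rng):
    """Generate the tags one user attaches to one document."""
    theta_d = dirichlet(alpha, len(theta_u), rng)   # document topic distribution
    lam = rng.betavariate(gamma_u, gamma)           # lambda = P(s = 0 | d)
    tags = []
    for _ in range(n_tags):
        s = 0 if rng.random() < lam else 1          # coin toss for this tag
        source = theta_u if s == 0 else theta_d     # user interest vs. document topics
        z = categorical(source, rng)                # topic assigned to this tag
        t = categorical(phi[z], rng)                # tag drawn from the topic's tag distribution
        tags.append((s, z, t))
    return theta_d, lam, tags


rng = random.Random(0)
n_topics, n_tag_types = 3, 5                        # hypothetical sizes
phi = [dirichlet(0.5, n_tag_types, rng) for _ in range(n_topics)]
theta_u = dirichlet(0.5, n_topics, rng)             # topic distribution of one user
theta_d, lam, tags = generate_document_tags(theta_u, phi, 4, 0.5, 1.0, 1.0, rng)
```

Each generated entry records the coin value, the topic, and the tag index, which makes it easy to see which tags were attributed to the user's interests (s = 0) and which to the document's general topics (s = 1).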

Step 3 comprises:

Step 3-1: Analyze the parameters to be estimated. To solve the UdT model, its unknown parameters must be estimated; once they are determined, the UdT model is obtained. The UdT model has two classes of unknown parameters: (1) the topic distributions θ of the M documents, the topic distributions θ_{u} of user interest, the Bernoulli distributions λ of the M documents, and the word distributions φ of the T topics; (2) for each tag t_{di}, its associated coin-toss outcome s_{di} and assigned topic z_{di}; for each word w_{di} in document d, its associated topic z'_{di}; and for each tag t_{ui} used by user u, its associated topic z_{ui}. Solving such a probabilistic model exactly is generally intractable. Instead of estimating the model parameters directly, the invention first estimates (a) the posterior distribution over topics z and uses it to estimate the topic distribution θ_{u} in the first generative process, and then estimates (b) the posterior distribution over the coin-toss outcome s and the topic z and uses it to obtain the parameters θ, λ, φ, and ψ of the second generative process, where ψ is the word distribution.

For the estimation in (a), the sampling algorithm is similar to that of the LDA model. The difference is that the LDA model counts the probability distribution of topics occurring in each document, whereas here the counts concern the topics sampled for each user. The posterior probability used for this purpose is expressed in terms of the following counts:

n_{uz} is the number of times topic z has been sampled from the multinomial topic distribution of user u; n_{zt} is the number of times tag t has been generated by topic z; and the superscript -ui in a count n^{-ui} denotes the count excluding the current instance. The superscripts and subscripts in the other formulas of the invention follow the same convention.
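The posterior for step (a) does not appear in this text. Assuming the sampler mirrors the standard collapsed Gibbs update for LDA, as the comparison with LDA suggests, it would take a form such as the following. This is an assumed reconstruction from the count definitions n_{uz}, n_{zt}, and the -ui convention, not the patent's own formula.

```latex
P(z_{ui} = z \mid z_{-ui}, t_{u}, \alpha_{u}, \beta) \propto
\frac{n_{uz}^{-ui} + \alpha_{u}}{\sum_{z'} (n_{uz'}^{-ui} + \alpha_{u})}
\cdot
\frac{n_{z t_{ui}}^{-ui} + \beta}{\sum_{t} (n_{zt}^{-ui} + \beta)}
```

The first factor favors topics the user has already been assigned often, and the second favors topics that have already generated the tag t_{ui}.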

For the estimation in (b), the principle is similar to that of (a) and Gibbs sampling is again used; the difference is that the coin-toss outcome s and the topic z must be sampled simultaneously. Accordingly, the posterior probability of a tag t sampled from a user-based topic z and the posterior probability of a tag t sampled from a general document topic z are defined respectively as:

P(z_{t_{di}}, s_{t_{di}} = 0 | t_{d}, t_{u}, z_{-di}, γ, γ_{u}, α_{u}, β) =

\frac{n_{d0}^{-di} + γ_{u}}{n_{d0}^{-di} + n_{d1}^{-di} + γ_{u} + γ} \cdot \frac{n_{d0z_{t_{di}}}^{-di} + n_{uz_{t_{di}}} + α_{u}}{Σ_{z} (n_{d0z}^{-di} + n_{uz} + α_{u})} \cdot \frac{n_{z_{t_{di}} t_{di}}^{-di} + n_{z_{t_{di}} t_{di}}^{u} + β}{Σ_{t} (n_{z_{t_{di}} t}^{-di} + n_{z_{t_{di}} t}^{u} + β)}

P(z_{t_{di}}, s_{t_{di}} = 1 | t_{d}, t_{u}, z_{-di}, γ, γ_{u}, α, β) =

\frac{n_{d1}^{-di} + γ}{n_{d0}^{-di} + n_{d1}^{-di} + γ_{u} + γ} \cdot \frac{n_{d1z_{t_{di}}}^{-di} + α}{Σ_{z} (n_{d1z}^{-di} + α)} \cdot \frac{n_{z_{t_{di}} t_{di}}^{-di} + n_{z_{t_{di}} t_{di}}^{u} + β}{Σ_{t} (n_{z_{t_{di}} t}^{-di} + n_{z_{t_{di}} t}^{u} + β)}

where n_{d0} is the number of times document d has been sampled from the topic distribution based on user interest; n_{d1} is the number of times document d has been sampled from the topic distribution of the general document content; and the superscript u in a count n^{u} indicates that the count is taken over all users. For example,

n_{zt}^{u}

denotes the number of times tag t has been assigned to topic z by all users.
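One sampling step for (b) can be sketched in Python by evaluating the two posteriors for every topic and drawing the pair (s, z) in proportion to them. The function name, argument layout, and toy counts below are assumptions for illustration; all counts are assumed to already exclude the current tag instance, and the user-side sum in the denominator is taken over n_{uz}.

```python
import random


def sample_coin_and_topic(t, n_d0, n_d1, n_d0z, n_d1z, n_uz, n_zt, n_zt_u,
                          alpha, alpha_u, beta, gamma, gamma_u, rng):
    """Draw (s, z) for one tag t of document d from the two posteriors.

    n_d0, n_d1: tags of d attributed to user interest / document topics.
    n_d0z, n_d1z: the same counts split by topic; n_uz: topic counts of the user.
    n_zt, n_zt_u: topic-by-tag counts from documents and aggregated over users.
    """
    n_topics, n_tags = len(n_zt), len(n_zt[0])
    coin_norm = n_d0 + n_d1 + gamma_u + gamma
    user_norm = sum(n_d0z[z] + n_uz[z] + alpha_u for z in range(n_topics))
    doc_norm = sum(n_d1z[z] + alpha for z in range(n_topics))
    weights, outcomes = [], []
    for z in range(n_topics):
        # Probability of tag t under topic z, pooling document and user counts.
        tag_term = (n_zt[z][t] + n_zt_u[z][t] + beta) / (
            sum(n_zt[z]) + sum(n_zt_u[z]) + n_tags * beta)
        # s = 0: the tag is explained by the user's personal interests.
        weights.append((n_d0 + gamma_u) / coin_norm
                       * (n_d0z[z] + n_uz[z] + alpha_u) / user_norm * tag_term)
        outcomes.append((0, z))
        # s = 1: the tag is explained by the document's general topics.
        weights.append((n_d1 + gamma) / coin_norm
                       * (n_d1z[z] + alpha) / doc_norm * tag_term)
        outcomes.append((1, z))
    # random.choices normalizes the weights, so they need not sum to one.
    return rng.choices(outcomes, weights=weights, k=1)[0]


rng = random.Random(1)
s, z = sample_coin_and_topic(
    t=0, n_d0=1, n_d1=2, n_d0z=[1, 0], n_d1z=[0, 2], n_uz=[3, 1],
    n_zt=[[2, 0, 1], [0, 3, 1]], n_zt_u=[[1, 1, 0], [0, 2, 2]],
    alpha=0.5, alpha_u=0.5, beta=0.1, gamma=1.0, gamma_u=1.0, rng=rng)
```

In a full sampler this step would be wrapped in a loop that decrements the counts for the current tag, calls the function, and increments the counts for the sampled (s, z).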

During the parameter estimation above, the algorithm needs to maintain an M × T count matrix (documents × topics), a T × K count matrix (topics × tags), an M × 2 count matrix (documents × coin values), and a |U| × T count matrix (users × topics), where |·| denotes the number of elements in a set. With these counts, the algorithm can easily estimate the topic distribution of documents θ_{dz}, the topic distribution of users θ_{uz}, and the tag distribution of topics φ_{zt} by the following formulas:

θ_{dz} = \frac{n_{dz} + α}{Σ_{z'} (n_{dz'} + α)}

θ_{uz} = \frac{n_{uz} + α_{u}}{Σ_{z'} (n_{uz'} + α_{u})}

φ_{zt} = \frac{n_{zt}^{d} + n_{zt}^{u} + β}{Σ_{t'} (n_{zt'}^{d} + n_{zt'}^{u} + β)}
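The three estimates are smoothed row normalizations of the count matrices, which the following Python sketch makes explicit. The function names and the toy counts are hypothetical, and the sketch reads the user and tag indices in the formulas as u and t throughout.

```python
def normalize_rows(counts, prior):
    """Turn each row of counts into a smoothed probability distribution."""
    result = []
    for row in counts:
        total = sum(row) + prior * len(row)
        result.append([(c + prior) / total for c in row])
    return result


def estimate_parameters(n_dz, n_uz, n_zt_d, n_zt_u, alpha, alpha_u, beta):
    """Point estimates of the document, user, and tag distributions from counts."""
    theta_d = normalize_rows(n_dz, alpha)        # M x T: document topic distributions
    theta_u = normalize_rows(n_uz, alpha_u)      # |U| x T: user topic distributions
    # Tag counts from documents and from users are pooled before smoothing.
    pooled = [[d + u for d, u in zip(row_d, row_u)]
              for row_d, row_u in zip(n_zt_d, n_zt_u)]
    phi = normalize_rows(pooled, beta)           # T x K: tag distribution per topic
    return theta_d, theta_u, phi


# Hypothetical counts: 2 documents, 1 user, 2 topics, 3 tags.
theta_d, theta_u, phi = estimate_parameters(
    n_dz=[[2, 0], [1, 3]], n_uz=[[4, 1]],
    n_zt_d=[[2, 0, 1], [0, 3, 1]], n_zt_u=[[1, 1, 0], [0, 2, 2]],
    alpha=0.5, alpha_u=0.5, beta=0.1)
```

Every returned row sums to one, and the priors keep unseen topics and tags from receiving zero probability.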

A complexity analysis of the UdT model gives a cost that depends on L, the number of iterations of the Gibbs sampling algorithm, on the average number of tags used per user, and on the average number of words per document.

Step 4: Establish a tag recommendation framework based on the UdT model. The language model focuses on extracting keywords from the title or content, whereas the topic-based UdT model starts from learning user interests and recommends according to latent semantic information. The invention combines the two methods to build the framework of the tag recommendation system, as shown in Fig. 2.

Step 4 involves two combination strategies:

(1) Strategy one: UdT1. First, the scores computed by the two models are normalized so that ‖score1‖_{∞} = ‖score2‖_{∞}, where score1 is the score computed by the language model and score2 is the score computed by the UdT model. If a tag t appears in the candidate set of the language model but not in the training set, then score2[t] = 0; conversely, if a tag t appears in the training set but not in the candidate set of the language model, then score1[t] = 0. The final score of tag t is then:

score[t] = λ_{c} · score1[t] + (1 - λ_{c}) · score2[t]

Here λ_{c} is the weight used when adding the scores. Addition is used rather than multiplication because the candidate sets of the two models differ considerably, so many tags have a score1 or score2 value of 0. Adding the scores makes it possible to find tags that appear in only one candidate set, which is the purpose of this combination strategy.
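Strategy one can be sketched in a few lines of Python. The function name `combine_scores`, the dictionary representation, and the toy scores are assumptions for illustration; scores are assumed to be non-negative, and both tables are scaled to a maximum of 1 so that their infinity norms agree.

```python
def combine_scores(score1, score2, weight):
    """Max-normalize both score tables, then mix them linearly.

    score1: tag -> language-model score; score2: tag -> UdT score.
    A tag missing from one table is treated as having score 0 there.
    """
    def max_normalize(scores):
        peak = max(scores.values(), default=0.0)
        return {t: (v / peak if peak > 0 else 0.0) for t, v in scores.items()}

    s1, s2 = max_normalize(score1), max_normalize(score2)
    return {t: weight * s1.get(t, 0.0) + (1 - weight) * s2.get(t, 0.0)
            for t in set(s1) | set(s2)}


# Hypothetical scores; "ontology" is known only to the UdT model.
final = combine_scores({"lda": 4.0, "tagging": 2.0},
                       {"lda": 0.2, "ontology": 0.4}, weight=0.5)
ranking = sorted(final, key=final.get, reverse=True)
```

With multiplication, `tagging` and `ontology` would both score 0 and be indistinguishable; with addition they keep the score contributed by the one model that knows them.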

(2) Strategy two: UdT2. The second combination strategy proposed in the invention first ranks the tags using the following formula:

P (t | u', r') = Σ_{z} P (t | z) avg_{w ∈ r'} P (z | u)

where the auxiliary quantity is defined by a formula that is not reproduced in this text, in which word_{num} denotes the number of words and r' denotes a resource. The top-ranked tags (the top 500) are then re-ranked using the following formula:

P (t | r^{*}) = \frac{N_{d}}{N_{d} + λ} \cdot \frac{tf (t, d)}{N_{d}} + (1 - \frac{N_{d}}{N_{d} + λ}) \cdot \frac{tf (t, D')}{N'_{D}}

where N_{d} is the number of words in the description d of the given resource r^{*}, tf (t, d) is the frequency (number of occurrences) of word t in the description d of r^{*}, N'_{D} is the number of words in the whole data set, and tf (t, D') is the frequency (number of occurrences) of word t in the whole data set D'. λ is the Dirichlet smoothing factor, usually set to the average document length, which here is N'_{D} / |D'|. UdT2 has two advantages: 1. It enlarges the candidate set of the language model by extending keyword-based matching to topic-based matching; for example, the two tags "data mining" and "knowledge engineering" cannot be matched from the keyword perspective but can be matched from the topic perspective. 2. Because the data set is sparse, the UdT model has limited applicability to new users or new resources; re-ranking the results of the UdT model with a simple information retrieval method improves the accuracy of the final tag recommendation.
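A minimal sketch of the re-ranking formula is given below. The function names, the dictionary inputs and the toy numbers are assumptions made for illustration; the lengths N_d and N'_D are passed in as plain numbers so that the sketch does not depend on how they are counted.

```python
def smoothed_tag_prob(tf_d, n_d, tf_coll, n_coll, lam):
    """Dirichlet-smoothed P(t | r*) built from document and collection frequencies.

    tf_d / n_d       : frequency of t in the description d and length of d
    tf_coll / n_coll : frequency of t in the whole data set and its length
    lam              : Dirichlet smoothing factor (assumed > 0)
    """
    weight = n_d / (n_d + lam)
    doc_part = tf_d / n_d if n_d > 0 else 0.0
    return weight * doc_part + (1.0 - weight) * tf_coll / n_coll


def rerank(candidates, doc_tf, n_d, coll_tf, n_coll, lam, top_k=5):
    """Re-rank candidate tags (for example the top 500 from the topic model)."""
    scored = {
        tag: smoothed_tag_prob(doc_tf.get(tag, 0), n_d, coll_tf.get(tag, 0), n_coll, lam)
        for tag in candidates
    }
    return sorted(scored, key=lambda tag: (-scored[tag], tag))[:top_k]


# Toy example (hypothetical counts): a 10-word description in a 1000-word collection.
doc_tf = {"mining": 2}
coll_tf = {"mining": 50, "web": 100, "ontology": 5}
order = rerank(["web", "mining", "ontology"], doc_tf, 10, coll_tf, 1000, lam=10.0)
```

A tag that never occurs in the description still receives a non-zero score from the collection term, which is what lets tags proposed on topical grounds survive the re-ranking.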

Step 5: Verify and analyze the UdT model and the recommendation framework through experiments on real data.

Step 5 comprises:

Step 5-1: Experimental design. This step includes selecting the training set and test set used in the experiments, creating the database tables, defining the performance metrics for evaluating the UdT model, and specifying the baseline methods used for comparison;

Step 5-2: Presentation of the tag recommendation results. This step presents the popular tags and the recommended tags for resources in the data set, the performance metric values of the different tag recommendation methods, and a graphical analysis of how user interest affects the recommendation results;

Step 5-3: Discussion and analysis of the experimental results. This step includes analyzing the effect of the number of topics, comparing the user-based topic model with the general topic model, studying example cases, and listing the manually assigned topic titles.

In one embodiment of the invention, the real data set published by a social bookmark and publication sharing system (see www.bibsonomy.org) is taken as an example to illustrate how the UdT model and the recommendation framework are generated, and how they are used to recommend tags relevant to a given resource and to mine user-specific personalized information.

(1) Analysis of an example social tagging problem. Fig. 3 shows an example of a social tagging problem. The left column of the figure shows 4 different users; each box lists all the tags used by that user, and a standard LDA model applied to all the tags used by the users yields two topics for each user: data mining and machine learning. The right column shows 5 different resources, and the arrows in the middle of the figure indicate which users tagged which resources; for example, user 1 tagged resources 1 and 2, and user 3 tagged resources 2, 4 and 5. Each box on the right shows the content of a resource, which consists of its text and the tags assigned to it. The same basic topic model yields, from the text of each resource, a distribution over the two topics data mining and machine learning. Based on the analysis of this example, the invention aims to use a new user-based tagging model that simultaneously learns, from the assigned tags, each user's specific interest in a particular resource and the latent topic distribution of the resource, and that performs personalized tag recommendation.

(2) Generate the UdT model.

Step (2) also involves the parameter estimation of the UdT model, which is derived as follows.

The Gibbs sampling update to be derived is:

P (z_{ui} | z_{-ui}, t_{u}, α_{u}, β) = \frac{n_{u z_{ui}}^{-ui} + α_{u}}{Σ_{z} (n_{uz}^{-ui} + α_{u})} \cdot \frac{n_{z_{ui} t_{ui}}^{-ui} + β}{Σ_{t} (n_{z_{ui} t}^{-ui} + β)}.

The joint distribution factorizes as:

p (t_{u}, z_{u} | α_{u}, β) = p (t_{u} | z_{u}, β) p (z_{u} | α_{u})

Since the tags t_{u} are generated from a multinomial distribution, we obtain:

p (t_{u} | z_{u}, φ) = Π_{i = 1}^{N_{u}} φ_{z_{ui} t_{ui}} = Π_{z_{u} = 1}^{T} Π_{i = 1}^{V_{u}} φ_{z_{ui} t_{ui}}^{n_{z_{ui} t_{ui}}}

where N_{u} is the number of tags used by user u, φ_{z_{ui}} is a multinomial distribution over tags, and n_{z_{ui} t_{ui}} is the number of times tag t_{ui} is assigned to topic z_{ui}.

Integrating out φ gives:

p (t_{u} | z_{u}, β) = ∫ p (t_{u} | z_{u}, φ) p (φ | β) dφ = ∫ Π_{z_{u} = 1}^{T} \frac{1}{Δ (β)} Π_{i = 1}^{V_{u}} φ_{z_{ui} t_{ui}}^{n_{z_{ui} t_{ui}} + β - 1} dφ_{z} = Π_{z_{u} = 1}^{T} \frac{Δ (n_{zu} + β)}{Δ (β)}

where

n_{zu} = \{ n_{z_{u} t_{u}}^{(i)} \}_{i = 1}^{V_{u}}

and

Δ (β) = \frac{Π_{i = 1}^{k} Γ (β_{i})}{Γ (Σ_{i = 1}^{k} β_{i})}
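The Δ function is a product of Gamma functions divided by the Gamma function of the sum, so it overflows quickly for realistic counts. The sketch below, which is an illustrative assumption rather than the patent's implementation, evaluates log Δ with `math.lgamma` and checks numerically that incrementing one count changes Δ by the factor (n_i + β_i) / Σ_j (n_j + β_j), which is the simplification that count-ratio Gibbs updates rely on.

```python
import math


def log_delta(vec):
    """log of Delta(vec) = prod_i Gamma(vec_i) / Gamma(sum_i vec_i); entries must be > 0."""
    return sum(math.lgamma(x) for x in vec) - math.lgamma(sum(vec))


def delta_ratio(vec, index):
    """Delta(vec with vec[index] + 1) / Delta(vec), computed in log space."""
    bumped = list(vec)
    bumped[index] += 1
    return math.exp(log_delta(bumped) - log_delta(vec))


# Toy vector n + beta (hypothetical values).
counts_plus_prior = [1.5, 2.5]
ratio = delta_ratio(counts_plus_prior, 0)
closed_form = counts_plus_prior[0] / sum(counts_plus_prior)
```

Here `ratio` and `closed_form` agree up to floating-point error, which follows from Γ(x + 1) = x Γ(x).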

Similarly, the following formula can be derived:

p (z_{u} | α_{u}) = Π_{u = 1}^{U} \frac{Δ (n_{u} + α_{u})}{Δ (α_{u})}, n_{u} = {n_{{uz}_{u}}^{(i)}}_{i = 1}^{T}

Multiplying the two formulas above gives:

p (z_{u}, t_{u} | α_{u}, β) = Π_{z_{u} = 1}^{T} \frac{Δ (n_{zu} + β)}{Δ (β)} Π_{u = 1}^{U} \frac{Δ (n_{u} + α_{u})}{Δ (α_{u})}

The following conditional distribution is obtained from the joint distribution:

p (z_{ui} | z_{-ui}, t_{u}, α_{u}, β)

= \frac{p (z_{u}, t_{u} | α_{u}, β)}{p (z_{-ui}, t_{u} | α_{u}, β)} = \frac{p (t_{u} | z_{u}, β)}{p (t_{u} | z_{-ui}, β)} \cdot \frac{p (z_{u} | α_{u})}{p (z_{-ui} | α_{u})} = \frac{Δ (n_{zu} + β) \cdot Δ (n_{u} + α_{u})}{Δ (n_{z, -ui} + β) \cdot Δ (n_{-ui} + α_{u})}

= \frac{\frac{Γ (n_{z_{u} t_{u}}^{(i)} + β)}{Γ (Σ_{i = 1}^{V_{u}} (n_{z_{u} t_{u}}^{(i)} + β))} \cdot \frac{Γ (n_{u z_{u}}^{(i)} + α_{u})}{Γ (Σ_{i = 1}^{T} (n_{u z_{u}}^{(i)} + α_{u}))}}{\frac{Γ (n_{z_{u} t_{u}}^{(i)} + β - 1)}{Γ (Σ_{i = 1}^{V_{u}} (n_{z_{u} t_{u}}^{(i)} + β) - 1)} \cdot \frac{Γ (n_{u z_{u}}^{(i)} + α_{u} - 1)}{Γ (Σ_{i = 1}^{T} (n_{u z_{u}}^{(i)} + α_{u}) - 1)}} = \frac{n_{u z_{ui}}^{-ui} + α_{u}}{Σ_{z} (n_{uz}^{-ui} + α_{u})} \cdot \frac{n_{z_{ui} t_{ui}}^{-ui} + β}{Σ_{t} (n_{z_{ui} t}^{-ui} + β)}
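The conditional distribution derived above can be turned into a single resampling step of a collapsed Gibbs sampler. The following Python is a sketch under stated assumptions: the function names, the in-place count tables and the toy data are hypothetical, the tag is given as an integer index, and the current assignment is removed from the counts before the weights are computed so that the counts play the role of the "-ui" quantities.

```python
import random


def topic_weights(n_uz_row, n_zt, tag, alpha_u, beta):
    """Unnormalized p(z | rest) for one tag of one user.

    n_uz_row : topic counts of this user, excluding the current assignment
    n_zt     : topic x tag counts, excluding the current assignment
    """
    num_topics = len(n_uz_row)
    vocab_size = len(n_zt[0])
    user_total = sum(n_uz_row) + alpha_u * num_topics
    weights = []
    for z in range(num_topics):
        user_part = (n_uz_row[z] + alpha_u) / user_total
        tag_part = (n_zt[z][tag] + beta) / (sum(n_zt[z]) + beta * vocab_size)
        weights.append(user_part * tag_part)
    return weights


def resample_topic(n_uz_row, n_zt, tag, old_z, alpha_u, beta, rng):
    """Remove the old assignment, draw a new topic, and restore the counts."""
    n_uz_row[old_z] -= 1
    n_zt[old_z][tag] -= 1
    weights = topic_weights(n_uz_row, n_zt, tag, alpha_u, beta)
    new_z = rng.choices(range(len(weights)), weights=weights)[0]
    n_uz_row[new_z] += 1
    n_zt[new_z][tag] += 1
    return new_z


# Toy state (hypothetical): one user, 2 topics, 2 tags, current tag 0 assigned to topic 0.
user_counts = [3, 0]
topic_tag_counts = [[2, 1], [0, 0]]
new_topic = resample_topic(user_counts, topic_tag_counts, tag=0, old_z=0,
                           alpha_u=1.0, beta=1.0, rng=random.Random(0))
```

Repeating this step over all tags of all users for L iterations gives the first sampling stage; the count tables are the only state that has to be kept.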

Similarly, integrating out λ gives:

p (s | γ_{u}, γ) = ∫ p (s | λ) p (λ | γ_{u}, γ) dλ = ∫ (Π_{i} λ^{s_{i}} (1 - λ)^{1 - s_{i}}) \cdot \frac{λ^{γ_{u} - 1} (1 - λ)^{γ - 1}}{Beta (γ_{u}, γ)} dλ

= ∫ λ^{n_{d0}} (1 - λ)^{n_{d1}} \cdot \frac{λ^{γ_{u} - 1} (1 - λ)^{γ - 1}}{Beta (γ_{u}, γ)} dλ = \frac{∫ λ^{n_{d0} + γ_{u} - 1} (1 - λ)^{n_{d1} + γ - 1} dλ}{Beta (γ_{u}, γ)}

= \frac{Beta (n_{d0} + γ_{u}, n_{d1} + γ)}{Beta (γ_{u}, γ)}
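The closed form of this integral can be checked numerically. The sketch below is an illustration with assumed function names and toy parameter values: it compares the Beta-function ratio with a midpoint-rule approximation of the integral, and it assumes γ_u ≥ 1 and γ ≥ 1 so that the integrand has no singularity at the endpoints.

```python
import math


def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def coin_marginal(n_d0, n_d1, gamma_u, gamma):
    """Closed form: Beta(n_d0 + gamma_u, n_d1 + gamma) / Beta(gamma_u, gamma)."""
    return math.exp(log_beta(n_d0 + gamma_u, n_d1 + gamma) - log_beta(gamma_u, gamma))


def coin_marginal_numeric(n_d0, n_d1, gamma_u, gamma, steps=2000):
    """Midpoint-rule approximation of the same integral over lambda in (0, 1)."""
    width = 1.0 / steps
    total = 0.0
    for k in range(steps):
        lam = (k + 0.5) * width
        total += lam ** (n_d0 + gamma_u - 1) * (1.0 - lam) ** (n_d1 + gamma - 1)
    return total * width / math.exp(log_beta(gamma_u, gamma))


exact = coin_marginal(2, 1, 1.0, 1.0)
approx = coin_marginal_numeric(2, 1, 1.0, 1.0)
```

With these toy values the closed form is B(3, 2) / B(1, 1) = 1/12, and the numerical approximation agrees with it to several decimal places.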

Next, the update formula for the topic and coin value of a document tag is derived. The following conditional probability holds, where n_{d0} is the number of times a topic of document d is sampled from the user-based topics, and n_{d1} is the number of times a topic of document d is sampled from the general topics. The superscript u in n^{u} indicates that all users are counted; for example, n_{zt}^{u} denotes the total number of times, over all users, that tag t is assigned to topic z.

p (z_{t_{di}}, s_{t_{di}} = 0 | t_{d}, t_{u}, z_{-di}, γ, γ_{u}, α_{u}, β) = p (s_{t_{di}} = 0 | γ, γ_{u}) \cdot p (z_{t_{di}} | s_{t_{di}} = 0, t_{d}, t_{u}, z_{-di}, α_{u}, β)

= p (s_{t_{di}} = 0 | γ, γ_{u}) \cdot \frac{p (z_{t_{d}}, t_{d}, t_{u} | s_{t_{di}} = 0, α_{u}, β)}{p (z_{t_{-di}}, t_{d}, t_{u} | s_{t_{di}} = 0, α_{u}, β)}

= \frac{p (s_{-i}, s_{t_{di}} = 0 | γ, γ_{u})}{p (s_{-i} | γ, γ_{u})} \cdot \frac{p (z_{t_{d}} | s_{t_{di}} = 0, α_{u})}{p (z_{t_{-di}} | s_{t_{di}} = 0, α_{u})} \cdot \frac{p (t_{d}, t_{u} | s_{t_{di}} = 0, z_{t_{d}}, β)}{p (t_{d}, t_{u} | s_{t_{di}} = 0, z_{t_{-di}}, β)}

= \frac{B (n_{d0}^{-di} + γ_{u} + 1, n_{d1}^{-di} + γ)}{B (n_{d0}^{-di} + γ_{u}, n_{d1}^{-di} + γ)} \cdot \frac{Δ (n_{d0 z} + n_{u} + α_{u})}{Δ (n_{d0, -zi} + n_{u} + α_{u})} \cdot \frac{Δ (n_{zu} + n_{zt} + β)}{Δ (n_{zu} + n_{z, -ti} + β)}

= \frac{\frac{(n_{d0}^{-di} + γ_{u})! (n_{d1}^{-di} + γ - 1)!}{(n_{d0}^{-di} + γ_{u} + n_{d1}^{-di} + γ)!}}{\frac{(n_{d0}^{-di} + γ_{u} - 1)! (n_{d1}^{-di} + γ - 1)!}{(n_{d0}^{-di} + γ_{u} + n_{d1}^{-di} + γ - 1)!}} \cdot \frac{\frac{Γ (n_{d0 z_{t}}^{(di)} + n_{u z_{t}}^{(di)} + α_{u})}{Γ (Σ_{i = 1}^{T} (n_{d0 z_{t}}^{(di)} + n_{u z_{t}}^{(di)} + α_{u}))}}{\frac{Γ (n_{d0 z_{t}}^{(di)} + n_{u z_{t}}^{(di)} + α_{u} - 1)}{Γ (Σ_{i = 1}^{T} (n_{d0 z_{t}}^{(di)} + n_{u z_{t}}^{(di)} + α_{u}) - 1)}} \cdot \frac{\frac{Γ (n_{z_{t} t}^{(ui)} + n_{z_{t} t}^{(di)} + β)}{Γ (Σ_{i = 1}^{V_{t}} (n_{z_{t} t}^{(ui)} + n_{z_{t} t}^{(di)} + β))}}{\frac{Γ (n_{z_{t} t}^{(ui)} + n_{z_{t} t}^{(di)} + β - 1)}{Γ (Σ_{i = 1}^{V_{t}} (n_{z_{t} t}^{(ui)} + n_{z_{t} t}^{(di)} + β) - 1)}}

= \frac{n_{d0}^{-di} + γ_{u}}{n_{d0}^{-di} + n_{d1}^{-di} + γ_{u} + γ} \cdot \frac{n_{d0 z_{t_{di}}}^{-di} + n_{u z_{t_{di}}} + α_{u}}{Σ_{z} (n_{d0 z}^{-di} + n_{uz} + α_{u})} \cdot \frac{n_{z_{t_{di}} t_{di}}^{-di} + n_{z_{t_{di}} t_{di}}^{u} + β}{Σ_{t} (n_{z_{t_{di}} t}^{-di} + n_{z_{t_{di}} t}^{u} + β)}

Similarly,

p (z_{t_{di}}, s_{t_{di}} = 1 | t_{d}, t_{u}, z_{-di}, γ, γ_{u}, α_{u}, β)

= \frac{p (s_{-i}, s_{t_{di}} = 1 | γ, γ_{u})}{p (s_{-i} | γ, γ_{u})} \cdot \frac{p (z_{t_{d}} | s_{t_{di}} = 1, α_{u})}{p (z_{t_{-di}} | s_{t_{di}} = 1, α_{u})} \cdot \frac{p (t_{d}, t_{u} | s_{t_{di}} = 1, z_{t_{d}}, β)}{p (t_{d}, t_{u} | s_{t_{di}} = 1, z_{t_{-di}}, β)}

= \frac{B (n_{d0}^{-di} + γ_{u}, n_{d1}^{-di} + γ + 1)}{B (n_{d0}^{-di} + γ_{u}, n_{d1}^{-di} + γ)} \cdot \frac{Δ (n_{d1 z} + n_{u} + α_{u})}{Δ (n_{d1, -zi} + n_{u} + α_{u})} \cdot \frac{Δ (n_{zu} + n_{zt} + β)}{Δ (n_{zu} + n_{z, -ti} + β)}

= \frac{n_{d1}^{-di} + γ}{n_{d0}^{-di} + n_{d1}^{-di} + γ_{u} + γ} \cdot \frac{n_{d1 z_{t_{di}}}^{-di} + n_{u z_{t_{di}}} + α_{u}}{Σ_{z} (n_{d1 z}^{-di} + n_{uz} + α_{u})} \cdot \frac{n_{z_{t_{di}} t_{di}}^{-di} + n_{z_{t_{di}} t_{di}}^{u} + β}{Σ_{t} (n_{z_{t_{di}} t}^{-di} + n_{z_{t_{di}} t}^{u} + β)}
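The two final expressions share the same three-factor shape: a coin factor, a topic factor and a tag factor. The sketch below computes the unnormalized weight of every (s, z) pair for one document tag. It is an illustration only: the function name, the argument layout and the toy counts are assumptions, all counts are assumed to already exclude the current assignment, and the coin factor for s = 1 is taken as (n_{d1} + γ) / (n_{d0} + n_{d1} + γ_u + γ), which is what the Beta-function ratio B(a, b + 1) / B(a, b) evaluates to.

```python
def joint_weights(tag, n_d0, n_d1, n_d0z, n_d1z, n_uz, n_zt_doc, n_zt_user,
                  gamma_u, gamma, alpha_u, beta):
    """Unnormalized p(z, s | rest) for one document tag, keyed by (s, z).

    n_d0, n_d1   : how often this document drew from user-based / general topics
    n_d0z, n_d1z : per-topic versions of the two counts above
    n_uz         : topic counts of the tagging user
    n_zt_doc     : topic x tag counts from documents
    n_zt_user    : topic x tag counts from all users
    """
    num_topics = len(n_uz)
    coin_total = n_d0 + n_d1 + gamma_u + gamma
    coin = {0: (n_d0 + gamma_u) / coin_total, 1: (n_d1 + gamma) / coin_total}
    doc_topic = {0: n_d0z, 1: n_d1z}
    weights = {}
    for s in (0, 1):
        topic_total = sum(doc_topic[s][z] + n_uz[z] + alpha_u for z in range(num_topics))
        for z in range(num_topics):
            topic_part = (doc_topic[s][z] + n_uz[z] + alpha_u) / topic_total
            tag_total = sum(d + u + beta for d, u in zip(n_zt_doc[z], n_zt_user[z]))
            tag_part = (n_zt_doc[z][tag] + n_zt_user[z][tag] + beta) / tag_total
            weights[(s, z)] = coin[s] * topic_part * tag_part
    return weights


# Toy counts (hypothetical): 2 topics, 2 tags, current tag index 0.
weights = joint_weights(
    tag=0, n_d0=1, n_d1=1, n_d0z=[1, 0], n_d1z=[0, 1], n_uz=[1, 1],
    n_zt_doc=[[1, 0], [0, 1]], n_zt_user=[[1, 0], [0, 1]],
    gamma_u=1.0, gamma=1.0, alpha_u=1.0, beta=1.0,
)
```

Dividing each weight by the sum of all weights gives the distribution from which the pair (s, z) would be drawn in one sampling step.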

(3) Select the two combination strategies, UdT1 and UdT2, in the recommendation framework based on the generated UdT model;

(4) Select the training set and test set;

The training and test data used in this embodiment were provided by the competition held at the ECML conference in 2009. The training set consists of all data on the Bibsonomy website before June 1, 2009. It contains 389,009 tagging actions in total, covering 56,386 distinct bookmark resources and 41,874 distinct bibtex resources. A total of 37,998 resources occur only once in the training set. The training set contains 2,271 distinct users and 37,880 distinct tags. A bookmark resource carries 4.234 tags on average, and a bibtex resource carries 3.588 tags on average.

The test set in this embodiment consists of all data on the Bibsonomy website from June 1, 2009 to July 1, 2009. The training data and test data are disjoint, and the test set contains many resources and users that do not occur in the training set, which makes the task of the tag recommendation system harder. The test set contains 26,072 tagging actions in total, of which 8,361 concern bookmark resources and 17,711 concern bibtex resources. Only 1,265 distinct bookmark resources and 2,138 distinct bibtex resources in the test set also appear in the training set; all the others are new resources. A bookmark resource carries 3.9 tags on average, and a bibtex resource carries 4.086 tags on average.

(5) Select the development and runtime environment for the algorithms;

The basic machine learning algorithms were implemented with Microsoft Visual Studio 2005, and all experiments were run on a dual-core server with an Intel Xeon processor (3.0 GHz) and 8 GB of memory, running Windows 2003.

(6) Determine the metrics for measuring tag recommendation performance;

Precision, recall and F-measure are used as the metrics for tag recommendation performance, where p@n, r@n and f@n denote the precision, recall and F-measure when n tags are recommended. For a given user u and a given resource r, the set of correct tags is defined as TAG (u, r) and the set of recommended tags is defined as \tilde{T} (u, r). Precision, recall and F-measure are then, respectively:

precision (\tilde{T} (u, r)) = \frac{1}{| U |} Σ_{u ∈ U} \frac{| TAG (u, r) \cap \tilde{T} (u, r) |}{| \tilde{T} (u, r) |}

recall (\tilde{T} (u, r)) = \frac{1}{| U |} Σ_{u ∈ U} \frac{| TAG (u, r) \cap \tilde{T} (u, r) |}{| TAG (u, r) |}

f-measure (\tilde{T} (u, r)) = \frac{2 \times recall \times precision}{recall + precision}
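These metrics can be computed as in the following sketch. The function name, the input layout (one recommended list and one set of correct tags per test case) and the toy data are assumptions made for illustration; precision and recall are averaged over the test cases first and the F-measure is then computed from the two averages, following the formulas above.

```python
def evaluate_at_n(posts, n):
    """Return (precision, recall, f_measure) when the top n tags are recommended.

    posts: list of (recommended_tags, correct_tags) pairs, where recommended_tags
    is a ranked list and correct_tags is a set.
    """
    if not posts:
        return 0.0, 0.0, 0.0
    precisions, recalls = [], []
    for recommended, correct in posts:
        top = recommended[:n]
        hits = len(set(top) & set(correct))
        precisions.append(hits / len(top) if top else 0.0)
        recalls.append(hits / len(correct) if correct else 0.0)
    precision = sum(precisions) / len(posts)
    recall = sum(recalls) / len(posts)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * recall * precision / (recall + precision)


# Toy evaluation (hypothetical tags) with n = 2.
posts = [
    (["a", "b", "c"], {"a", "c", "d"}),
    (["x", "y"], {"y"}),
]
p_at_2, r_at_2, f_at_2 = evaluate_at_n(posts, 2)
```

Calling the function for n = 1, 3, 5, 7 and 10 gives the p@n and f@n values of the kind reported in the result tables.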

(7) Select the baseline methods for comparison;

The following methods were used in the embodiment and their tag recommendation performance was compared.

1) Recommend tags using the language model;

2) Recommend tags by combining the language model and the ACT model (shown in Fig. 4) using combination strategy one;

3) Recommend tags with the UdT model using combination strategy one;

4) Recommend tags with the UdT model using combination strategy two.

(7) Present the results;

Table 2 shows, for the 5 most popular bibtex resources and the 5 most popular bookmark resources in the Bibsonomy data set, the popular tags of each resource and the top five tags recommended by the UdT model.

Table 2

In the "recommended tags" column, a tag shown in bold is one that the UdT model recommends and that is also a popular tag. Table 2 shows that, for popular resources, the popular tags assigned to them are usually fairly general words, whereas the tag recommendation system using the UdT model usually recommends tags relevant to the resource. Besides the tags identical to the correct answers, many semantically very similar words are also found; for example, "portable" and "ontology" in the second row can stand for "portableontology", "web" in the fifth row is closely related to "web20", and "compare" is similar to "comparison". In general the UdT model recommends tags relevant to the given resource, and it is shown later that the tag recommendation system using the UdT model can also mine user-specific personalized information. The precision of the tag recommendation methods is shown in Table 3 and the F-measure values in Table 4. Table 3 shows that the UdT model improves the performance of existing tag recommendation methods by up to 7.67%.

Table 3

In Table 3, P@1, P@3, P@5, P@7 and P@10 denote the precision when 1, 3, 5, 7 and 10 tags are recommended, respectively.

Table 4

In Table 4, f@1, f@3, f@5, f@7 and f@10 denote the F-measure when 1, 3, 5, 7 and 10 tags are recommended, respectively.

(8) Analysis of the results

The results are analyzed from two aspects. 1. Effect of the number of topics. As in other probabilistic models, the number of topics in the UdT model is set manually. To examine its effect on overall tag recommendation performance, the number of topics in the UdT model was set to 40, 50, 65, 80, 100, 200, 300 and 500 in turn; the F-measure values for the different numbers of topics are given in Table 4. Table 4 shows that the F-measure grows with the number of topics in the UdT model and that the growth gradually slows. In addition, a larger number of topics reduces the time efficiency of the whole algorithm, so there is a trade-off between higher accuracy and lower computational cost. With this in mind, the number of topics was set to 50 in the vast majority of the experiments. 2. User-based topic model vs. general topic model. If a user is new and has no past tagging information, the UdT model cannot obtain the user's latent preferences and degenerates into a topic model similar to LDA (as shown in Fig. 7). The results obtained in the two settings are used to compare the user-based topic model with the traditional general topic model. The results are shown in Fig. 8, where UdT- denotes the method that uses one formula and UdT denotes the method that uses another; neither formula is reproduced in this text. Table 5 shows representative tags and words from the experiments. Each topic is represented by the tags and words with the highest generation probability, and the topic titles were assigned manually.

Table 5

Table 5 also lists some of the topics obtained by the UdT model together with their tags and words. It shows that the tags in a topic are closely related to the words in the resource descriptions and that both can represent the topic.

(8) Case study

Fig. 9-1 to Fig. 9-4 show another example. As Fig. 9-1 shows, in this example 5 users tagged the same resource but used different tags. Fig. 9-4 contains 10 topic distribution charts for the 5 users, two per user: the first is the "user global topic distribution" and the second is the "user local topic distribution". Each chart shows the probability of each numbered topic; a high probability indicates a hot topic that the user uses frequently. This case study shows that the UdT model can indeed mine personalized tagging behavior. The UdT model obtains the topic distribution of the resource (see Fig. 9-1); in this example the most popular topic of the resource is topic #21 ("Data Mining"). The UdT model can further identify whether a tag is a user-specific definition or is related to the overall topic distribution. Fig. 9-3 shows all the tags assigned to the resource: tags made up of words in solid boxes represent the user-based topic distribution, that is, user-specific definitions, while tags made up of words in dashed boxes represent the general topic distribution and are related to the overall topic. Fig. 9-1 shows the 5 topic distributions based on user interest, and Fig. 9-2 shows the topic distribution of the tags used by the users. Fig. 9-4 also reveals some meaningful patterns: the topic distribution of the tags a user uses is strongly correlated with the topic distribution of the user's own interests. Another interesting phenomenon is that expert users prefer more specific tags, whereas general users prefer more general tags. For instance, the resource in the figure is about data mining; user 778 is an "expert" on topic 10 ("data analysis") and used the tags "research" and "mining", whereas user 353 is interested in all kinds of topics and used the tag "visualization", a general word that has little to do with the overall topic of the resource.

Experimental results:

Using steps 1 to 5 of the invention, a personalized tag recommendation system based on the UdT model and comprising the two combination strategies was built, and comparative experiments were carried out. The experimental results show that the topic model based on user tagging (the UdT model) proposed by the invention can effectively mine user interests and improve the accuracy of tag recommendation.

The above embodiments are only intended to illustrate the invention and not to limit it. A person of ordinary skill in the relevant art may make various changes and modifications without departing from the spirit and scope of the invention; all equivalent technical solutions therefore also fall within the scope of the invention, and the scope of patent protection of the invention shall be defined by the claims.

Claims

1. A personalized user tag modeling and recommendation method based on a unified probability model, characterized by comprising the following steps:

S1, collecting statistics on the tagging behavior of users on a social tagging website;

S2, formally defining the user tagging problem;

S3, building a topic model based on user tagging, which is a unified probability model called the UdT model;

S5, verifying the framework of the tag recommendation system.

2. The method according to claim 1, characterized in that step S2 specifically comprises the following steps:

where each element θ_{z} denotes the probability that document d relates to topic z;

3. The method according to claim 2, characterized in that step S3 is specifically:

4. The method according to claim 3, characterized in that the two classes of unknown parameters in the UdT model are estimated as follows: first estimate (a), the posterior distribution over topics z, and use it to estimate the topic distribution θ_{u} in the first generative process; then estimate (b), the posterior distribution over the coin-toss results s and topics z, and use it to obtain the parameters θ, λ, φ and ψ of the second generative process, where ψ is the distribution over words; the first generative process models the topic distribution of user interests, and the second generative process models the topic distribution of the tagged documents.

5. The method according to claim 4, characterized in that, in step S4, the UdT model is combined with a language model to build the framework of the tag recommendation system.

6. The method according to claim 5, characterized in that the UdT model is combined with the language model as follows: