CN108959579B - System for acquiring personalized features of user and document

Info

Publication number: CN108959579B
Application number: CN201810739450.0A
Authority: CN
Inventors: 祁勇
Original assignee: Weifang Jiubao Intelligent Technology Co ltd
Current assignee: Zhu Yanling
Priority date: 2012-06-25
Filing date: 2012-06-25
Publication date: 2021-11-09
Anticipated expiration: 2032-06-25
Also published as: CN103514237A; CN103514237B; CN108959579A

Abstract

The invention provides a method and a system for acquiring personalized features of users and documents. The method automatically updates the personalized features of a user and a document whenever a signal indicates that the user has accessed the document: the personalized features of the user are updated according to the personalized features of the documents the user accesses, and the personalized features of a document are updated according to the personalized features of the users who access it. Based on the acquired personalized features of users and documents, personalized document ranking can be realized in a search engine, and personalized information filtering and screening can be realized in a social network. The invention also provides a system for acquiring the personalized features of users and documents. The method can improve the precision of search engines and the efficiency of information retrieval in social networks. In addition, the method can improve the resistance of web page ranking algorithms to cheating.

Description

System for acquiring personalized features of user and document

The present application is a divisional application of the patent application filed on June 25, 2012, with application number 201210228726.1 and entitled 'A method and a system for acquiring personalized features of a user and a document'.

Technical Field

The invention relates to the field of Internet, in particular to a method and a system for acquiring personalized features of users and documents.

Background

Search engines and social networks are the primary tools for obtaining information on the Internet. The two tools share a common disadvantage: information cannot be filtered and screened according to the personalized features of the user. For example, when different users enter the same keyword in the same search engine, the returned search results are the same regardless of which user submitted the query; when different users establish the same relationship network in the same social network, the information they obtain is the same regardless of which user established the network.

A search engine is an application that uses information retrieval technology to collect, index, and rank web pages on a large scale and presents them to querying users according to the ranking result. The core technology of a search engine is its ranking algorithm, the best known being the PageRank algorithm. The input of this algorithm is the web page link structure, which web page designers construct according to their own subjective intentions. While the link structure adequately reflects the personal preferences of web page designers and their understanding of the relationships between web pages, it does not reflect the personal preferences of the users of the search engine. Users who work in different industries or have different interests usually rate the importance of the same web page differently, but conventional ranking techniques such as PageRank cannot capture this difference: they give every user the same single ranking of web pages, which is a disadvantage of conventional search techniques. One possible technical solution is to improve the search results by taking into account the personalized features of users and web pages, so that the ranking of each web page depends not only on the link relationships between web pages but also on the personalized features of the user submitting the search query and of the queried web pages. Analysis shows that using the personalized features of users and web pages can improve the accuracy of a search engine and reduce the amount of irrelevant information a user has to scan and browse.

Social networks are platforms on the Internet where people communicate with each other. In a social network, a user acquires information through the relationship network the user has established, for example by following other people or adding them as friends and then receiving the information they post. The more people a user follows or adds as friends, the more information the user receives. Users often follow more people or add more friends out of fear of missing important or interesting information. However, once the number of people in a user's relationship network exceeds Dunbar's number (150), social networks such as microblogs and Facebook become services that "bombard" users with information. The reason is that existing social networking technology requires a user to receive all information posted by everyone in the user's relationship network, with no way of selectively receiving it by category, which is a disadvantage of existing social networking technology. One possible technical solution is to make the information a user receives depend not only on the user's relationship network but also on the personalized features of the user and of the information. This helps to filter and screen the mass of information on a social network effectively and improves the efficiency of information retrieval in the social network. For convenience of description, each piece of information a user obtains on a social network (e.g., a microblog post) is treated as a document with a unique network address.

To implement the above two technical solutions, it is necessary to obtain the personalized features of users and web documents. However, this is often difficult on the Internet, for four main reasons. The first is the automatic acquisition of personalized information. It is estimated that there are billions of web pages and billions of users on the Internet today, so maintaining the personalized features of web documents and users manually is impractical; acquiring them automatically is a difficult problem. The second is keeping personalized information up to date. Over time, personal information such as a user's hobbies, workplace, industry, and education level may change, but it is difficult to require most users to update their personalized information in real time. The third is semantic variation in personalized information. Among the personalized features set by users, features expressed in different terms but having the same meaning are difficult to classify effectively. The fourth is the completeness of personalized information. The personal information users provide on websites is generally brief. For example, a user's description of their interests is often just a short list such as music, baseball, or books, and it is difficult to ask users to describe their areas of interest in full.

In summary, how to effectively obtain personalized features of users and documents, and improve precision ratio of search engines and information retrieval efficiency of social networks according to the personalized features is a problem to be solved urgently.

Disclosure of Invention

In view of the problems in the prior art, the present invention aims to provide a method and a system for obtaining personalized features of users and documents, which automatically obtain the personalized features of users and documents and help users filter and screen the information they obtain on the Internet according to those personalized features.

In accordance with the above-mentioned objects, the present invention proposes a method for obtaining personalized features of users and documents, characterized in that,

storing a user set U consisting of a plurality of user identifications and a document set D consisting of a plurality of document identifications in a server accessed to the Internet; storing a feature set K consisting of a plurality of feature identifications;

setting an initial value of a parameter vector for at least one user in the user set U or one document in the document set D in the server;

in the server, the following steps are performed a plurality of times:

receiving a signal that any user $m$ ($m \in U$) accesses any document $n$ ($n \in D$);

reading a parameter vector $U(m) = (uw_{m1}, uw_{m2}, \ldots, uw_{mk}, \ldots, uw_{mL})$ of the user $m$ according to the signal, wherein $uw_{mk}$ represents the degree of correlation of the user $m$ with a feature $k$ ($k \in K$);

reading a parameter vector $D(n) = (dw_{n1}, dw_{n2}, \ldots, dw_{nk}, \ldots, dw_{nL})$ of the document $n$ according to the signal, wherein $dw_{nk}$ represents the degree of correlation of the document $n$ with a feature $k$ ($k \in K$);

applying a parameter vector updating algorithm to update the parameter vectors of the user $m$ and the document $n$; denoting the updated parameter vector of the user $m$ by $U^*(m) = (uw^*_{m1}, uw^*_{m2}, \ldots, uw^*_{mk}, \ldots, uw^*_{mL})$ and the updated parameter vector of the document $n$ by $D^*(n) = (dw^*_{n1}, dw^*_{n2}, \ldots, dw^*_{nk}, \ldots, dw^*_{nL})$, the parameter vector updating algorithm includes:

$U^*(m) = F_1[U(m), D(n)]$;

$D^*(n) = F_2[U(m), D(n)]$;

wherein $F_1(\cdot)$ and $F_2(\cdot)$ are functions that take $U(m)$ and $D(n)$ as arguments.

Compared with the prior art, the method and the system enable personalized document ranking and thereby improve the precision of search engines and the efficiency of information retrieval in social networks. In addition, using the personalized features of web documents can improve the resistance of web page ranking algorithms to cheating.

Drawings

FIG. 1 is a method for representing a parameter vector for each user in a user set U;

FIG. 2 is a method of representing a parameter vector for each document in a document set D;

FIG. 3 is a flow chart of a parameter vector update algorithm for users and documents;

FIG. 4 is a rank vector representation of each document in the document set D;

FIG. 5 is a flowchart of a document rank vector update algorithm;

FIG. 6 is a flow chart of a method for personalized document retrieval based on query vectors and rank vectors;

FIG. 7 is a flow chart of a method for personalized document retrieval based on query vectors and parameter vectors;

FIG. 8 is a block diagram of a system for obtaining personalized features for users and documents;

Detailed Description

The method of the present invention will be described in further detail with reference to the accompanying drawings.

Specific embodiments of the method are described in the following parts. First, the meanings of the user set, the document set, and the feature set are explained, together with the parameter vector representation of users and documents. Next, the parameter vector updating algorithm for users and documents is explained. Then, the rank vector representation of documents and the document ranking algorithm based on document parameter vectors are explained, followed by the personalized document retrieval method based on query vectors. Finally, a system for obtaining personalized features of users and documents is described.

First, the meanings of the user set U, the document set D, and the feature set K are explained.

In a server accessing the Internet, a user set U composed of a plurality of user identifications and a document set D composed of a plurality of document identifications are stored. The user identification is a unique identification code of a user on the Internet and comprises one of a user account number, a mobile phone number, a Cookie identification code, an IP address, an Email address and an instant communication number; the document identification is a unique identification of the document on the internet, such as a URL address of a Web page document. The user set U comprises M elements, and the document set D comprises N elements.

In a server accessed to the Internet, a feature set K consisting of a plurality of feature identifiers is stored, and the feature set K comprises L elements. The features in the feature set K are selected from the features of the users in the user set U and the features of the documents in the document set D. The user and the document use the same feature set K. If the user has a "music" feature, it indicates that the user likes music, and the document has a "music" feature, indicating that the document is related to a musical theme.

The following describes a method of representing a parameter vector of a user and a document. The parameter vector representation method is similar to the vector representation method of the vector space model VSM, namely, the feature item is used as a basic unit of the user feature or the document feature. In the method and the system, a set of the relevancy between the user and each feature is used as a parameter vector of the user, and a set of the relevancy between the document and each feature is used as a parameter vector of the document.

FIG. 1 shows a method for representing the parameter vector of each user in the user set $U$. The parameter vector of any user $m$ ($m \in U$) is $U(m) = (uw_{m1}, uw_{m2}, \ldots, uw_{mk}, \ldots, uw_{mL})$, where $uw_{mk}$ represents the degree of correlation of the user $m$ with the feature $k$ ($k \in K$). In addition, the degrees of correlation of all users in the user set $U$ with the feature $k$ are collected into a vector $(uw_{1k}, uw_{2k}, \ldots, uw_{Mk})$, called the $k$-th user column vector of the user set $U$.

FIG. 2 shows a method for representing the parameter vector of each document in the document set $D$. The parameter vector of any document $n$ ($n \in D$) is $D(n) = (dw_{n1}, dw_{n2}, \ldots, dw_{nk}, \ldots, dw_{nL})$, where $dw_{nk}$ represents the degree of correlation of the document $n$ with the feature $k$ ($k \in K$). In addition, the degrees of correlation of all documents in the document set $D$ with the feature $k$ are collected into a vector $(dw_{1k}, dw_{2k}, \ldots, dw_{Nk})$, called the $k$-th document column vector of the document set $D$.

The degree of correlation is a real number that represents how closely a user or document is related to a feature in the feature set K. If a user or document is more closely associated with music than with sports, we say that the user or document has a high degree of correlation with the music feature and a low degree of correlation with the sports feature. In addition, some candidate features are correlated with one another, so the dimensionality of the feature set K can be reduced by selecting features that are less correlated, which reduces the required server storage space and improves the efficiency of the algorithm. Some features need not be listed directly in the feature set K, because their degrees of correlation can be calculated from those of one or several other features in K.

The following explains how the initial value of the parameter vector of a user or a document is set, using three examples. The initial values are typically restricted to $uw_{mk} \in [0, 1]$ and $dw_{nk} \in [0, 1]$ for any $m \in U$, $n \in D$, and $k \in K$. If no initial value is set for the parameter vector of a user or document, it defaults to the zero vector.

Example 1 is a method of manually setting the initial value of the parameter vector of a user $m$ ($m \in U$) or a document $n$ ($n \in D$). For example, let the total number of features be $L = 5$, the feature set be $K$ = (science, education, finance, music, sports), and $U(m) = (uw_{m1}, uw_{m2}, uw_{m3}, uw_{m4}, uw_{m5}) = (0, 0.9, 0, 1, 0)$. That is, the degree of correlation between the user $m$ and the "education" feature is 0.9, the degree of correlation between the user $m$ and the "music" feature is 1, and the degrees of correlation between the user $m$ and all other features are zero. The initial value of the parameter vector $D(n) = (dw_{n1}, dw_{n2}, \ldots, dw_{nk}, \ldots, dw_{nL})$ of the document $n$ may be set similarly.

Example 2 is a method of setting the initial value of the parameter vector of a user $m$ ($m \in U$). The user $m$ first submits a set of documents $H$ ($H \subseteq D$). Let the parameter vector of each document $r$ ($r \in H$) be $(dw_{r1}, dw_{r2}, \ldots, dw_{rL})$. Then, for each $k \in K$, set $uw_{mk} = (\sigma_1 / s) \cdot \sum_{r \in H} dw_{rk}$ or $uw_{mk} = (\sigma_1 / s) \cdot \sum_{r \in H} \left[ dw_{rk} / \sum_{j \in K} dw_{rj} \right]$, where $s$ is the number of elements in the set $H$ and $\sigma_1$ is a preset positive constant. In a similar way, the user $m$ may instead select a group of users in the user set $U$ from which the initial value of the parameter vector of the user $m$ is calculated.
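The two averaging formulas of Example 2 can be sketched as follows. This is only an illustration: the function name `init_user_vector`, the `normalize` switch, and the decision to treat an all-zero document vector as contributing nothing are assumptions, not part of the original description.

```python
def init_user_vector(doc_vectors, sigma1=1.0, normalize=False):
    """Initial user vector from the documents in H (hypothetical helper).

    doc_vectors: list of equal-length lists, one parameter vector per document r in H.
    normalize=False: uw_mk = (sigma1 / s) * sum_r dw_rk
    normalize=True:  uw_mk = (sigma1 / s) * sum_r dw_rk / (sum_j dw_rj)
    """
    s = len(doc_vectors)
    num_features = len(doc_vectors[0])
    totals = [0.0] * num_features
    for vec in doc_vectors:
        row_sum = sum(vec)
        for k, dw in enumerate(vec):
            if normalize:
                # Assumption: an all-zero document contributes nothing.
                totals[k] += dw / row_sum if row_sum > 0 else 0.0
            else:
                totals[k] += dw
    return [sigma1 / s * t for t in totals]


H = [[0.0, 1.0], [1.0, 1.0]]
plain = init_user_vector(H)
scaled = init_user_vector(H, normalize=True)
```

The normalized variant gives every submitted document the same total weight, so a document with large component values does not dominate the user's initial vector.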

Example 3 is a method of setting the initial value of the parameter vector of a document. A category list is a special kind of document, such as the category pages of a web portal, which typically cover news, music, sports, finance, and technology. We assume that documents under the same category list share some features; for example, documents under the sports list are all related to sports. If a document $n$ ($n \in D$) is listed under a category list $h$ ($h \in D$), the initial value of the parameter vector of the document $n$ is determined by the parameter vector of the category list $h$. For example, for each $k \in K$, set $dw_{nk} = \sigma_2 \cdot dw_{hk}$, where $\sigma_2$ is a preset positive constant.

FIG. 3 is a flow chart of a parameter vector update algorithm for users and documents. The method specifically comprises the following steps of executing in a server accessed to the Internet:

S11. storing a user set $U$ consisting of a plurality of user identifications and a document set $D$ consisting of a plurality of document identifications; storing a feature set $K$ consisting of a plurality of feature identifications;

S12. setting an initial value of a parameter vector for at least one user in the user set $U$ or one document in the document set $D$;

S13. receiving a signal that any user $m$ ($m \in U$) accesses any document $n$ ($n \in D$);

S14. reading a parameter vector $U(m) = (uw_{m1}, uw_{m2}, \ldots, uw_{mk}, \ldots, uw_{mL})$ of the user $m$ according to the signal, wherein $uw_{mk}$ represents the degree of correlation of the user $m$ with a feature $k$ ($k \in K$);

S15. reading a parameter vector $D(n) = (dw_{n1}, dw_{n2}, \ldots, dw_{nk}, \ldots, dw_{nL})$ of the document $n$ according to the signal, wherein $dw_{nk}$ represents the degree of correlation of the document $n$ with a feature $k$ ($k \in K$);

S16. updating the parameter vectors of the user $m$ and the document $n$ by applying a parameter vector updating algorithm; denoting the updated parameter vector of the user $m$ by $U^*(m) = (uw^*_{m1}, uw^*_{m2}, \ldots, uw^*_{mk}, \ldots, uw^*_{mL})$ and the updated parameter vector of the document $n$ by $D^*(n) = (dw^*_{n1}, dw^*_{n2}, \ldots, dw^*_{nk}, \ldots, dw^*_{nL})$, the algorithm includes:

$U^*(m) = F_1[U(m), D(n)]$;

$D^*(n) = F_2[U(m), D(n)]$;

After step S16 is completed, the process returns to step S13.

Here $F_1(\cdot)$ and $F_2(\cdot)$ are functions that take $U(m)$ and $D(n)$ as arguments. The user $m$ denotes any user in the user set $U$ rather than one particular user, and the document $n$ denotes any document in the document set $D$ rather than one particular document. For example, the $i$-th time step S13 is executed, $m = 1023$ and $n = 3428$; the $(i+1)$-th time, $m = 33456$ and $n = 28477$.
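The loop formed by steps S13 to S16 can be sketched as below. The function name `run_updates`, the dictionary-based storage, and the particular `F1`/`F2` used in the usage lines are illustrative assumptions; the description leaves both update functions open at this point.

```python
def run_updates(signals, user_vecs, doc_vecs, F1, F2, num_features):
    """Apply steps S13-S16 once per (m, n) access signal, in order."""
    for m, n in signals:  # S13: user m accessed document n
        # S14 / S15: read the current vectors; unset vectors default to zero.
        U = user_vecs.get(m, [0.0] * num_features)
        D = doc_vecs.get(n, [0.0] * num_features)
        # S16: both new vectors are computed from the pre-update vectors.
        U_new, D_new = F1(U, D), F2(U, D)
        user_vecs[m], doc_vecs[n] = U_new, D_new
    return user_vecs, doc_vecs


# Hypothetical update functions, chosen only to make the sketch runnable.
F1 = lambda U, D: [u + 0.5 * d for u, d in zip(U, D)]
F2 = lambda U, D: [d + 0.5 * u for u, d in zip(U, D)]

users = {1023: [0.0, 1.0]}
docs = {3428: [1.0, 0.0]}
run_updates([(1023, 3428)], users, docs, F1, F2, num_features=2)
```

Computing both results before storing either one keeps the update symmetric: the document is updated from the user's old vector, not from the vector the same signal has just changed.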

In one application example of the method of FIG. 3, for each $k \in K$, $uw^*_{mk}$ is an increasing function of $dw_{nk}$, and $dw^*_{nk}$ is an increasing function of $uw_{mk}$.

In one application example of the method of FIG. 3, for each $k \in K$, $uw^*_{mk}$ and $dw^*_{nk}$ are both decreasing functions of the frequency with which the user $m$ accesses the document set $D$. The frequency is the number of times the user $m$ accesses documents in the document set $D$ within a set period of time, divided by the length of that period.

In one application example of the method of FIG. 3, for each $k \in K$, $uw^*_{mk}$ is a decreasing function of $\sum_{k \in K} dw_{nk}$, and $dw^*_{nk}$ is a decreasing function of $\sum_{k \in K} uw_{mk}$.

In one application example of the method of FIG. 3, the signals are randomly extracted from the Web log of a set time period. For each active user in the user set U, the same number of access signals is extracted from that period as input signals of the method of FIG. 3. Active users are users who access the document set D a set number of times within the set time period. Access signals of inactive users are not used to update the parameter vectors of users and documents with the method of FIG. 3.
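A minimal sketch of this sampling step is given below. It assumes the Web log has already been reduced to `(user, document)` pairs for the chosen period and that "active" means at least `min_visits` accesses; the function name, both thresholds, and the fixed random seed are illustrative assumptions.

```python
import random
from collections import defaultdict


def sample_signals(log, min_visits, per_user, seed=0):
    """Draw the same number of access signals for every active user."""
    if per_user > min_visits:
        raise ValueError("per_user must not exceed min_visits")
    by_user = defaultdict(list)
    for m, n in log:
        by_user[m].append((m, n))
    rng = random.Random(seed)
    sampled = []
    for visits in by_user.values():
        if len(visits) >= min_visits:  # inactive users are skipped entirely
            sampled.extend(rng.sample(visits, per_user))
    rng.shuffle(sampled)  # avoid processing one user's signals in a block
    return sampled


log = [("a", 1), ("a", 2), ("a", 3), ("b", 1)]
signals = sample_signals(log, min_visits=2, per_user=2)
```

Giving each active user the same number of signals limits how much a single very frequent visitor (or an automated client) can influence the document vectors.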

In the method shown in FIG. 3, after the parameter vector updating algorithm has been executed a set number of times $t_1$, the $k$-th user column vector $(uw_{1k}, uw_{2k}, \ldots, uw_{Mk})$ is normalized for each feature $k \in K$; after the parameter vector updating algorithm has been executed a set number of times $t_2$, the $k$-th document column vector $(dw_{1k}, dw_{2k}, \ldots, dw_{Nk})$ is normalized for each feature $k \in K$; here $t_1$ and $t_2$ are positive integers. One execution of the parameter vector updating algorithm means one execution of step S16. Specific application examples of the normalization method follow.

Example 1: The $k$-th user column vector $(uw_{1k}, uw_{2k}, \ldots, uw_{Mk})$ of the user set $U$ is normalized as follows: the set $\{uw_{1k}, uw_{2k}, \ldots, uw_{Mk}\}$ is sorted in descending order, and the element ranked $M_1$ is assigned to [symbol missing]; then, for each $m \in U$, if [condition missing], $uw_{mk}$ is set to 1, and otherwise $uw_{mk}$ is set to [expression missing]. The $k$-th document column vector $(dw_{1k}, dw_{2k}, \ldots, dw_{Nk})$ of the document set $D$ is normalized as follows: the set $\{dw_{1k}, dw_{2k}, \ldots, dw_{Nk}\}$ is sorted in descending order, and the element ranked $N_1$ is assigned to [symbol missing]; then, for each $n \in D$, if [condition missing], $dw_{nk}$ is set to 1, and otherwise $dw_{nk}$ is set to [expression missing]. Here $M_1$ and $N_1$ are preset positive constants.

Example 2: The $k$-th document column vector $(dw_{1k}, dw_{2k}, \ldots, dw_{Nk})$ of the document set $D$ is normalized as follows. First, the set $\{dw_{1k}, dw_{2k}, \ldots, dw_{Nk}\}$ is sorted and, according to the sorting result, divided into $r$ groups with approximately equal numbers of elements, such that for any two groups $a$ and $b$, either every element of group $a$ is greater than or equal to every element of group $b$, or every element of group $a$ is less than or equal to every element of group $b$. The smallest element of each group is taken to form the set $\{s_1, s_2, \ldots, s_r\}$, with $s_1 < s_2 < \ldots < s_r$. Then, for each $n \in D$: if $dw_{nk} < s_1$, $dw_{nk}$ is set to 0; if $s_m \le dw_{nk} \le s_{m+1}$, $dw_{nk}$ is set to $g_1(s_m)$; if $dw_{nk} > s_r$, $dw_{nk}$ is set to 1. Here $g_1(s_m)$ is an increasing function with $g_1(s_m) \in (0, 1)$, for example $g_1(s_m) = s_m / s_r$; $1 \le m < r$; and $r$ is a preset positive integer. The $k$-th user column vector of the user set $U$ can be normalized in the same way.
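The grouping-based normalization of Example 2 might be sketched as follows, using the example choice $g_1(s_m) = s_m / s_r$. The text leaves the boundary case open when a value equals a group minimum, so this sketch assumes the largest index $m < r$ with $s_m \le dw_{nk}$ is used; it also assumes $r \ge 2$, positive values, and distinct group minima. The function name is hypothetical.

```python
from bisect import bisect_right


def bucket_normalize(column, r):
    """Normalize one column vector by r equal-sized groups (Example 2)."""
    ordered = sorted(column)
    n = len(ordered)
    # s[j] is the smallest element of group j + 1 in ascending order.
    s = [ordered[j * n // r] for j in range(r)]
    s_r = s[-1]
    out = []
    for v in column:
        if v > s_r:
            out.append(1.0)
        elif v < s[0]:
            out.append(0.0)
        else:
            # Assumption: pick the largest m < r with s_m <= v.
            m = bisect_right(s[:-1], v) - 1
            out.append(s[m] / s_r)
    return out


normalized = bucket_normalize([1, 2, 3, 4, 5, 6], r=3)
```

Because every value in a group is mapped to the same number, the result depends on rank rather than magnitude, which makes a single inflated component less effective than it would be under division by the maximum.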

In one application example of the method shown in FIG. 3, after step S16 is executed, the method further includes setting $uw_{mk} = uw^*_{mk}$ and $dw_{nk} = dw^*_{nk}$ for each $k \in K$.

In one application example of the method shown in FIG. 3, the method satisfies $uw^*_{mk} \ge uw_{mk}$ and $dw^*_{nk} \ge dw_{nk}$ for each $k \in K$.

In the method of FIG. 3, the type $T$ of the signal is at least one of the following: $T = 1$ indicates that the user $m$ clicks a link to the document $n$; $T = 2$ indicates that the user $m$ types the address of the document $n$; $T = 3$ indicates that the user $m$ marks the document $n$ as liked (e.g., with the Like button on Facebook or a +1 button); $T = 4$ indicates that the user $m$ forwards the document $n$; $T = 5$ indicates that the user $m$ comments on the document $n$; and $T = 6$ indicates that the user $m$ bookmarks the document $n$.

Application example 1

In an application example of the method shown in fig. 3, the parameter vector updating algorithm specifically includes:

$uw^*_{mk} = \beta_1 \cdot uw_{mk} + \lambda_1(n, m, T) \cdot f_1(dw_{nk})$ (for each $k \in K$)

$dw^*_{nk} = \beta_2 \cdot dw_{nk} + \lambda_2(m, n, T) \cdot f_2(uw_{mk})$ (for each $k \in K$)

Here $\lambda_1(n, m, T)$ is the influence coefficient of the document $n$ on the user $m$ for a signal of type $T$, and $\lambda_2(m, n, T)$ is the influence coefficient of the user $m$ on the document $n$ for a signal of type $T$; $\beta_1$ and $\beta_2$ are preset positive constants; $f_1(dw_{nk})$ is an increasing function of $dw_{nk}$, and $f_2(uw_{mk})$ is an increasing function of $uw_{mk}$. For example, $f_1(dw_{nk}) = \sigma_3 \cdot dw_{nk}$ and $f_2(uw_{mk}) = \sigma_4 \cdot uw_{mk}$; or $f_1(dw_{nk}) = \sigma_5 \cdot \{1/[1 + \exp(-dw_{nk})]\}$ and $f_2(uw_{mk}) = \sigma_6 \cdot \{1/[1 + \exp(-uw_{mk})]\}$, where $\sigma_3$, $\sigma_4$, $\sigma_5$, and $\sigma_6$ are preset positive constants.
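The update rule of application example 1 can be written out as a short sketch. The function names and the default identity choice for `f1` and `f2` (that is, $\sigma_3 = \sigma_4 = 1$) are assumptions made for illustration; the influence coefficients are passed in as plain numbers.

```python
import math


def sigmoid(x, sigma=1.0):
    """The second example form of f1 / f2: sigma * 1 / (1 + exp(-x))."""
    return sigma / (1.0 + math.exp(-x))


def update_pair(U, D, lam1, lam2, beta1=1.0, beta2=1.0,
                f1=lambda x: x, f2=lambda x: x):
    """One execution of the update for user vector U and document vector D.

    uw*_mk = beta1 * uw_mk + lam1 * f1(dw_nk)
    dw*_nk = beta2 * dw_nk + lam2 * f2(uw_mk)
    """
    U_new = [beta1 * u + lam1 * f1(d) for u, d in zip(U, D)]
    D_new = [beta2 * d + lam2 * f2(u) for u, d in zip(U, D)]
    return U_new, D_new


U_star, D_star = update_pair([0.0, 1.0], [1.0, 0.0], lam1=0.5, lam2=0.5)
```

With the linear choice each access moves the user towards the document and the document towards the user by a fraction set by the influence coefficients; the sigmoid form bounds the contribution of any single component.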

In application example 1, for each feature $k \in K$, a threshold $dC_k$ is set for the $k$-th document column vector, and if $dw_{nk} \le dC_k$, then $f_1(dw_{nk}) = 0$; for each feature $k \in K$, a threshold $uC_k$ is set for the $k$-th user column vector, and if $uw_{mk} \le uC_k$, then $f_2(uw_{mk}) = 0$. Here $dC_k$ is equal to the component ranked $a_1$-th among the components of the $k$-th document column vector $(dw_{1k}, dw_{2k}, \ldots, dw_{Nk})$; $uC_k$ is equal to the component ranked $a_2$-th among the components of the $k$-th user column vector $(uw_{1k}, uw_{2k}, \ldots, uw_{Mk})$; and $a_1$ and $a_2$ are preset positive integers.
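A small sketch of the rank-based thresholds follows. It assumes that "ranked $a_1$-th" refers to descending order, which the text does not state explicitly; the function names are hypothetical.

```python
def column_threshold(column, rank):
    """Component ranked `rank`-th (1-based) when the column is sorted descending."""
    return sorted(column, reverse=True)[rank - 1]


def gated(value, threshold, f=lambda x: x):
    """f(value), forced to 0 when value does not exceed the threshold."""
    return 0.0 if value <= threshold else f(value)


doc_column = [0.1, 0.9, 0.5]              # k-th document column vector
dC_k = column_threshold(doc_column, 2)    # a1 = 2
contributions = [gated(dw, dC_k) for dw in doc_column]
```

Under this reading only the top $a_1 - 1$ documents for a feature pass that feature on to users, so weak or incidental correlations do not spread.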

In the application example 1, specific implementation methods of the λ 1(n, m, T) and the λ 2(m, n, T) include the following examples:

example 1: let λ 1(n, m, T) and λ 2(m, n, T) be set constants. For example λ 1(n, m, T) ═ c1 and λ 2(m, n, T) ═ c2, where c1 and c2 are set normal numbers, e.g., c1 ═ c2 ═ 0.01.

Example 2: the λ 1(n, m, T) and the λ 2(m, n, T) are each a decreasing function of the frequency with which the user m accesses the document set D. If λ 1(n, m, T) ═ 1/g2[ freq (m) ], λ 2(m, n, T) ═ 1/g2[ freq (m) ], g2(x) is an increasing function. For example, g2(x) is a piecewise function, and when x < a3, g2(x) is 1; when x is larger than or equal to a3, g2(x) is 1+ a4(x-a3), wherein a3 and a4 are preset normal numbers. The freq (m) is the frequency of access to documents in the document set D by the user m.
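The frequency damping of Example 2 is easy to check numerically. The sketch below uses $a_3 = 200$ and $a_4 = 0.01$, the values that appear later in application example 2; the function names are assumptions.

```python
def g2(x, a3=200.0, a4=0.01):
    """Piecewise function: 1 below a3, then growing linearly with slope a4."""
    return 1.0 if x < a3 else 1.0 + a4 * (x - a3)


def freq_coeff(freq, a3=200.0, a4=0.01):
    """lambda = 1 / g2(freq): full weight up to a3, reduced weight above it."""
    return 1.0 / g2(freq, a3, a4)


light_user = freq_coeff(100.0)   # below the threshold: weight 1
heavy_user = freq_coeff(300.0)   # 100 above the threshold: weight about 0.5
```

A user who accesses documents far more often than $a_3$ per period therefore has each individual access count for less, which limits the effect of automated or abusive access patterns.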

Example 3: $\lambda_1(n, m, T) = 1/g_3[\sum_{k \in K} dw_{nk}]$ and $\lambda_2(m, n, T) = 1/g_3[\sum_{k \in K} uw_{mk}]$, where $g_3(x)$ is an increasing function. For example, $g_3(x)$ is a piecewise function: when $x < a_5$, $g_3(x) = 1$; when $x \ge a_5$, $g_3(x) = 1 + a_6 (x - a_5)$, where $a_5$ and $a_6$ are preset positive constants. When calculating $\sum_{k \in K} dw_{nk}$, if $dw_{nk} \le \mathrm{min\_dC}_k$, then $dw_{nk}$ is taken to be 0; when calculating $\sum_{k \in K} uw_{mk}$, if $uw_{mk} \le \mathrm{min\_uC}_k$, then $uw_{mk}$ is taken to be 0; here $\mathrm{min\_dC}_k$ and $\mathrm{min\_uC}_k$ are preset positive constants.

Example 4: $\lambda_1(n, m, T) = d_1(n) \cdot u_2(m)$ and $\lambda_2(m, n, T) = u_1(m) \cdot d_2(n)$, where $d_1(n)$ indicates whether the parameter vector of the document $n$ may be used to update the parameter vectors of users in the user set $U$, $u_2(m)$ indicates whether the parameter vector of the user $m$ may be updated by the parameter vectors of documents in the document set $D$, $u_1(m)$ indicates whether the parameter vector of the user $m$ may be used to update the parameter vectors of documents in the document set $D$, and $d_2(n)$ indicates whether the parameter vector of the document $n$ may be updated by the parameter vectors of users in the user set $U$. $u_1(m)$, $u_2(m)$, $d_1(n)$, and $d_2(n)$ are preset parameters that take the value 0 or 1, where 1 means yes and 0 means no. The purpose of this example is to guard against malicious attacks: some documents (or users) are not allowed to update the parameter vectors of other users (or documents) because they have not passed a reliability check, and the parameter vectors of some important documents (or users) are not allowed to be updated by the parameter vectors of other users (or documents).

Example 5: $\lambda_1(n, m, T) = s_1(T)$ and $\lambda_2(m, n, T) = s_2(T)$, where $T$ is the type of the signal that the user accesses the document, and $s_1(T)$ and $s_2(T)$ are functions of $T$.

Example 6: $\lambda_1(n, m, T)$ is an increasing function of the number of times the document $n$ has been accessed or of its PageRank value, and $\lambda_2(m, n, T)$ is an increasing function of the number of followers (fans) of the user $m$.

Example 7: $\lambda_1(n, m, T)$ and $\lambda_2(m, n, T)$ are each an increasing function of the similarity $\mathrm{sim}(m, n)$ between the parameter vectors of the user $m$ and the document $n$. For example, $\lambda_1(n, m, T) = 1 + c_3 \cdot \mathrm{sim}(m, n)$ and $\lambda_2(m, n, T) = 1 + c_4 \cdot \mathrm{sim}(m, n)$, where $c_3$ and $c_4$ are preset constants greater than or equal to 1, and $\mathrm{sim}(m, n) = \left[\sum_{k \in K} uw_{mk} \cdot dw_{nk}\right] / \left\{ \left[\sum_{k \in K} (uw_{mk})^2\right]^{1/2} \cdot \left[\sum_{k \in K} (dw_{nk})^2\right]^{1/2} \right\}$. The meaning of this example is that the more similar the parameter vectors of a user and a document are, the larger the coefficient with which they "vote" for each other. When $\mathrm{sim}(m, n)$ is calculated, if $dw_{nk} \le \mathrm{min\_dC}_k$, then $dw_{nk}$ is taken to be 0; if $uw_{mk} \le \mathrm{min\_uC}_k$, then $uw_{mk}$ is taken to be 0, where $\mathrm{min\_dC}_k$ and $\mathrm{min\_uC}_k$ are preset positive constants.
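The similarity of Example 7 is the cosine of the angle between the two parameter vectors. A minimal sketch follows; it assumes a single cut-off per side rather than one per feature, and it returns 0 when either vector is all zero, a case the text does not address. The function name and parameter names are illustrative.

```python
import math


def sim(U, D, min_u=0.0, min_d=0.0):
    """Cosine similarity of a user vector and a document vector.

    Components at or below the cut-offs are treated as 0, as in Example 7.
    """
    u = [x if x > min_u else 0.0 for x in U]
    d = [x if x > min_d else 0.0 for x in D]
    dot = sum(a * b for a, b in zip(u, d))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm > 0 else 0.0  # assumption: zero vector -> 0


same_direction = sim([3.0, 4.0], [3.0, 4.0])
orthogonal = sim([1.0, 0.0], [0.0, 1.0])
lam1 = 1 + 3 * same_direction  # c3 = 3, as in application example 2
```

Because the coefficient grows with similarity, an access between a user and a document that already share features reinforces those features more strongly than an access between unrelated ones.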

Example 8: $\lambda_1(n, m, T)$ and $\lambda_2(m, n, T)$ are generated using a combination of at least two of the methods of Examples 1 to 7 above. For example, when $\mathrm{freq}(m) > a_3$:

$\lambda_1(n, m, T) = c_1 \cdot \{1 + c_3 \cdot \mathrm{sim}(m, n)\} \cdot \{1/[1 + a_4 (\mathrm{freq}(m) - a_3)]\} \cdot \{d_1(n) \cdot u_2(m)\} \cdot s_1(T)$

$\lambda_2(m, n, T) = c_2 \cdot \{1 + c_4 \cdot \mathrm{sim}(m, n)\} \cdot \{1/[1 + a_4 (\mathrm{freq}(m) - a_3)]\} \cdot \{u_1(m) \cdot d_2(n)\} \cdot s_2(T)$

In application example 1, after the parameter vector updating algorithm has been executed the set number of times, the $k$-th document column vector $(dw_{1k}, dw_{2k}, \ldots, dw_{Nk})$ and the $k$-th user column vector $(uw_{1k}, uw_{2k}, \ldots, uw_{Mk})$ need to be normalized for each feature $k \in K$.

Application example 2

This is a specific implementation of application example 1. For convenience of explanation, assume that there are two users and three documents on the Internet and that each user and each document has two features, i.e., the user set is $U = \{1, 2\}$, the document set is $D = \{1, 2, 3\}$, and the feature set is $K = \{1, 2\}$. The parameter vectors of user 1 and user 2 are $(uw_{11}, uw_{12})$ and $(uw_{21}, uw_{22})$, respectively, and the parameter vectors of document 1, document 2, and document 3 are $(dw_{11}, dw_{12})$, $(dw_{21}, dw_{22})$, and $(dw_{31}, dw_{32})$, respectively. Here $uw_{mk}$ ($m \in U$, $k \in K$) represents the degree of correlation of the user $m$ with the feature $k$, and $dw_{nk}$ ($n \in D$, $k \in K$) represents the degree of correlation of the document $n$ with the feature $k$.

Assuming that a signal for the user 2 to access the document 3 is received in the server and the signal type T is 1, the parameter vectors of the user 2 and the document 3 are updated according to the following parameter vector update algorithm:

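$uw^*_{21} = \beta_1 \cdot uw_{21} + \lambda_1(3, 2, 1) \cdot dw_{31}$; $uw^*_{22} = \beta_1 \cdot uw_{22} + \lambda_1(3, 2, 1) \cdot dw_{32}$

$dw^*_{31} = \beta_2 \cdot dw_{31} + \lambda_2(2, 3, 1) \cdot uw_{21}$; $dw^*_{32} = \beta_2 \cdot dw_{32} + \lambda_2(2, 3, 1) \cdot uw_{22}$

where $\beta_1 = \beta_2 = 1$; $\lambda_1(3, 2, 1)$ represents the influence coefficient of document 3 on user 2 when the signal type is $T = 1$; and $\lambda_2(2, 3, 1)$ represents the influence coefficient of user 2 on document 3 when the signal type is $T = 1$. For example:

$\lambda_1(3, 2, 1) = c_1 \cdot \{1 + c_3 \cdot \mathrm{sim}(2, 3)\} \cdot \{1/[1 + a_4 (\mathrm{freq}(2) - a_3)]\} \cdot \{d_1(3) \cdot u_2(2)\} \cdot s_1(1)$

$\lambda_2(2, 3, 1) = c_2 \cdot \{1 + c_4 \cdot \mathrm{sim}(2, 3)\} \cdot \{1/[1 + a_4 (\mathrm{freq}(2) - a_3)]\} \cdot \{u_1(2) \cdot d_2(3)\} \cdot s_2(1)$

where $c_1 = c_2 = 0.01$, $c_3 = c_4 = 3$, $\mathrm{sim}(2, 3) = (uw_{21} \cdot dw_{31} + uw_{22} \cdot dw_{32}) / \{[(uw_{21})^2 + (uw_{22})^2]^{1/2} \cdot [(dw_{31})^2 + (dw_{32})^2]^{1/2}\}$, $a_3 = 200$, $a_4 = 0.01$, $d_1(3) = u_2(2) = u_1(2) = d_2(3) = 1$, $s_1(1) = 2$, $s_2(1) = 1$, and $\mathrm{freq}(2) > a_3$.

After the above parameter vector updating algorithm is executed, the following assignments are made: $uw_{21} = uw^*_{21}$, $uw_{22} = uw^*_{22}$, $dw_{31} = dw^*_{31}$, $dw_{32} = dw^*_{32}$.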

After the above parameter vector updating algorithm is executed, the user column vectors (uw11, uw21) and (uw12, uw22) are normalized, and the document column vectors (dw11, dw21, dw31) and (dw12, dw22, dw32) are normalized.

The user column vectors are normalized as follows: for feature k = 1, let temp1 = max(uw11, uw21), then set uw11 = uw11/temp1 and uw21 = uw21/temp1; for feature k = 2, let temp2 = max(uw12, uw22), then set uw12 = uw12/temp2 and uw22 = uw22/temp2.

The document column vectors are normalized as follows: for feature k = 1, let temp1 = max(dw11, dw21, dw31), then set dw11 = dw11/temp1, dw21 = dw21/temp1 and dw31 = dw31/temp1; for feature k = 2, let temp2 = max(dw12, dw22, dw32), then set dw12 = dw12/temp2, dw22 = dw22/temp2 and dw32 = dw32/temp2.
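The column normalization described above can be sketched as a single helper that works for both users and documents. The function name `normalize_columns` and the guard that skips an all-zero column are assumptions made for this illustration; the text does not say how a zero maximum is handled.

```python
def normalize_columns(vectors):
    """Divide every feature column by its maximum, in place.

    vectors: list of parameter vectors, one per user (or one per document),
    all of the same length L (the number of features).
    """
    n_features = len(vectors[0])
    for k in range(n_features):
        temp = max(v[k] for v in vectors)  # temp_k = max over the k-th column
        if temp > 0:  # assumption: leave an all-zero column untouched
            for v in vectors:
                v[k] = v[k] / temp
    return vectors

# Hypothetical values for the two users of the example.
users = [[0.5, 0.2], [1.0, 0.4]]
print(normalize_columns(users))
```

After the call, the largest entry in every non-zero column is 1, so relevance values of different features stay on a comparable scale.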

FIG. 4 is a method of rank vector representation for each document in the document set D.

The core technology of search engines is the ranking algorithm, the most notable of which is the PageRank algorithm. The standard PageRank algorithm can be expressed as follows.

(1)

Here the set T is the set of web pages linking in to web page p (p ∈ D), and C(i) is the number of link-out web pages of web page i (i ∈ T); d represents the probability that a user accesses web page p through links on other web pages; 1 - d represents the probability that a user does not access web page p via a link on another web page (e.g., by typing in a URL address), with d ∈ (0, 1); PR(p) represents the ranking value of web page p in the document set D, and N represents the number of web pages in the document set D. In addition, the initial ranking value of each web page is set to 1/N. Here, each element of the document set D is a web page.
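As a point of reference, one widely used normalized form of PageRank that matches the symbols defined above (T, C(i), d, N) is sketched below. Whether formula (1) divides the constant term by N is an assumption here, chosen because it is consistent with the initial ranking value 1/N.

```latex
PR(p) = \frac{1-d}{N} + d \sum_{i \in T} \frac{PR(i)}{C(i)}
```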

The disadvantage of the standard PageRank algorithm is that each web page on the Internet has only one ranking value; that is, the algorithm assumes that every user evaluates the importance of a given web page in the same way.

The traditional PageRank value is extended; that is, the one-dimensional ranking value PR(p) of any document p in the document set D is extended into a multi-dimensional ranking vector based on domain features. Let the rank vector of any document p (p ∈ D) be [PR(p, 1), PR(p, 2), ..., PR(p, k), ..., PR(p, L)], where PR(p, k) represents the ranking value of document p in the document set D under feature k (k ∈ K). Collecting the ranking values of all documents under feature k ∈ K into one vector yields the k-th ranking vector of the document set D.

FIG. 5 is a flowchart of a document rank vector update algorithm. Let the document set D contain at least two document subsets S and E, where each document in subset S contains at least one link to other documents in the document set D, and each document in subset E is pointed to by a link contained in at least one document of subset S; and S ∪ E ≠ D, S ∩ E ≠ Φ, where Φ is the empty set. The rank vector update algorithm is therefore as follows: the ranking value of any document p in the document set D under feature k (k ∈ K) is a function of the ranking value of each link-in document of document p under feature k and of the relevance of that link-in document to feature k.

The ranking vector update algorithm includes the following two specific application examples.

Example 1: The ranking value of any document p (p ∈ D) in the document set D under feature k ∈ K is defined as:

（2）

where the set T is the link-in document set of document p; d represents the probability that a user accesses document p through links in other documents; 1 - d represents the probability that a user does not access document p via a link in another document (e.g., by typing in a URL address), with d ∈ (0, 1); PR(i, k) represents the ranking value of document i under feature k (k ∈ K); dwik represents the relevance of document i to feature k (k ∈ K); and N is the number of documents in the document set D. In addition, for each document i ∈ D and each feature k ∈ K, the initial ranking value PR(i, k) of document i is set to 1/N.

The formula (2) can be expressed in the form of a vector as follows:

（3）

where k ∈ K; the formula involves a column vector of all 1's; A is a non-negative matrix, and A = (aij)N×N is defined as follows:

Example 2: The ranking value of any document p (p ∈ D) in the document set D under feature k ∈ K is defined as:

（4）

where the set T is the link-in document set of document p; d represents the probability that a user accesses document p through links in other documents; 1 - d represents the probability that a user does not access document p via a link in another document (e.g., by typing in a URL address), with d ∈ (0, 1); PR(i, k) represents the ranking value of document i under feature k (k ∈ K); dwik represents the relevance of document i to feature k (k ∈ K); C(i) represents the number of link-out documents of document i (i ∈ T); and N is the number of documents in the document set D. Further, for each document i ∈ D and each feature k ∈ K, the initial ranking value PR(i, k) of document i is set to 1/N.

The vector form of formula (4) can also be expressed in the form of formula (3), which again involves a column vector of all 1's; the non-negative matrix A = (aij)N×N is defined as follows:

To ensure the validity of formula (3), several restrictions must be imposed on the link relationships among the documents in the document set D. For example, each dangling page (Dangling Page) and every link pointing to it are removed; after the ranking values of the other documents have been calculated, the dangling pages and the links pointing to them are restored, and the ranking values of the dangling pages are calculated according to formula (3).

The solution of formula (3) can be approximated by the power iteration method (Power Method), which calculates the k-th rank vector of the document set D. With the rank vector after the n-th iteration denoted accordingly, the power iteration method comprises the following steps:

R10. selecting any feature k ∈ K;

R11. generating the non-negative matrix A according to formula (2) or formula (4);

R12. setting the initial value of the k-th rank vector of the document set D, and setting n = 0;

R13. executing formula (3), i.e., calculating the rank vector of step n + 1 from the rank vector of step n;

R14. normalizing the rank vector of step n + 1;

R15. judging whether the convergence condition based on ε holds or n > STEP; if so, ending; otherwise, setting n = n + 1 and returning to step R13.

Here ε and STEP are preset positive constants, and the norm used in the formulas denotes the component of a vector with the largest modulus.
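Steps R10 to R15 can be sketched in Python as follows. Several details are assumptions made for this illustration: the update is taken to have the form PR(p, k) = (1 - d)/N + d·∑(i∈T) PR(i, k)·dwik/C(i), in the spirit of Example 2; the matrix entry for a link from document i to document p is therefore taken to be dwik/C(i); normalization divides by the largest component modulus; and convergence is tested on the largest change between successive vectors. The function name `rank_vector` and its parameters are hypothetical.

```python
def rank_vector(links, dw_k, d=0.85, eps=1e-8, max_steps=100):
    """Power iteration for the k-th rank vector of a document set.

    links[i] -- list of indices of the documents that document i links to
    dw_k[i]  -- relevance of document i to the chosen feature k (R10)
    """
    n = len(links)
    # R11: assumed matrix entry a[p][i] = dw_k[i] / C(i) if i links to p, else 0.
    a = [[0.0] * n for _ in range(n)]
    for i, outs in enumerate(links):
        for p in outs:
            a[p][i] = dw_k[i] / len(outs)
    r = [1.0 / n] * n  # R12: initial ranking value 1/N
    for _ in range(max_steps):  # stands in for the n > STEP stopping test
        # R13: rank vector of step n + 1 from the rank vector of step n
        nxt = [(1 - d) / n + d * sum(a[p][i] * r[i] for i in range(n))
               for p in range(n)]
        # R14: normalize by the largest component modulus
        peak = max(abs(x) for x in nxt)
        nxt = [x / peak for x in nxt]
        # R15: stop once successive vectors differ by less than eps
        if max(abs(x - y) for x, y in zip(nxt, r)) < eps:
            return nxt
        r = nxt
    return r

# Hypothetical three-document graph: 0 -> 1, 2; 1 -> 2; 2 -> 0.
print(rank_vector([[1, 2], [2], [0]], dw_k=[1.0, 0.5, 1.0]))
```

Running the function once per feature k gives one rank vector per feature, which is what lets the same document rank differently under different features.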

FIG. 6 is a flow chart of a method for personalized document retrieval based on a query vector and a rank vector. The method comprises the following steps executed in the server:

S10. updating the parameter vectors of the plurality of documents in the document set D and the parameter vectors of the plurality of users in the user set U according to the parameter vector update algorithm; the specific implementation comprises steps S11 to S16 in FIG. 3;

S20. setting an initial value of the ranking vector of each document in the document set D;

S30. under each feature k (k ∈ K), applying the ranking vector update algorithm to iteratively update the k-th ranking vector of the document set D, i.e., updating the ranking vector of each document in the document set D;

S40. receiving a query vector set by a user q (q ∈ D) and a search condition submitted by the user q, and extracting a search keyword from the search condition; the search condition can be all information submitted by the user in the search dialog;

S50. retrieving a group of documents Q matching the search keyword in the document set D;

S60. calculating the personalized ranking value of each document in the group of documents Q according to the query vector and the ranking vector of each document in the group of documents Q;

S70. sorting the group of documents Q according to the personalized ranking value, and sending links of a plurality of documents in the group of documents Q to the user q according to the sorting result.

In the method of FIG. 6, let the query vector of user q be (swq1, swq2, ..., swqk, ..., swqL), where swqk represents the ranking value of the queried documents in the document set D under feature k (k ∈ K), and swqk ∈ [0, 1]. Examples of how the query vector may be set are as follows.

The first is that the user q selects features from the feature set K and sets the ranking value of the queried documents for each of them, for example, setting swq2 = 0.00023 and swq6 = 0.00061, with all other vector components equal to 0.

The second is that the user q submits a set of document identifications Sq. The ranking vector of document r (r ∈ Sq) is [PR(r, 1), PR(r, 2), ..., PR(r, k), ..., PR(r, L)], so for each feature k ∈ K the query vector of user q is set to swqk = (σ7/s)·∑(r∈Sq) PR(r, k) or swqk = (σ7/s)·∑(r∈Sq){PR(r, k)/∑(k∈K) PR(r, k)}, where s is the number of elements of the set Sq and σ7 is a preset positive constant.

In an application example of the method of FIG. 6, the personalized ranking value UR(i, q) of document i (i ∈ Q) based on the query vector submitted by user q is defined as the similarity between the query vector (swq1, swq2, ..., swqk, ..., swqL) of user q and the ranking vector [PR(i, 1), PR(i, 2), ..., PR(i, k), ..., PR(i, L)] of document i, for example:

UR(i, q) = {∑(k∈K)[PR(i, k)·swqk]}/{[∑(k∈K)(PR(i, k))^2]^(1/2)·[∑(k∈K)(swqk)^2]^(1/2)}

where PR(i, k) represents the ranking value of document i in the document set D under feature k (k ∈ K), and swqk represents the ranking value of the queried documents in the document set D under feature k (k ∈ K). When UR(i, q) is calculated, for any k ∈ K, PR(i, k) is taken as 0 if PR(i, k) < min_PR, and swqk is taken as 0 if swqk < min_SW. min_PR and min_SW are preset positive constants.
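The personalized ranking value and the resulting ordering of the matched documents can be sketched as follows. The function names `personalized_rank` and `rank_documents`, the dictionary layout, and the default thresholds of 0 are assumptions made for this illustration.

```python
import math

def personalized_rank(pr, sw, min_pr=0.0, min_sw=0.0):
    """UR(i, q): cosine similarity of a rank vector and a query vector.

    Components below min_pr / min_sw are treated as 0 before the similarity
    is computed, as described for min_PR and min_SW.
    """
    pr = [x if x >= min_pr else 0.0 for x in pr]
    sw = [x if x >= min_sw else 0.0 for x in sw]
    dot = sum(p * s for p, s in zip(pr, sw))
    norm = math.sqrt(sum(p * p for p in pr)) * math.sqrt(sum(s * s for s in sw))
    return dot / norm if norm else 0.0

def rank_documents(matched, query, **thresholds):
    """Sort document ids by descending personalized ranking value.

    matched: dict mapping document id -> rank vector [PR(i, 1), ..., PR(i, L)]
    """
    return sorted(matched,
                  key=lambda i: personalized_rank(matched[i], query, **thresholds),
                  reverse=True)

# Hypothetical matched set Q and a query that only cares about feature 2.
matched = {"a": [0.9, 0.1], "b": [0.1, 0.9]}
print(rank_documents(matched, [0.0, 1.0]))
```

Because the similarity is scale-invariant, only the direction of the query vector matters; the thresholds mainly suppress noise from very small components.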

FIG. 7 is a flow chart of a method for personalized document retrieval based on query vectors and parameter vectors. The method comprises the following steps executed in the server:

A10. updating the parameter vectors of the plurality of documents in the document set D and the parameter vectors of the plurality of users in the user set U according to the parameter vector updating algorithm; the specific implementation method comprises the steps from S11 to S16 in FIG. 3;

A20. receiving a query vector set by a user q (q ∈ D) and a search condition submitted by the user q, and extracting a search keyword from the search condition; the search condition can be all information submitted by the user in the search dialog;

A30. retrieving a set of documents Q matching the search keyword in the document set D;

A40. calculating a personalized ranking value of each document in the set of documents Q according to the query vector and the parameter vector of each document in the set of documents Q;

A50. sorting the group of documents Q according to the personalized ranking value, and sending links of a plurality of documents in the group of documents Q to the user q according to the sorting result.

In the method of FIG. 7, let the query vector of user q be (swq1, swq2, ..., swqk, ..., swqL), where swqk represents the relevance of the queried documents to feature k (k ∈ K), and swqk ∈ [0, 1]. The query vector can be set in several ways, as follows.

The first is that the user q selects features from the feature set K and sets a feature relevance for each of them, for example, setting swq2 = 0.8 and swq6 = 0.9, with all other vector components equal to 0.

The second is to assign the parameter vector of the user q to the query vector.

The third is that the user q submits a set Sq of user identifications or document identifications. When Sq consists of user identifications, the parameter vector of user r (r ∈ Sq) is (uwr1, uwr2, ..., uwrL), so for each feature k ∈ K the query vector of user q is set to swqk = (σ8/s)·∑(r∈Sq) uwrk or swqk = (σ8/s)·∑(r∈Sq)[uwrk/(∑(k∈K) uwrk)]. When Sq consists of document identifications, the parameter vector of document r (r ∈ Sq) is (dwr1, dwr2, ..., dwrL), so for each feature k ∈ K the query vector of user q is set to swqk = (σ9/s)·∑(r∈Sq) dwrk or swqk = (σ9/s)·∑(r∈Sq)[dwrk/(∑(k∈K) dwrk)]. Here s is the number of elements of the set Sq, and σ8 and σ9 are preset positive constants.
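Building a query vector from a submitted set Sq can be sketched as follows. The same helper covers both variants of the formula (plain average, or average of vectors that are first divided by their own component sum). The function name `query_from_set`, the `normalize` flag, and the choice to skip a vector whose components sum to 0 are assumptions made for this illustration.

```python
def query_from_set(vectors, sigma=1.0, normalize=False):
    """swqk = (sigma / s) * sum over r in Sq of v_rk  (or of v_rk / sum_k v_rk).

    vectors: parameter vectors of the users or documents identified by Sq
    sigma:   the preset positive constant (sigma8 or sigma9 in the text)
    """
    s = len(vectors)
    n_features = len(vectors[0])
    q = [0.0] * n_features
    for v in vectors:
        total = sum(v) if normalize else 1.0
        if total == 0:  # assumption: an all-zero vector contributes nothing
            continue
        for k in range(n_features):
            q[k] += v[k] / total
    return [sigma / s * x for x in q]

# Hypothetical parameter vectors of two users named in Sq.
print(query_from_set([[0.2, 0.4], [0.6, 0.0]]))
print(query_from_set([[0.2, 0.4], [0.6, 0.0]], normalize=True))
```

The normalized variant gives every member of Sq the same total weight, so a member with uniformly large relevance values does not dominate the query.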

In an application example of the method of FIG. 7, the personalized ranking value UR(i, q) of document i (i ∈ Q) based on the query vector submitted by user q is defined as the similarity between the query vector (swq1, swq2, ..., swqk, ..., swqL) of user q and the parameter vector (dwi1, dwi2, ..., dwiL) of document i, i.e.,

UR(i, q) = [∑k(swqk·dwik)]/{[∑k(swqk)^2]^(1/2)·[∑k(dwik)^2]^(1/2)}.

One application scenario for the method of FIG. 7 is microblogging. After a user publishes a microblog document, the initial value of its parameter vector can be set by multiplying the parameter vector of the publishing user by a preset constant and assigning the result to the parameter vector of the microblog document. When the microblog server receives a signal of a user accessing a microblog document (such as a signal generated by forwarding, commenting, or bookmarking), it reads the parameter vector of the user and the parameter vector of the microblog document according to the user identifier and the microblog document identifier contained in the signal, and then updates both parameter vectors according to the parameter vector update algorithm. When a user opens the microblog, the user can filter and screen the information published by others in his or her relationship network by means of a preset query vector: the similarity between the query vector and the parameter vector of each microblog document received by the user is taken as the personalized ranking value of that document, and the received microblog documents are filtered and screened according to this value. For example, only the microblog documents whose personalized ranking values are in the top 30% are sent to the querying user.
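The microblog scenario can be sketched as follows: a new post inherits a scaled copy of its author's parameter vector, and a reader keeps only the top fraction of received posts by similarity to the query vector. The function names, the dictionary layout, and the rule of always keeping at least one post are assumptions made for this illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def initial_post_vector(author_vector, constant):
    """Initial parameter vector of a new post: author's vector times a preset constant."""
    return [constant * x for x in author_vector]

def filter_top_fraction(posts, query, fraction=0.3):
    """Return the ids of the best-matching posts, most similar first.

    posts: dict mapping post id -> parameter vector of that post
    """
    ranked = sorted(posts, key=lambda pid: cosine(posts[pid], query), reverse=True)
    keep = max(1, math.ceil(len(ranked) * fraction))  # assumption: keep at least one
    return ranked[:keep]

# Hypothetical timeline of four posts and a reader interested in feature 1.
timeline = {"p1": [1.0, 0.0], "p2": [0.8, 0.2], "p3": [0.2, 0.8], "p4": [0.0, 1.0]}
print(filter_top_fraction(timeline, [1.0, 0.0], fraction=0.5))
```

With `fraction=0.3` the same call corresponds to the "top 30%" example in the text.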

FIG. 8 is a block diagram of a system for obtaining personalized features for a user and a document. The system 200 comprises the following functional modules:

user set, document set, and feature set setting module 211: storing a user set U composed of a plurality of user identifications in the user database 220, and storing a document set D composed of a plurality of document identifications in the document database 230; storing a feature set K consisting of a plurality of feature identifiers in the feature database 240;

user and document initial value setting module 212: setting an initial value of a parameter vector for at least one user in the user set U and storing the initial value in a user database 220; setting an initial value of a parameter vector for at least one document in the document set D and storing the initial value in the document database 230; setting an initial value of a ranking vector for each document in the document set D; the initial value of the parameter vector of the user and the document which are not set with the initial value of the parameter vector is a zero vector by default;

user-access-document signal collection module 213: collecting a signal of any user m (m ∈ U) (102) accessing any document n (n ∈ D) and storing the signal in the Web log database 250; the signal of the user m (102) accessing the document n is sent to at least one application server, the application servers including a web portal server 301, a social network server 302, a search engine server 303 and an instant messaging server 304;

user and document parameter vector update module 214: reading the parameter vector of the user m (102) in the user database 220 and the parameter vector of the document n in the document database 230 according to the signal, then applying a parameter vector updating algorithm to update the parameter vectors of the user m (102) and the document n, and finally updating the user database 220 and the document database 230 with the updated parameter vector of the user m (102) and the updated parameter vector of the document n, respectively;

document rank vector update module 215: in the document set D, using the link relationships between documents, the initial value of the ranking vector of each document, and the parameter vector of each document as input data, applying a ranking vector update algorithm to iteratively update the ranking value of each document in the document set D under each feature k (k ∈ K), and updating the document database 230 with the updated ranking values; the link relationships between documents are determined by the document links contained in each document in the document set D;

user query module 216: first, receiving a query vector set by a querying user q and a search condition submitted by the user q, and extracting a search keyword from the search condition; then, retrieving a group of documents Q matching the search keyword in the document set D; then, calculating the personalized ranking value of each document in the group of documents Q according to the query vector and the ranking vector of each document in the group of documents Q, or according to the query vector and the parameter vector of each document in the group of documents Q; and finally, sorting the group of documents Q according to the personalized ranking value, and sending links of a plurality of documents in the group of documents Q to the user q according to the sorting result.

The above-mentioned application examples are only preferred application examples of the present invention, and are not intended to limit the scope of the present invention.

Claims

1. A system for obtaining personalized features of users and documents is characterized by comprising the following functional modules:

a user set, document set and feature set setting module: storing a user set U consisting of a plurality of user identifications in a user database, and storing a document set D consisting of a plurality of document identifications in a document database; storing a feature set K consisting of a plurality of feature identifications in a feature database;

user and document initial value setting module: setting an initial value of a parameter vector for at least one user in the user set U and storing the initial value of the parameter vector in a user database; setting an initial value of a parameter vector for at least one document in the document set D and storing the initial value in a document database; setting an initial value of a ranking vector for each document in the document set D; the initial value of the parameter vector of the user and the document which are not set with the initial value of the parameter vector is a zero vector by default;

a user-access-document signal collection module: collecting signals of any user m (m ∈ U) accessing any document n (n ∈ D), and storing the signals in the Web log database;

user and document parameter vector update module: reading the parameter vector of the user m in the user database and the parameter vector of the document n in the document database according to the identifications of the user m and the document n contained in the signal; then updating the parameter vectors of the user m and the document n through a parameter vector updating algorithm; finally, the user database and the document database are respectively updated by the updated parameter vectors of the user m and the document n;

a document ranking vector update module: in the document set D, taking the link relationships among the documents, the initial value of the ranking vector of each document and the parameter vector of each document as input data, applying a ranking vector update algorithm, iteratively updating the ranking value of each document in the document set D under each feature k (k ∈ K), and applying the updated ranking values to update the document database; the link relationships among the documents are determined by the document links contained in each document in the document set D;

a user query module: first, receiving a query vector set by a querying user q (q ∈ D) and a search condition submitted by the user q, and extracting a search keyword from the search condition; then, retrieving a group of documents Q matching the search keyword in the document set D; then, calculating the personalized ranking value of each document in the group of documents Q according to the query vector and the ranking vector of each document in the group of documents Q, or calculating the personalized ranking value of each document in the group of documents Q according to the query vector and the parameter vector of each document in the group of documents Q; finally, sorting the group of documents Q according to the personalized ranking value, and sending links of a plurality of documents in the group of documents Q to the user q according to the sorting result;

reading a parameter vector U(m) = (uwm1, uwm2, ..., uwmk, ..., uwmL) of the user m according to the signal, collected by the user-access-document signal collection module, of the user m accessing any document n, wherein uwmk represents the relevance of the user m to a feature k (k ∈ K);

U*(m)＝F1[U(m)，D(n)]；

D*(n)＝F2[U(m)，D(n)]；

wherein said F1(·) and said F2(·) are each functions taking said U(m) and said D(n) as arguments;

for each feature k ∈ K, the uwmk* and the dwnk* are each a decreasing function of the frequency with which the user m accesses the document set D;

in an application example of the parameter vector updating algorithm, the specific updating methods of uwmk and dwnk are as follows:

uwmk* = β1·uwmk + λ1(n, m, T)·f1(dwnk) (for each k ∈ K)

dwnk* = β2·dwnk + λ2(m, n, T)·f2(uwmk) (for each k ∈ K)

wherein λ1(n, m, T) is the influence coefficient of the document n on the user m under the type T of the signal, and λ2(m, n, T) is the influence coefficient of the user m on the document n under the type T of the signal; β1 and β2 are preset positive constants; said f1(dwnk) is an increasing function of said dwnk, and said f2(uwmk) is an increasing function of said uwmk; for each k ∈ K, the uwmk* is a decreasing function of ∑(k∈K) dwnk, and the dwnk* is a decreasing function of ∑(k∈K) uwmk; the λ1(n, m, T) and the λ2(m, n, T) are each a decreasing function of the frequency with which the user m accesses the document set D.

2. The system of claim 1, wherein for each feature k ∈ K, the uwmk* is an increasing function of the dwnk, and the dwnk* is an increasing function of the uwmk.

3. The system according to claim 1, characterized in that after the parameter vector update algorithm has been executed a set number of times, for each feature k ∈ K, the k-th user column vector (uw1k, uw2k, ..., uwMk) is normalized; and after the parameter vector update algorithm has been executed a set number of times, for each feature k ∈ K, the k-th document column vector (dw1k, dw2k, ..., dwNk) is normalized.

4. The system of claim 3, wherein λ1(n, m, T) and λ2(m, n, T) are each an increasing function of the similarity between the parameter vector of the user m and the parameter vector of the document n.