CN109242710B

CN109242710B - Social network node influence ordering method and system

Info

Publication number: CN109242710B
Application number: CN201810931729.9A
Authority: CN
Inventors: 熊菲; 杨佳佩; 刘云; 张振江
Original assignee: Beijing Jiaotong University
Current assignee: Beijing Jiaotong University
Priority date: 2018-08-16
Filing date: 2018-08-16
Publication date: 2022-03-11
Anticipated expiration: 2038-08-16
Also published as: CN109242710A

Abstract

The invention provides a social network node influence ranking method and system, relating to the technical field of digital information processing. The method first collects user personal homepage information, user posting information and user pair information, and preprocesses them to form a training set and a test set. A transfer matrix model of posts is then established from the training set, and simulation is performed on the model to obtain the optimal training parameter. Finally, a test transfer matrix for post forwarding is established from the test set and the optimal training parameter, and the test transfer matrix is evaluated to obtain the social network node influence ranking result. The method can uncover potential follow relationships of hidden nodes, so that influence ranking can be carried out on data networks whose dynamic information is incomplete or severely missing, and the influence of social network nodes can be analyzed more accurately when social network data are missing.

Description

Social network node influence ordering method and system

Technical Field

The invention relates to the technical field of digital information processing, in particular to a social network node influence sequencing method and system.

Background

In the information era, analysis of relationships between people is one of the cornerstones for measuring personal value, promoting products, monitoring public opinion and planning related construction. According to statistics, as of March 2017 the Android and Apple application markets together offered about 5 million apps. How to recommend apps to users more effectively is a key problem for improving user experience and increasing enterprise revenue, yet existing app recommendation strategies are based mainly on the user's personal information and do not consider the influence that user groups in a social network exert on app choice, even though users are inevitably affected by their social network friends when downloading or purchasing an app. Whether and how to incorporate social influence among network users into personalized recommendation algorithms is therefore a concern of social network analysis. A social network includes both the offline network of acquaintances around an individual and the social relationships established through online social applications; such networks can be classified into weak-tie networks based on one-way following and strong-tie networks based on mutual friendship.

The study of relationships between people has gradually spread from traditional sociology and psychology to the Internet; online social network analysis is made possible by large-scale data collection, data mining algorithms and related ranking algorithms.

In an online network, the influence of users is reflected in user activity: the actions and thinking of network users are affected and changed by the actions and thinking of others. Highly influential users play a key role in many stages such as network formation, information spreading and forwarding. How to evaluate the influence of social network users and rank them to find the important, influential user nodes is therefore the most basic problem in the study of personal influence in online networks. Node influence and ranking often form the basis for subsequent social network research.

Early methods for analyzing network node influence and ranking relied mainly on offline, non-data-driven means such as questionnaires and telephone surveys, which yield little data, suffer long delays, and have many other problems.

With the rapid development of Internet and personal mobile network technology, the mass data of online social networks now serves as the data basis. Typical approaches analyze the structure of the follow-relationship network, the forwarding records, and the semantics of user activity and content in order to extract the probability that a message is passed between users; they count, in chronological order, the number of successful forwards between user pairs to learn influence, and estimate the forwarding probability between users as the influence by means of a Bernoulli model or a Jaccard index model.

Meanwhile, many excellent algorithms have been proposed and applied. The PageRank algorithm, for example, assigns the same initial score to all nodes and then iterates until the scores are substantially unchanged; the final ranking is based on each node's score, and the larger the score, the greater the node's influence. Because the correction that PageRank applies to solve problems such as non-unique rankings greatly distorts the matrix structure, the improved LeaderRank algorithm was proposed, which reduces the effect of that correction and keeps the result reliable. Social network analysis is a complex problem that cannot be solved by any single method; various factors must be considered together and combined optimally to identify the final social roles and influence.

With growing attention to network security and information leakage risks, traditional algorithms face problems in data collection. Taking data crawled from Sina Weibo as an example, the platform currently does not allow crawling of followee and fan lists in order to protect user information and avoid leakage, so collecting and building the user follow relationships is difficult. Protection of users' historical posts is also becoming more standardized and strict, so much of the users' posting and forwarding information may be unavailable. Yet this information is often required input for classical influence analysis algorithms.

Disclosure of Invention

The invention aims to provide a method and a system for accurately judging the influence and the sequencing of social user nodes by integrating social network user information so as to solve the technical problems of one-sided consideration and inaccurate result of the existing social network node influence analysis method in the background art.

In order to achieve the purpose, the invention adopts the following technical scheme:

in one aspect, the invention provides a social network node influence ranking method, which comprises the following steps:

step S110: collecting user personal homepage information, user posting information and user pair information, and preprocessing the personal homepage information, the user posting information and the user pair information to form a training set and a testing set;

step S120: establishing a transfer matrix model of the posts according to the training set, and carrying out simulation calculation on the transfer matrix model to obtain optimal training parameters;

step S130: and establishing a test transfer matrix for post forwarding according to the test set and the optimal training parameters, calculating the test transfer matrix, and obtaining the social network node influence sequencing result.

Further, the step S110 specifically includes:

collecting personal homepage information, user posting information and user pair information to form a data set; the personal homepage information at least comprises a user ID, the user's total number of posts, the user's active duration, the number of the user's fans and the number of users the user follows;

the user posting information at least comprises the number of times each post was forwarded and the number of comments each post received;

the user pair information comprises the follow relationships between users;

cutting the data set into a training set and a testing set according to requirements, wherein the training set comprises training set user personal information and training set user pair information; the test set includes test set user personal information and test set user pair information.

Further, in step S120, the establishing a transition matrix model of posts according to the training set specifically includes:

step S121: determining a user pair influence factor f₁ for forwarding posts in the training set:

where I_U denotes the number of fans of user U, and S_V denotes the number of users that user V follows;

step S122: determining a user self influence factor f₂ for forwarding posts in the training set:

where X represents the balance that the social network strikes between the importance of forwards and of comments for user U, M_U represents the total number of posts of user U, T_U represents the active duration of user U, Z_U represents the number of times the posts of user U were forwarded, and P_U represents the number of comments on the posts of user U;

step S123: determining a total influence factor f_uv for forwarding posts in the training set:

$f_{uv} = 1 - \exp\left(-(f_1)^{m} \times (f_2)^{1-m}\right)$, where m represents a training parameter, i.e. the trade-off parameter between f₁ and f₂;

step S124: obtaining, by using a K-shell decomposition algorithm, the probability p_uu that user U in the training set forwards his or her own post:

where n represents the number of user nodes, and K_su represents the K-shell value of user U;

step S125: determining the probability P_uv that a post of user U in the training set is forwarded:

step S126: obtaining the training transfer matrix P according to p_uu and P_uv:

further, in step S120, the simulating the transition matrix model to obtain the optimal training parameters specifically includes:

sequentially selecting, in order of average number of forwards per post, C users in the training set together with their real forward counts M_c; a plurality of different values of m correspond to a plurality of different training transfer matrices P, and an independent cascade model is used to carry out a propagation simulation experiment on each P to obtain the expected average forward count F_c of each of the C users;

determining the error MAPE value:

where c ∈ {1, ..., C};

And selecting the training parameter corresponding to the P with the minimum MAPE value as the optimal training parameter.
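The MAPE formula itself did not survive in this text. As a minimal sketch, assuming the standard definition of mean absolute percentage error, MAPE = (1/C) Σ_c |M_c − F_c| / M_c, the selection of the optimal training parameter could look as follows. The function names and the sample numbers are hypothetical and are not taken from the patent.

```python
def mape(actual, predicted):
    """Mean absolute percentage error between real forward counts M_c
    and simulated expected forward counts F_c (standard definition, assumed)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(m - f) / m for m, f in zip(actual, predicted)) / len(actual)


def select_best_m(actual, predicted_by_m):
    """Return the candidate m whose simulated counts give the smallest MAPE.

    predicted_by_m maps each candidate m to the list of F_c values
    obtained from the transfer matrix built with that m.
    """
    return min(predicted_by_m, key=lambda m: mape(actual, predicted_by_m[m]))


# Illustrative numbers only, not data from the patent.
real_counts = [100.0, 50.0, 20.0]
simulated = {
    0.0: [60.0, 30.0, 10.0],
    0.5: [90.0, 55.0, 18.0],
    1.0: [150.0, 80.0, 40.0],
}
best_m = select_best_m(real_counts, simulated)
```

Note that this definition requires every real forward count M_c to be non-zero, which fits the choice of the most-forwarded users as the evaluation sample.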

Further, in step S130, the establishing a test transfer matrix for forwarding posts according to the test set and the optimal training parameters specifically includes:

taking the optimal training parameter as the trade-off parameter between f₁ and f₂, and establishing the test transfer matrix from the test set according to the method of steps S121 to S126.

Further, in step S130, the calculating the test transfer matrix to obtain the social network node influence ranking result specifically includes:

setting the initial value of every entry of the user value vector St to 1, and obtaining a stable convergence value by Markov iteration, calculated as follows:

$S_t = (1 \ldots 1)_{1 \times n} \times P_m$,

repeating the following update until the Euclidean norm of the difference between two successive user value vectors is smaller than the preset precision, then stopping the iteration to obtain the stable convergence algorithm value S:

$S = (S_t)_{1 \times n} \times P_m$,

and taking each entry of the stable convergence algorithm value S as the algorithm value of the corresponding user, and comparing the algorithm values to obtain the social network node influence ranking result.

In another aspect, the present invention provides a social network node influence ranking system, including:

the data preprocessing module is used for collecting user personal homepage information, user posting information and user pair information, and preprocessing the personal homepage information and the user posting information to form a training set and a testing set;

the training module is used for establishing a post transfer matrix model according to the training set, carrying out simulation calculation on the transfer matrix model and obtaining optimal training parameters;

and the test module is used for establishing a test transfer matrix for post forwarding according to the test set and the optimal training parameters, calculating the test transfer matrix and obtaining the social network node influence sequencing result.

Further, the data preprocessing module specifically includes:

the user pair information comprises the follow relationships between users;

Further, the training module comprises:

the user pair influence factor determining unit is used for determining a user pair influence factor for post forwarding according to the number of fans of one user in the user pair and the number of users followed by the other user;

the user self influence factor determining unit is used for determining the user self influence factor forwarded by the posts according to the posting total number, the active duration, the forwarded number of the posts and the commented number of the posts of the user;

a total influence factor determining unit, configured to determine a total influence factor for forwarding the post according to a trade-off parameter between the user pair influence factor and the user own influence factor;

the user self-forwarding probability determining unit is used for acquiring the probability of forwarding the self post by the user by utilizing a K-shell decomposition algorithm;

a post forwarded probability determination unit for determining the probability of the post of the user being forwarded by other users;

the transition matrix model establishing unit is used for establishing a transition matrix model of the post according to the probability of the user forwarding the post and the forwarded probability of the post;

the optimal training parameter establishing unit is used for adopting an independent cascade model to carry out propagation simulation experiments on the transfer matrix model respectively to obtain an expected average transfer number, determining an error MAPE value and selecting a training parameter corresponding to the transfer matrix with the minimum MAPE value as the optimal training parameter;

further, the test module includes:

the test transfer matrix establishing unit is used for establishing a test transfer matrix according to the test set and the optimal training parameters;

and the influence sequencing establishing unit is used for calculating the test transfer matrix to obtain the social network node influence sequencing result.

The invention has the beneficial effects that: the method can uncover potential follow relationships of hidden nodes, so that influence ranking analysis can be carried out on data networks whose dynamic information is incomplete or severely missing; it provides a supplementary scheme when common algorithms cannot be applied because of missing data, and analyzes the influence of social network nodes more accurately.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

Fig. 1 is a schematic block diagram of a social network node influence ranking system according to a first embodiment of the present invention.

Fig. 2 is a schematic flow chart of a social network node influence ranking method according to a second embodiment of the present invention.

Fig. 3 is a schematic block diagram of a social network node influence ranking system according to a third embodiment of the present invention.

Fig. 4 is a schematic diagram of a training set K-shell value and a corresponding user transfer condition according to the fourth embodiment of the present invention.

Fig. 5 is a diagram of a calculation result of the optimal transition matrix parameter m according to the fourth embodiment of the present invention.

Fig. 6 is a schematic diagram of a test set K-shell value and a corresponding user transfer condition according to the fourth embodiment of the present invention.

Fig. 7 is a comparison image verification diagram according to the fifth embodiment of the present invention.

Fig. 8 is a Kendall comparison verification diagram according to the fifth embodiment of the present invention.

Fig. 9 is a specific ranking comparison verification diagram according to the fifth embodiment of the present invention.

Fig. 10 is a flowchart illustrating a social network node influence ranking method according to the fourth embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or modules having the same or similar functionality throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

For the convenience of understanding of the embodiments of the present invention, the following description will be further explained by taking specific embodiments as examples with reference to the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.

It will be understood by those of ordinary skill in the art that the figures are merely schematic representations of one embodiment and that the elements or devices in the figures are not necessarily required to practice the present invention.

Example one

As shown in fig. 1, an embodiment of the present invention provides a social network node influence ranking system, where the system includes:

In a first embodiment of the present invention, the data preprocessing module specifically includes:

the user pair information comprises the follow relationships between users;

In practical application, the data preprocessing module in the first embodiment of the present invention is mainly used for obtaining a data set containing three types of data: the first type is personal homepage information, which at least comprises a user ID, the user's total number of posts, the user's level or active duration, the number of the user's fans (if user A follows user B, A is called a fan of B), and the number of users the user follows (if user A follows user B, B is called a followee of A); the second type is user posting information, which at least comprises the forward counts and comment counts of some posts; the third type is the follow relationships among users, which at least comprises the follow relationships among some users;

the data set is given simple preprocessing to produce the required tables: it is first cleaned, for example by filtering out advertisements, and is split into a training set and a test set as required; the following tables are then generated for each set: a personal homepage statistics table, produced by combining the third type of data with the first type and adding two fields for the user's average number of forwards and average number of comments per post; and a user follow-relationship table, cleaned so that both the fan and the followee of every follow pair appear in the user personal information table.

In a first embodiment of the present invention, the training module includes:

in a first embodiment of the present invention, the test module includes:

Example two

As shown in fig. 2, a method for ranking influence of social network nodes by using the system according to the first embodiment of the present invention includes the following steps:

step S110: collecting user personal homepage information, user posting information and user pair information, and preprocessing the personal homepage information and the user posting information to form a training set and a testing set;

In a second embodiment of the present invention, the step S110 specifically includes:

the user pair information comprises the follow relationships between users;

In a second embodiment of the present invention, in the step S120, the establishing a transition matrix model of posts according to the training set specifically includes:

Step S126: according to p_uuAnd P_uvObtaining the training transfer matrix P:

in a second specific embodiment of the present invention, in step S120, the simulating the transition matrix model to obtain the optimal training parameters specifically includes:

Determination of error MAPE value:

where c ∈ {1, ..., C};

In a second embodiment of the present invention, in the step S130, the establishing a test transfer matrix for forwarding posts according to the test set and the optimal training parameters specifically includes:

taking the optimal training parameter as the trade-off parameter between f₁ and f₂, and establishing the test transfer matrix from the test set according to steps S121 to S126.

In a second embodiment of the present invention, in the step S130, the calculating the test transfer matrix to obtain the social network node influence ranking result specifically includes:

$S_t = (1 \ldots 1)_{1 \times n} \times P_m$,

$S = (S_t)_{1 \times n} \times P_m$,

EXAMPLE III

As shown in fig. 3, a third embodiment of the present invention provides a social network node influence ranking system, where the system includes:

the data preprocessing unit 21 acquires the Sina Weibo data set microblogPCU from the network, performs simple preprocessing, and generates the training set and the test set in the form required by this design as four tables: the training set user personal information table, the training set user follow information table, the test set user personal information table and the test set user follow information table;

the first training unit 22 generates the transfer matrix Pm for post forwarding from the training set, where m takes 11 equally spaced sample values from 0 to 1 at intervals of 0.1, each yielding a different transfer matrix Pm;

the second training unit 23 carries out a post propagation simulation experiment on the number of times a post is forwarded for each of the 11 transfer matrices Pm, and selects the best transfer matrix Pm according to the MAPE value to obtain the training value m;

the first test unit 24 generates the transfer matrix P from the test set and the training result m, and generates the ranking of the algorithm of this design;

the second testing unit 25 generates other algorithm ranks according to the test set and other algorithms;

the third testing unit 26 is used for carrying out consistency check according to the ranking result of the algorithm and the ranking results of other algorithms to prove the superiority of the ranking result of the algorithm;

the data preprocessing unit 21 specifically includes:

the data set obtaining subunit 211 obtains the Sina Weibo data set microblogPCU, which was collected from Sina Weibo by Jun Liu et al. on 2015.3.17. The data set mainly comprises 4 files: weibo_user.csv (user personal information), follow-followee.csv (user follow information), user_post.csv (post content information) and post.csv (post content information). weibo_user.csv contains, for 700+ users, information such as user ID, name, gender, level, private information, zip code, number of fans and total number of followees; follow-followee.csv records about 140,000 fan-followee relationship pairs, including users not recorded in weibo_user.csv; user_post.csv and post.csv record the users' post content, post ID, poster ID, forward count, comment count and the like. The data set has undergone simple cleaning to remove interference such as zombie accounts and secondary accounts, but many missing values remain and must be removed manually;

a data set preprocessing subunit 212 performs simple preprocessing to generate the training set and the test set: the user personal information table is processed by adding the average forward count and average comment count, ensuring that the following fields exist for each user: ID, name, user level, number of posts, number of fans, number of followees, and average number of forwards/comments per post; the user follow information table is cleaned so that both the fan and the followee of every follow pair appear in the user personal information table. At this point 4 tables have been generated: the training set user personal information table, the training set user follow information table, the test set user personal information table and the test set user follow information table.
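The two preprocessing operations described for subunit 212 can be illustrated with a minimal sketch. The record layout and the field names (`id`, `poster_id`, `reposts`, `comments`) are assumptions made for illustration; the patent does not specify a data structure.

```python
def clean_follow_pairs(users, follow_pairs):
    """Keep only (fan, followee) pairs whose two endpoints both appear
    in the user personal information table."""
    known_ids = {user["id"] for user in users}
    return [
        (fan, followee)
        for fan, followee in follow_pairs
        if fan in known_ids and followee in known_ids
    ]


def add_post_averages(user, posts):
    """Return a copy of one user record with the average forward and comment
    counts over that user's posts (0.0 when the user has no recorded posts)."""
    own = [p for p in posts if p["poster_id"] == user["id"]]
    n = len(own)
    enriched = dict(user)
    enriched["avg_reposts"] = sum(p["reposts"] for p in own) / n if n else 0.0
    enriched["avg_comments"] = sum(p["comments"] for p in own) / n if n else 0.0
    return enriched


# Illustrative records only; the field names are assumptions.
users = [{"id": "a"}, {"id": "b"}]
posts = [
    {"poster_id": "a", "reposts": 4, "comments": 2},
    {"poster_id": "a", "reposts": 6, "comments": 0},
]
pairs = [("a", "b"), ("b", "c"), ("c", "a")]

cleaned = clean_follow_pairs(users, pairs)
user_a = add_post_averages(users[0], posts)
```

Here the pairs involving the unknown user "c" are dropped, which mirrors the requirement that both the fan and the followee of each pair exist in the personal information table.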

The first training unit 22 specifically includes:

the transfer matrix Pm describes the possible probability of post forwarding; when a forwarding probability is greater than 0, forwarding may occur in the future even if no forwarding record or follow relationship between the users has been observed so far;

the kshell value calculation subunit 221 calculates the kshell values of the training set as follows: first, all nodes of degree 0 in the network are peeled off; then all nodes of degree 1 are peeled off, which changes the degrees of some remaining nodes, so nodes whose degree is now 1 or less are again peeled off, and this is repeated until no node can be peeled; all nodes peeled at this stage form the 1-shell; the procedure is repeated to obtain the 2-shell, ..., k-shell until every node has been peeled, so that each node has an integer kshell value;

as shown in fig. 4, which plots the kshell values against the forward counts of the corresponding users in the training set, the kshell values show no obvious correlation with the forward counts; kshell therefore performs poorly and cannot be used alone as a method for ranking this kind of network.
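The peeling procedure described for the kshell value calculation subunit can be sketched as follows. This is a minimal illustration that assumes the follow network is treated as an undirected graph without self-loops; the function name and the adjacency-dictionary representation are hypothetical choices, not part of the patent.

```python
def k_shell(adjacency):
    """Assign a k-shell value to every node of an undirected graph.

    adjacency maps each node to the set of its neighbours.
    """
    remaining = {node: set(neigh) for node, neigh in adjacency.items()}
    shell = {}
    k = 0
    while remaining:
        # Peel every node whose current degree is at most k.
        peel = [n for n, neigh in remaining.items() if len(neigh) <= k]
        if not peel:
            # Nothing left to peel at this level: move on to the next shell.
            k += 1
            continue
        for node in peel:
            shell[node] = k
            for neighbour in list(remaining[node]):
                remaining[neighbour].discard(node)
            del remaining[node]
    return shell


# Illustrative graph: a triangle a-b-c, a pendant node d attached to a,
# and an isolated node e.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
    "e": set(),
}
shells = k_shell(graph)
```

In this example the isolated node falls in shell 0, the pendant node in shell 1, and the triangle in shell 2.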

The training parameter simulation subunit 222 takes equally spaced values of m from 0 to 1, for example at intervals of 0.1, and carries out the following process once for each of the 11 values;

fuv and Puu compute subunit 223, compute fuv and Puu;

definition 1, user pair influence factor f1:

where Iu represents the number of fans of user u, and Sv represents the number of users that user v follows;

definition 2, user self influence factor f2:

where x is a parameter between 0 and 1 that represents how the social application balances the importance of forwards against comments; without loss of generality it can be set to 0.5;

definition 3, transition probability fuv:

$f_{uv} = 1 - \exp\left\{-(f_1)^{m} \times (f_2)^{1-m}\right\}$

where m lies between 0 and 1 and is sampled at 11 values at intervals of 0.1; it represents the balance between f1 and f2 and must be trained;
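As a small worked illustration of Definition 3, the sketch below evaluates fuv over the 11 sampled values of m. The values chosen for f1 and f2 are placeholders, since the formulas of Definitions 1 and 2 are not reproduced in this text; the function name is likewise hypothetical.

```python
import math


def total_influence(f1, f2, m):
    """f_uv = 1 - exp(-(f1 ** m) * (f2 ** (1 - m))) for a trade-off m in [0, 1]."""
    if not 0.0 <= m <= 1.0:
        raise ValueError("m must lie in [0, 1]")
    return 1.0 - math.exp(-(f1 ** m) * (f2 ** (1.0 - m)))


# The 11 sampled trade-off values 0, 0.1, ..., 1.
m_values = [i / 10 for i in range(11)]

# Placeholder factor values; f1 and f2 would come from Definitions 1 and 2.
f1, f2 = 0.8, 0.2
curve = [total_influence(f1, f2, m) for m in m_values]
```

At m = 0 the result depends only on f2, and at m = 1 only on f1, which is why m acts as the balance between the two factors. For non-negative f1 and f2 the value always lies in [0, 1), so it can be read as a probability-like quantity.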

definition 4, transition probability Puu:

where n is the number of user nodes, and ksu is the kshell value of user u.

Puv calculation subunit 224, calculate Puv;

definition 5, forwarding probability Puv:

a transition matrix P generation subunit 225 for generating a transition probability matrix P, where different m is represented as Pm;

according to Puv and Puu, a transition matrix Pm is obtained:

the second training unit 23 specifically includes:

the data extraction subunit 231 extracts, from the data set, the list of users and their forward counts, selecting the 20 users whose posts have the highest forward counts;

a propagation experiment subunit 232: different values of m give different transfer matrices Pm; for each Pm, an independent cascade propagation experiment is run with each of the 20 users as the single starting point, and each experiment is repeated 10 times to obtain the average value Fc;

the MAPE value calculation subunit 233 calculates the MAPE value for each Pm from Fc and Mc;

where C is the number of users, c is a specific user, and MAPE indicates the error between the predicted data and the actual data;

a training value selecting subunit 234 selects the m corresponding to the minimum MAPE value as the training result;

as shown in fig. 5, which plots the calculation result for the optimal training value m according to an embodiment of the present invention, m takes 11 samples from 0 to 1 at intervals of 0.1, giving 11 transition probability matrices Pm for m = 0, 0.1, ..., 1; the MAPE value is calculated for each, and the m with the minimum MAPE value is taken as the optimal value, which in this embodiment is 0.5.
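The propagation experiment of subunit 232 can be sketched with a generic independent cascade simulation. This is a minimal sketch under stated assumptions: the off-diagonal entries of the transfer matrix are used directly as activation probabilities, the diagonal (self-forwarding) entries are ignored during spreading, and the spread is counted as the number of users activated besides the starting user. The function names are hypothetical.

```python
import random


def independent_cascade(prob, seed, rng):
    """Run one independent-cascade spread from a single seed node.

    prob[u][v] is the probability that u, once active, activates v.
    Each newly active node gets exactly one attempt on each inactive node.
    Returns the number of nodes activated besides the seed.
    """
    active = {seed}
    frontier = [seed]
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p in enumerate(prob[u]):
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active) - 1


def expected_spread(prob, seed, runs=10, rng=None):
    """Average the spread over repeated runs (the text repeats each experiment 10 times)."""
    rng = rng or random.Random(0)
    return sum(independent_cascade(prob, seed, rng) for _ in range(runs)) / runs


# Illustrative deterministic chain 0 -> 1 -> 2, not data from the patent.
chain = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
spread_from_first = expected_spread(chain, 0)
spread_from_last = expected_spread(chain, 2)
```

Running `expected_spread` once per candidate matrix Pm and per selected user yields the Fc values that are then compared with the real counts Mc through the MAPE value.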

The first test unit 24 specifically includes:

a kshell value calculation subunit 241 calculates the kshell values of the test set;

as shown in fig. 6, which plots the kshell values against the forward counts of the corresponding users in the test set according to an embodiment of the present invention, the kshell values perform poorly and cannot be used alone as a method for ranking this kind of network.

The transfer matrix calculation subunit 242 applies the transfer matrix model formulas to the training value m and the test set data to obtain the test set transfer matrix P;

the sorting calculation subunit 243 sets the initial value vector as the full 1 vector, multiplies the transition probability matrix P, and iterates continuously until convergence to obtain an algorithm value and obtain the top 10 ranks. The stable convergence value may be iteratively obtained using a markov approach. Let the iteration initial value be 1, calculate:

$$S_t = (1, \ldots, 1)_{1 \times n} \, P_{n \times n}$$

repeating the following processes until the error delta meets the precision requirement to obtain a stable convergence algorithm value vector S:

$$S_t = (S_t)_{1 \times n} \, P_{n \times n}$$

Δ may be calculated with the two-norm (Euclidean norm): the iteration is considered to have converged when the length of the difference vector between two successive values of St meets the precision requirement.

The second testing unit 25 specifically includes:

degree centrality is used as the representative local influence algorithm for reference, and betweenness centrality and closeness centrality are used as the representative global influence algorithms for reference; the second embodiment therefore applies these three indexes to the test set data and obtains the top ten ranks of each for subsequent comparison.

The third testing unit 26 specifically includes:

a comparison image effect verification subunit 261, which draws check plots of the different algorithms against the real data and compares them; first, after the present algorithm has been applied to the test set, the top 10 ranked users are obtained and their average repost counts are taken from the data set; the user's algorithm value is then taken as the abscissa and the corresponding user's average repost count as the ordinate, and it is observed whether the two are positively correlated or consistent; for the other algorithms, such as the degree centrality, betweenness centrality and closeness centrality indexes, a consistency check plot can likewise be made with the algorithm or index value on the horizontal axis and the corresponding user's repost count on the vertical axis, to observe whether each plot shows a positive correlation or consistency; from the plots it can be seen intuitively whether the method is consistent with the real data and whether it outperforms the other methods;

as shown in fig. 7, the comparison image effect verification diagram of the second embodiment of the present invention includes plots of consistency with the real repost data for the top 10 of the present algorithm (res.myalgo), the top 10 by degree centrality (res.indgcent), the top 10 by closeness centrality (res.closed element) and the top 10 by betweenness centrality (res.betweencent); comparing the plots shows that the closeness centrality index performs very poorly, with no obvious consistency; betweenness centrality and degree centrality are acceptable, since the repost count does not decrease as the index value increases, i.e. the relation is monotonic, but degree centrality lacks resolution, and betweenness centrality has an obvious engineering cost drawback because of its higher complexity; under the present method the algorithm value and the repost count are consistent, so high-influence users are not hidden among users with low algorithm values (considered low in influence ranking) or medium algorithm values (considered medium in influence ranking); the algorithm therefore narrows the screening range for high-influence users to the region of large algorithm values and achieves a good ranking effect;

a comparison kendall test verification subunit 262, which calculates the kendall test between each algorithm and the real data and examines the resulting values (when different methods rank the same sample, the similarity of each pair of rankings can be compared and calculated, and the kendall consistency test is one such method; the calculation mainly considers concordant and discordant pairs: if user a ranks higher than user b under method A and also ranks higher than user b under method B, the pair is concordant and its sign is positive, otherwise the pair is discordant and its sign is negative; the positive and negative signs are summed, and a larger sum indicates more concordant pairs and closer rankings; the larger the kendall test value between a method and the real ranking, the more accurate the ranking obtained by that method); the top 10 ranking obtained by the present algorithm on the test set, the top 10 rankings obtained by the other algorithms, and the top 10 ranking of the real data (the 10 users with the highest repost counts) are compared by the kendall consistency test to obtain kendall test results; each value expresses the consistency between two vectors: the larger the kendall value, the more consistent the ordering of the two vectors, and the larger the kendall coefficient calculated against the real data, the more closely the ranking of the algorithm matches the real data, so that the effectiveness of the method is known numerically;

as shown in fig. 8, which is the comparison kendall test verification diagram of the second embodiment of the present invention, the consistency between closeness centrality (clCent) and the real repost ranking data (realRepo) is poor, and closeness centrality cannot yield a correct ranking result at all; betweenness centrality (bwCent) is difficult and costly to implement in engineering because of its high complexity; the present algorithm differs little from degree centrality (dgCent) and obtains a good correlation trend, so that the screening range for high-influence users can be narrowed.
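The pair counting described for the kendall consistency test can be sketched as follows. This is a minimal illustration, assuming both rankings cover the same set of users and contain no ties; the function name `kendall_score` and the normalisation by the number of pairs are assumptions of this sketch rather than details given by the embodiment.

```python
from itertools import combinations


def kendall_score(rank_a, rank_b):
    """Compare two rankings given as {user: rank position}, 1 = highest.

    Returns (concordant - discordant, that sum divided by the pair count).
    Assumption: both dicts have exactly the same users.
    """
    users = sorted(rank_a)
    concordant = discordant = 0
    for u, v in combinations(users, 2):
        # Same sign of the rank differences means the pair is in the same order.
        sign = (rank_a[u] - rank_a[v]) * (rank_b[u] - rank_b[v])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total = concordant - discordant
    pairs = len(users) * (len(users) - 1) // 2
    return total, (total / pairs if pairs else 0.0)
```

Identical rankings of three users give a sum of 3 and a normalised value of 1.0, while fully reversed rankings give -3 and -1.0.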

a comparison specific ranking verification subunit 263, which outputs the top 10 users and the algorithm/real values for the different algorithms and the real data and compares them in detail; the top 10 ranking obtained by the present algorithm on the test set is listed alongside the top 10 of the other algorithms and of the real data, the listed items including the rank, the user ID, and the algorithm value or real average repost count for each algorithm/real data ranking, and the effect of the algorithm is analyzed from the specific ranking results and algorithm values.

As shown in fig. 9, which is the comparison specific ranking verification diagram of the embodiment of the present invention, the top 3 users in the real data are also the top 3 users in the ranking of the algorithm of the present invention, which indicates that high-influence users have high algorithm values and that the present invention is effective.

a time counting subunit 264, which sets time stamps at the start and end of the run of the embodiment, calculates the time difference, and outputs the total running time of the program so that it can be judged whether the time is acceptable; as can be seen from FIG. 8, analyzing the network of the second embodiment, which contains the interconnections of 700+ users, takes 42 seconds, which is in line with the O(n^2) complexity of the theoretical analysis; the time can be further reduced by optimization, because the second embodiment uses the comparatively slow python language and spends additional time outputting the comparison verification images.

Example four

As shown in fig. 10, a fourth embodiment of the present invention provides a method for ranking influence of social network nodes by using the system described in the third embodiment. The method mainly comprises the following steps:

step 11, acquiring a data set, and simply preprocessing the data set to generate a training set and a test set in a form required by the design;

step 12, generating a transfer matrix Pm for forwarding posts according to the training set, wherein different training parameters m generate different transfer matrices Pm;

step 13, carrying out a simulation experiment of post propagation according to different transfer matrixes Pm, and screening an optimal value in the transfer matrixes Pm;

step 14, generating a transfer matrix P according to the test set and the training result m, and generating the ranking of the algorithm.

The step 11 comprises:

step 111, obtaining a data set, wherein the data comprises 3 types: the first type is personal homepage information, which at least comprises the user ID, the user's total number of posts, the user's level or active duration, the user's number of fans (if user A follows user B, A is called a fan of B), and the user's attention number (if user A follows user B, B is called a followee of A); the second type is user posting information, which at least comprises the repost count and comment count of some of the posts; the third type is the attention relationship among users, which at least comprises the attention relationships among some of the users;

step 112, simply preprocessing the data set to generate the required form: the data set is first cleaned, for example by advertisement filtering, and cut into a training set and a test set as required; the following required tables are then generated for each set: a personal homepage statistical information table, produced by combining the third type of data with the first type of data and adding two items to the table, the user's average repost count and average comment count; and a cleaned user attention relationship table, in which, for each pair of attention information, both the fan and the followed user appear in the user personal information table.

The step 12 specifically includes:

the generation method of the transfer matrix Pm is as follows:

definition 1, user pair influence factor f1: the more fans I a user has, the stronger the user's influence; the larger a user's attention number S, the more sensitive the user is, but the more the influence of any single node on that user is diluted; the parameter is therefore set as:

where Iu is the number of fans of user u, Sv is the attention number of user v, and f1 describes the pairwise influence of user u on user v; the formula takes into account that division by 0 must be avoided and that f1 = 0 would cause subsequent problems;

definition 2, user self-influence factor f2: on the one hand, the activity of the node can be considered, i.e. the number of posts divided by the total active time, where the total active time can be represented by the user level; on the other hand, the quality of the posts can be represented by the repost count and the comment count; the parameter is therefore set as:

The parameter x takes a value between 0 and 1 and represents how the social application balances the weight of reposts against comments; without loss of generality it can be set to 0.5; if several posts of a user have been captured, the average repost count and the average comment count are used; the formula takes into account the division-by-0 problem and the subsequent problems caused by f2 = 0;

definition 3, transition probability fuv: represents the influence of user u on user v, i.e. the probability that v reposts a post of u:

$$f_{uv} = 1 - \exp\{-(f_1)^{m}\,(f_2)^{1-m}\}$$

The exponential form is adopted because it matches the trend that fuv increases as f1 and f2 increase; m is the key training parameter, with a value between 0 and 1, and distributes the weight between the user pair influence factor and the user self-influence factor in the propagation process, i.e. it balances f1 against f2;
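Definition 3 can be sketched directly in code. The following is a minimal illustration in which f1 and f2 are taken as already-computed positive inputs; the function name `transition_influence` and the input checks are assumptions of this sketch.

```python
import math


def transition_influence(f1, f2, m):
    """f_uv = 1 - exp(-(f1**m) * (f2**(1 - m))) for one user pair (u, v).

    Assumption: f1 and f2 have been computed beforehand and are positive,
    in line with the requirement that neither factor be 0.
    """
    if not 0.0 <= m <= 1.0:
        raise ValueError("m must lie in [0, 1]")
    if f1 <= 0 or f2 <= 0:
        raise ValueError("f1 and f2 must be positive")
    return 1.0 - math.exp(-(f1 ** m) * (f2 ** (1.0 - m)))
```

With m = 1 the result depends only on f1, and with m = 0 only on f2, which is the sense in which m balances the two factors.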

definition 4, transition probability Puu: represents the probability that a user points to itself; the more likely a user is to be reposted by other users, the less likely the user is to point to itself; such a user is generally a core user of the network, i.e. a node with a larger kshell decomposition value, so its probability of pointing to itself can be set smaller:

where n is the number of user nodes and Ksu is the kshell value of user u; the larger the kshell value, the closer the node is to the center of the network and the smaller the probability that the node points to itself;

definition 5, forwarding probability Puv: the form of the matrix follows from definition 3 and definition 4, but the definition of a probability transition matrix requires that

$$\sum_{v \neq u} P_{uv} + P_{uu} = 1$$

Thus:

Because Puv is generally too small to facilitate subsequent training, fuv is set to the minimum value when Sv < average(Sv), i.e. when Sv is below the mean of Sv.
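A minimal sketch of one way to satisfy the row constraint of definition 5 is given below. It assumes that the off-diagonal entries of each row are the fuv values scaled proportionally so that they share the probability mass 1 - Puu; this proportional scaling, the function name `build_transition_matrix`, and the fallback for a row without any outgoing influence are illustrative assumptions, not a statement of the formula of definition 5.

```python
def build_transition_matrix(f, p_self):
    """Build a row-stochastic matrix P from pairwise values f[u][v].

    Assumption (illustrative): P[u][v] = (1 - p_self[u]) * f[u][v] / sum_w f[u][w]
    for v != u, and P[u][u] = p_self[u].
    """
    n = len(f)
    P = [[0.0] * n for _ in range(n)]
    for u in range(n):
        row_sum = sum(f[u][v] for v in range(n) if v != u)
        if row_sum <= 0:
            # Hypothetical fallback: a user with no outgoing influence keeps all mass.
            P[u][u] = 1.0
            continue
        P[u][u] = p_self[u]
        for v in range(n):
            if v != u:
                P[u][v] = (1.0 - p_self[u]) * f[u][v] / row_sum
    return P
```

Each row of the result sums to 1, so the output can be used directly as the transition matrix in the subsequent propagation and iteration steps.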

According to Puv and Puu, a transition matrix Pm is obtained:

The step 13 comprises:

performing a propagation simulation experiment of the posts to screen the optimal transfer matrix Pm, wherein the propagation simulation experiment adopts the independent cascade model, and the top 20 users ranked by average repost count are taken from the corresponding data set to obtain a user list and the corresponding real repost counts Mc;

introduction to the independent cascade model: in the independent cascade model each node has two states, activated and not activated, where activated indicates that the node has accepted or propagated some information (e.g. a repost or a like on a microblog) [Libang, Chuya Nu, Von Jian Hua, Xuyao Strong [J]. computer science, 2016, 39(04): 643-656]; in the model, when a node u is infected it tries to infect each neighbor node v with probability Puv; between a pair of users, infection can be attempted only once in one direction; the attempts of u on its different neighbors v do not interfere with one another, nor do the attempts of different users on the same v; once u has tried all of its neighbor users v, the newly infected users v are processed in turn in the same way; an activated node cannot be activated again, i.e. the same user does not repost the same information twice; the information propagation process under the independent cascade model comprises the following steps:

given one or more initial users, these users are infected in turn and become the starting points; if user u is infected, u has one chance to infect each of its friends, each attempt succeeding with probability Puv independently of the others; the larger Puv is, the more likely u is to infect v;

if node w is not infected at time t, all of its infected neighbors that have not yet attempted to infect it try to do so; if an attempt succeeds, node w switches to the infected state at time t + 1;

the process is repeated until all infection attempts are completed, i.e. the maximum infectable range is reached; the infected range is the maximum propagation range of the information from the initial node, and the ranges of repeated runs are averaged;

for the different transfer matrices Pm obtained with different m, propagation experiments are carried out 10 times each using the independent cascade model to obtain the average repost count Fc;
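The propagation steps above can be sketched as a small simulation. The following is a minimal illustration, assuming the transfer matrix is given as a list of rows and that the repost count of a run is the number of users infected besides the starting point; the names `independent_cascade` and `average_spread`, and the use of a seeded `random.Random`, are assumptions of this sketch.

```python
import random
from collections import deque


def independent_cascade(P, seeds, rng):
    """Run one independent cascade on transfer matrix P (list of rows).

    Each newly infected user u gets exactly one attempt on every user v that
    is not yet infected, succeeding with probability P[u][v].
    Returns the number of users infected in addition to the seeds.
    """
    n = len(P)
    infected = set(seeds)
    frontier = deque(seeds)
    while frontier:
        u = frontier.popleft()
        for v in range(n):
            if v == u or v in infected:
                continue  # infected users cannot be infected again
            if rng.random() < P[u][v]:
                infected.add(v)
                frontier.append(v)
    return len(infected) - len(set(seeds))


def average_spread(P, start_user, runs=10, rng=None):
    """Average repost count Fc for one starting user over repeated runs."""
    rng = rng or random.Random(0)  # fixed seed: an assumption for repeatability
    total = sum(independent_cascade(P, [start_user], rng) for _ in range(runs))
    return total / runs
```

Calling `average_spread(P, c)` for each of the 20 starting users c and for each candidate matrix Pm would give the Fc values that are compared with Mc.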

definition 6, error MAPE value: expresses the error between the predicted data and the real data; the smaller the calculated result, the smaller the error, in other words the better and more acceptable the corresponding values of P and m;

wherein C is the number of users and c is a specific user;

calculating Fc and MAPE for each m, and selecting the m with the smallest MAPE as the optimal training value.
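The selection of m can be sketched as follows. This is a minimal illustration that assumes the usual mean absolute percentage error, i.e. the mean over the C users of |Fc - Mc| / Mc expressed as a fraction, and assumes every Mc is positive; the names `mape` and `select_best_m` and the dictionary layout are assumptions of this sketch.

```python
def mape(predicted, actual):
    """Mean absolute percentage error between Fc (predicted) and Mc (actual).

    Assumption: the usual definition, mean of |Fc - Mc| / Mc, as a fraction.
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(f - m) / m for f, m in zip(predicted, actual)) / len(actual)


def select_best_m(spread_by_m, actual):
    """Pick the m whose simulated repost counts have the smallest MAPE.

    spread_by_m maps each candidate m to its list of Fc values, in the same
    user order as the list of real repost counts `actual`.
    """
    return min(spread_by_m, key=lambda m: mape(spread_by_m[m], actual))
```

For example, with real counts of 100 and 100, simulated counts of 95 and 105 give a MAPE of 0.05 and would be preferred over simulated counts of 50 and 50, whose MAPE is 0.5.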

The step 14 comprises:

step 141, generating a transition matrix P, and generating the transition matrix P according to the data of the test set and the optimal training value of m by the method of step 12;

step 142, generating the ranking of the algorithm: every entry of the user value vector St is initially set to 1, and a stable convergence value is obtained by Markov iteration, calculated as follows:

$$S_t = (1, \ldots, 1)_{1 \times n} \, P_{n \times n}$$

$$S_t = (S_t)_{1 \times n} \, P_{n \times n}$$

the Euclidean norm error Δ between two successive user value vectors is calculated, and the iteration stops when Δ is smaller than the preset precision; the entries of the resulting vector St are taken as the algorithm values of the users, and comparing these values gives the user ranking of the algorithm;
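The iteration of step 142 can be sketched as follows. This is a minimal illustration, assuming P is a row-stochastic list of rows; the function name `rank_by_markov_iteration`, the default precision and the iteration cap are assumptions of this sketch.

```python
import math


def rank_by_markov_iteration(P, tol=1e-9, max_iter=1000):
    """Iterate S <- S * P from the all-ones row vector until convergence.

    Returns (values, order), where order lists user indices from the highest
    algorithm value to the lowest. tol and max_iter are illustrative defaults.
    """
    n = len(P)
    s = [1.0] * n
    for _ in range(max_iter):
        # Row vector times matrix: new value of v collects s[u] * P[u][v].
        nxt = [sum(s[u] * P[u][v] for u in range(n)) for v in range(n)]
        # Euclidean norm of the difference between successive vectors.
        delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(nxt, s)))
        s = nxt
        if delta < tol:
            break
    order = sorted(range(n), key=lambda i: s[i], reverse=True)
    return s, order
```

For the two-user matrix with rows (0.5, 0.5) and (0.25, 0.75), the values converge to about (2/3, 4/3), so the second user is ranked first.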

the convergence of this process can be proved: for the iteration with the transition matrix P to converge, 3 conditions need to be satisfied:

P is a stochastic matrix;

P is irreducible;

P is aperiodic;

for the first condition, the stochastic matrix: let $P_{ij}$ be the element in row i and column j of P; for any $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, n$, $P_{ij} \ge 0$, and for any $i = 1, 2, \ldots, n$, $\sum_{j} P_{ij} = 1$; the matrix P is clearly non-negative and each of its rows sums to 1;

for the second condition, the matrix P is irreducible if and only if the directed graph of the network corresponding to P is strongly connected (any two nodes can reach each other), i.e. a path can be found between any two nodes; since all elements of the transfer matrix P of the algorithm are positive, such a path must exist, so the matrix P satisfies the irreducibility condition;

for the third condition, periodicity means that the iterated values repeat according to a regular pattern; for an irreducible matrix, aperiodicity is known to be equivalent to the matrix being primitive, where a primitive matrix is a matrix some power of which is a positive matrix; since all elements of P are positive, P satisfies this requirement, i.e. the third, aperiodicity, condition is satisfied;
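The three conditions can be tied back to the iteration of step 142 by a short worked statement. The following is the standard consequence of the Perron-Frobenius theorem for a positive row-stochastic matrix; the symbol π for the stationary distribution is notation introduced here for illustration only.

```latex
P_{ij} > 0,\quad \sum_{j} P_{ij} = 1
\;\Longrightarrow\;
\exists!\,\pi:\ \pi P = \pi,\ \ \pi_i > 0,\ \ \sum_{i} \pi_i = 1,
\qquad
\lim_{t \to \infty} (1, \ldots, 1)_{1 \times n}\, P^{t} = n\,\pi .
```

Because the entries of the all-ones starting vector sum to n and each multiplication by P preserves that sum, the limit is n times π, so ranking users by the converged St is the same as ranking them by π.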

meanwhile, the time of the method is mainly spent on the calculation of the transfer matrix, the calculation of the independent cascade model and the Markov iteration. Every element of the transfer probability matrix has to be computed separately, which takes O(n^2) time; each element is simple to compute, since the data can be read directly from the tables and only simple addition, division and exponential operations are needed, so the time complexity of generating the transfer matrix is O(n^2). In the independent cascade model, the worst case is that one user is infected at a time until all users are infected, which also takes time of order O(n^2), so its time complexity is O(n^2). The Markov iteration is relatively time-consuming, but the transfer matrix of the method is similar to the classical PageRank transfer matrix, the difference being that the PageRank algorithm divides the transfer probability of a user approximately equally, and research has shown that PageRank generally converges within 50-75 iterations. In conclusion, the total time complexity of the method is O(n^2), which is an acceptable time complexity.

In summary, the social network node influence ranking system provided by the embodiment of the present invention is applicable to social networks with incomplete dynamic information. A data set containing the personal information, post information and attention relationship information of social network users is obtained and simply preprocessed, for example by data cleaning and item merging, to generate a training set and a test set in the form required by the system; a model for generating and screening social network post transfer matrices is established from the training set, and the post forwarding transfer matrix is generated by considering in turn the position of the user in the network, the local network influence of the user, the self-transfer probability of the user and the inter-user transfer probability; post propagation simulation experiments are carried out for the different transfer matrices, the relative errors between the actual and simulated post propagation ranges are compared, and the transfer matrix with the minimum error and its corresponding training parameter are selected; a transfer matrix is then generated from the test set and the training parameter by the same modeling method, and a stable influence ranking is finally obtained.

From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.

The above-described embodiments of the apparatus and system are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A social network node influence sequencing method is used for recommending APP downloading or purchasing to a user, and is characterized by comprising the following steps:

step S130: establishing a test transfer matrix for post forwarding according to the test set and the optimal training parameters, calculating the test transfer matrix, and obtaining the social network node influence sequencing result;

the step S110 specifically includes:

the user pair information comprises a concern relationship between the users;

cutting the data set into a training set and a testing set according to requirements, wherein the training set comprises training set user personal information and training set user pair information; the test set comprises test set user personal information and test set user pair information;

in step S120, the establishing a transition matrix model of posts according to the training set specifically includes:

Step S126: obtaining a training transfer matrix P according to P_uu and P_uv:

2. The method according to claim 1, wherein in the step S120, performing simulation calculation on the transition matrix model to obtain the optimal training parameters specifically includes:

Determination of error MAPE value:

wherein c = {1, ..., C};

3. The method according to claim 2, wherein in step S130, the establishing a test transfer matrix for post forwarding according to the test set and the optimal training parameters specifically comprises:

according to the test set, taking the optimal training parameter as the parameter of f₁ and f₂, and establishing the test transfer matrix according to the method from the step S121 to the step S126.

4. The method according to claim 3, wherein in the step S130, the calculating the test transfer matrix to obtain the social network node influence ranking result specifically includes:

$$S_t = (1, \ldots, 1)_{1 \times n} \times P_m,$$

$$S = (S_t)_{1 \times n} \times P_m,$$

5. A social network node impact ranking system for performing the method of any of claims 1 to 4, the system comprising:

6. The system of claim 5, wherein the data preprocessing module specifically comprises:

the user pair information comprises a concern relationship between the users;

7. The system of claim 6, wherein the training module comprises:

and the optimal training parameter establishing unit is used for adopting the independent cascade model to carry out propagation simulation experiments on the transfer matrix model respectively to obtain an expected average transfer number, determining an error MAPE value and selecting the training parameter corresponding to the transfer matrix with the minimum MAPE value as the optimal training parameter.

8. The system of claim 7, wherein the test module comprises: