CN113033709A - Link prediction method and device - Google Patents

Link prediction method and device

Info

Publication number
CN113033709A
Authority
CN
China
Prior art keywords
network
link prediction
algorithm
target
community
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110485583.1A
Other languages
Chinese (zh)
Inventor
曾琳奕
夏冰沁
雷经纬
熊辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202110485583.1A priority Critical patent/CN113033709A/en
Publication of CN113033709A publication Critical patent/CN113033709A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of machine learning, and particularly discloses a link prediction method and a device, wherein the method comprises the following steps: acquiring a training network set and an algorithm label set, wherein the training network set comprises a plurality of network structure characteristics of each community network in a plurality of community networks, and the algorithm label set comprises a link prediction algorithm identification corresponding to each community network in the plurality of community networks in the training network set; and generating a decision tree model based on the training network set and the algorithm label set, determining a target link prediction algorithm corresponding to the target community network by using the decision tree model and a plurality of network structure characteristics of the target community network, and performing link prediction on the target community network according to the target link prediction algorithm. According to the scheme, the scoring link prediction algorithm suitable for the community network can be determined according to the network structure characteristics of the target community network by using the decision tree model, the reliability is high, and the labor cost and the time cost can be saved.

Description

Link prediction method and device
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a link prediction method and apparatus.
Background
With the rapid development of internet technology and the continuous popularization of high-speed and stable network services, the website experience keeps improving, and multimedia interaction functions are continuously innovated, refined and developed. At present, more and more enterprises establish online brand communities. A brand community can accurately gather an enterprise's scattered target customers, which facilitates targeted marketing activities, and brand communities have gradually become new carriers for enterprise marketing and tools for establishing strong and lasting relationships with customers. Consumer studies have long shown that groups can influence individual consumption decisions, and that members of an online brand community are often influenced by their groups of friends. Personalized marketing based on friend group influence refers to recommending users who are keen on certain products to target customers in a brand community, so that products are marketed to the target customers by means of the group influence effect, which greatly improves the marketing efficiency of the brand community.
The scoring link prediction algorithm (SLPA) is a main means of friend recommendation and predicts, based on the network topology, whether a link exists between a pair of nodes in a network. Although SLPAs such as the common neighbor algorithm (CN), the hub promoted index algorithm (HPI), and the resource allocation algorithm (RA) have been widely used for friend recommendation, no single model suits all network community structures. It is therefore necessary to consider a plurality of candidate SLPAs and to select the best one according to the network characteristics, which is a complicated task that requires a great deal of expert experience.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the application provides a link prediction method and a link prediction device, so that a scoring link prediction algorithm suitable for a specific network is selected from multiple candidate scoring link prediction algorithms.
The embodiment of the application provides a link prediction method, which comprises the following steps: acquiring a training network set and an algorithm label set, wherein the training network set comprises a plurality of network structure characteristics of each community network in a plurality of community networks, and the algorithm label set comprises a link prediction algorithm identification corresponding to each community network in the plurality of community networks in the training network set; and generating a decision tree model based on the training network set and the algorithm label set, determining a target link prediction algorithm corresponding to the target community network by using the decision tree model and a plurality of network structure characteristics of the target community network, and performing link prediction on the target community network according to the target link prediction algorithm.
An embodiment of the present application further provides a link prediction apparatus, including: an acquisition module, configured to acquire a training network set and an algorithm label set, wherein the training network set comprises a plurality of network structure characteristics of each community network in a plurality of community networks, and the algorithm label set comprises a link prediction algorithm identification corresponding to each community network in the plurality of community networks in the training network set; and a generation module, configured to generate a decision tree model based on the training network set and the algorithm label set, so that a target link prediction algorithm corresponding to the target community network is determined by utilizing the decision tree model and a plurality of network structure characteristics of the target community network, and link prediction is carried out on the target community network according to the target link prediction algorithm.
An embodiment of the present application further provides a computer device, which includes a processor and a memory for storing processor-executable instructions, where the processor executes the instructions to implement the steps of the link prediction method described in any of the above embodiments.
Embodiments of the present application further provide a computer-readable storage medium, on which computer instructions are stored, and when executed, the computer instructions implement the steps of the link prediction method described in any of the above embodiments.
In the embodiment of the application, a link prediction method is provided, which may obtain a training network set and an algorithm label set, where the training network set includes a plurality of network structure features of each of a plurality of community networks, and the algorithm label set includes a link prediction algorithm identifier corresponding to each of the plurality of community networks in the training network set. A decision tree model may be generated based on the training network set and the algorithm label set, so that a target link prediction algorithm corresponding to a target community network may be determined by using the decision tree model and the plurality of network structure features of the target community network, and link prediction may be performed on the target community network according to the target link prediction algorithm. According to the scheme, the decision tree model is trained according to the network structure characteristics of the community networks and the corresponding scoring link prediction algorithms, so that the scoring link prediction algorithm suitable for a community network can be determined according to the network structure characteristics of the target community network by using the decision tree model. The scoring link prediction algorithm suitable for various community networks can thus be selected quickly without relying on subjective experience, the reliability is high, and the labor cost and the time cost can be saved. The selected scoring link prediction algorithm can then be used to carry out link prediction on the community network with high prediction accuracy, and the prediction result can be applied to the prediction of friend links between node pairs of friend groups in the community network, so that the influence of groups on individuals can be utilized and the recommendation efficiency of a certain product or service can be improved by making the best use of the group effect.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application, are incorporated in and constitute a part of this application, and are not intended to limit the application. In the drawings:
FIG. 1 shows a flow chart of a link prediction method in an embodiment of the present application;
FIG. 2 is a flow chart illustrating a link prediction method in an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a decision tree model in a link prediction method according to an embodiment of the present application;
FIG. 4 shows a comparison of predicted performance for various scored link prediction algorithms;
FIG. 5 shows a comparison graph of prediction performance of the flexible link prediction model, the target link prediction algorithm selected by the decision tree model, and the preferred scored link prediction algorithm;
FIG. 6 shows the number of nodes in a circle versus AUC;
FIG. 7 is a schematic diagram of a link prediction apparatus in an embodiment of the present application;
FIG. 8 shows a schematic diagram of a computer device in an embodiment of the application.
Detailed Description
The principles and spirit of the present application will be described with reference to a number of exemplary embodiments. It should be understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the present application, and are not intended to limit the scope of the present application in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present application may be embodied as a system, apparatus, device, method or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
The embodiment of the application provides a link prediction method. Fig. 1 shows a flowchart of a link prediction method in an embodiment of the present application. Although the present application provides method operational steps or apparatus configurations as illustrated in the following examples or figures, more or fewer operational steps or modular units may be included in the methods or apparatus based on conventional or non-inventive efforts. In the case of steps or structures which do not logically have the necessary cause and effect relationship, the execution sequence of the steps or the module structure of the apparatus is not limited to the execution sequence or the module structure described in the embodiments and shown in the drawings of the present application. When the described method or module structure is applied in an actual device or end product, the method or module structure according to the embodiments or shown in the drawings can be executed sequentially or executed in parallel (for example, in a parallel processor or multi-thread processing environment, or even in a distributed processing environment).
Specifically, as shown in fig. 1, a link prediction method provided by an embodiment of the present application may include the following steps:
Step S101, a training network set and an algorithm label set are obtained, wherein the training network set comprises a plurality of network structure characteristics of each community network in a plurality of community networks, and the algorithm label set comprises a link prediction algorithm identification corresponding to each community network in the plurality of community networks in the training network set.
In this embodiment, a training network set and an algorithm label set may be obtained. The training network set may include a plurality of network structure features of each of the plurality of community networks. The plurality of network structure features may include at least one of: average shortest path, average degree, node betweenness, link betweenness, average clustering coefficient and other network structure characteristics. The algorithm label set comprises the link prediction algorithm identifier corresponding to each community network in the plurality of community networks in the training network set. The link prediction algorithm identifier corresponding to each community network may identify one of multiple candidate link prediction algorithms. The plurality of candidate link prediction algorithms may include various link prediction algorithms such as the common neighbor algorithm (CN), the hub promoted index algorithm (HPI), the resource allocation algorithm (RA), and the like.
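As an illustration of how two of the listed network structure features could be computed from a node set and an edge set, the following is a minimal sketch. The helper names `build_adjacency`, `average_degree` and `average_clustering`, and the toy network, are assumptions made for this example and do not come from the application; nodes of degree below 2 are assumed to contribute a clustering coefficient of 0.

```python
def build_adjacency(nodes, edges):
    """Build an undirected adjacency map from a node set and an edge set."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    return adj


def average_degree(adj):
    """Mean number of neighbours per node (2M / N for a simple graph)."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def average_clustering(adj):
    """Mean local clustering coefficient over all nodes.

    Assumption: nodes with fewer than two neighbours contribute 0.
    """
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count the edges that exist among the neighbours (each counted once).
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


# Hypothetical toy community network: a triangle 1-2-3 with a pendant node 4.
nodes = [1, 2, 3, 4]
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
adj = build_adjacency(nodes, edges)
features = {
    "average_degree": average_degree(adj),
    "average_clustering": average_clustering(adj),
}
```

Each community network in the training network set would be reduced to one such feature dictionary; the remaining features (average shortest path, node betweenness, link betweenness) are omitted here for brevity.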
Step S102, a decision tree model is generated based on the training network set and the algorithm label set, so that a target link prediction algorithm corresponding to a target community network is determined by utilizing the decision tree model and a plurality of network structure characteristics of the target community network, and link prediction is carried out on the target community network according to the target link prediction algorithm.
After the training network set and the algorithm label set are obtained, a decision tree model can be generated based on the plurality of network structure features corresponding to each community network in the training network set and the link prediction algorithm identifier corresponding to each community network. After the decision tree model is obtained, a target link prediction algorithm corresponding to the target community network can be determined by using the decision tree model and a plurality of network structure characteristics of the target community network. For example, decision rules may be extracted from the generated decision tree model, and the plurality of network structure characteristics of the target community network may be calculated. The target link prediction algorithm corresponding to the target community network is then determined according to the decision rules and the calculated network structure characteristics. The target link prediction algorithm is one of the plurality of candidate link prediction algorithms. After the target link prediction algorithm suitable for the target community network is determined, it may be used to perform link prediction on the target community network.
In the embodiment, the decision tree model is trained according to the network structure characteristics of the community network and the corresponding scoring link prediction algorithm, so that the scoring link prediction algorithm suitable for the community network can be determined according to the network structure characteristics of the target community network by using the decision tree model, the scoring link prediction algorithm suitable for various community networks can be quickly selected without subjective experience, the reliability is high, and the labor cost and the time cost can be saved. And then, the selected scoring link prediction algorithm can be used for carrying out link prediction on the community network, the prediction accuracy is high, and the prediction result can be applied to the prediction of friend links between friend group node pairs of the community network, so that the influence of groups on individuals can be utilized, and the recommendation efficiency of a certain product or service can be improved by best utilizing the group effect.
In some embodiments of the present application, obtaining the training network set and the algorithm label set may include: acquiring a training network set, wherein the training network set further comprises a node set and an edge set of each community network in a plurality of community networks; determining the unconnected node pairs in each community network according to the node set and the edge set of each community network; assigning score values to the unconnected node pairs in each community network by using each candidate link prediction algorithm in the multiple candidate link prediction algorithms to obtain multiple total score values corresponding to each community network, wherein each of the total score values corresponding to a community network corresponds to one of the candidate link prediction algorithms; and determining the identifier of the candidate link prediction algorithm corresponding to the maximum total score value among the total score values of each community network as the algorithm label of that community network, to obtain the algorithm label set.
Specifically, each of the community networks in the training network set may be an undirected network comprising a node set V and an edge set E. For a network with N nodes and M edges, there are N(N-1)/2 node pairs in total, which form the complete set U. The node pairs in U that do not belong to the edge set E are the unconnected node pairs, and there are N(N-1)/2 - M of them. Given a link prediction algorithm, a score value is assigned to each unconnected node pair, and all unconnected node pairs are then sorted in descending order of score value; the node pair ranked first has the highest probability of being linked. To determine the link prediction algorithm best suited to a specific network, each of the plurality of candidate link prediction algorithms is used to assign score values to all unconnected node pairs in the network, and the score values are summed to obtain the total score value corresponding to that candidate algorithm. The candidate link prediction algorithm with the largest total score value may be determined as the link prediction algorithm corresponding to the network. The most suitable link prediction algorithm is determined in this way for each community network in the training network set, and the identifier of that algorithm is used as the algorithm label of the network, yielding the algorithm label set. In this manner, a training network set and an algorithm label set for training the decision tree model can be obtained.
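The labelling procedure described above can be sketched as follows. The application does not spell out the scoring formulas, so this sketch assumes the usual textbook definitions: CN counts common neighbours, HPI divides that count by the smaller of the two node degrees, and RA sums the inverse degrees of the common neighbours. The function names and the toy network are hypothetical.

```python
from itertools import combinations


def neighbours(nodes, edges):
    """Undirected adjacency map built from a node set V and an edge set E."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def cn(adj, x, y):
    """Common neighbours: |N(x) & N(y)| (assumed standard definition)."""
    return len(adj[x] & adj[y])


def hpi(adj, x, y):
    """Hub promoted index: common neighbours / min(deg x, deg y)."""
    denom = min(len(adj[x]), len(adj[y]))
    return len(adj[x] & adj[y]) / denom if denom else 0.0


def ra(adj, x, y):
    """Resource allocation: sum of 1/deg(z) over common neighbours z."""
    return sum(1.0 / len(adj[z]) for z in adj[x] & adj[y])


CANDIDATES = {"CN": cn, "HPI": hpi, "RA": ra}


def unconnected_pairs(nodes, edges):
    """Node pairs of the complete set U that are not in the edge set E."""
    linked = {frozenset(e) for e in edges}
    return [p for p in combinations(nodes, 2) if frozenset(p) not in linked]


def algorithm_label(nodes, edges):
    """Return the identifier with the largest total score, plus all totals."""
    adj = neighbours(nodes, edges)
    pairs = unconnected_pairs(nodes, edges)
    totals = {
        name: sum(score(adj, x, y) for x, y in pairs)
        for name, score in CANDIDATES.items()
    }
    return max(totals, key=totals.get), totals


# Hypothetical network with N = 5 nodes and M = 5 edges.
nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
label, totals = algorithm_label(nodes, edges)
```

For this toy network there are 5 * 4 / 2 - 5 = 5 unconnected node pairs, matching the N(N-1)/2 - M count given in the text.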
In some embodiments of the present application, generating a decision tree model based on a training network set and an algorithm label set may include: calculating the information entropy of the training network set after classifying the training network set according to the link prediction algorithm corresponding to each community network according to the algorithm label set; determining a target information gain rate corresponding to each network structure feature in the plurality of network structure features based on the information entropy; taking the network structure characteristic with the maximum target information gain rate as a root node, and determining a branch threshold corresponding to the root node; branching the training network set according to the branching threshold value of the root node to obtain a first training network subset and a second training network subset; branching the first training network subset until the link prediction algorithm identifiers in the algorithm label subsets corresponding to the training network subsets obtained after branching are the same; and branching the second training network subset until the link prediction algorithm identifiers in the algorithm label subsets corresponding to the training network subsets obtained after branching are the same.
After the training network set and the algorithm label set are obtained, the decision tree model can be obtained through training. First, the information entropy of the training network set, after it is classified according to the link prediction algorithm corresponding to each community network, can be calculated. Entropy is one of the most common indicators used to measure the purity of a sample set. For example, let $T$ denote the training network set. Assuming that there are $k$ link prediction algorithms, each training network in $T$ is assigned to its most suitable link prediction algorithm, which partitions $T$ into $\{S_1, S_2, \ldots, S_k\}$, where $S_i$ ($i = 1, 2, \ldots, k$) is the set of community networks in $T$ assigned to the $i$-th link prediction algorithm. The information entropy $\mathrm{Info}(T)$ required for classifying $T$ according to the link prediction algorithm applicable to each community network is:
$$\mathrm{Info}(T) = -\sum_{i=1}^{k} P_i \log_2 P_i$$
where $P_i$ is the prior probability, $P_i = |S_i| / |T|$, $|T|$ is the number of community networks in the training network set $T$, and $|S_i|$ is the number of community networks in $S_i$.
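A minimal sketch of the entropy calculation, assuming a base-2 logarithm and an algorithm label list with one entry per community network; the function name `info` and the sample labels are illustrative only.

```python
from collections import Counter
from math import log2


def info(labels):
    """Info(T) = -sum_i P_i * log2(P_i), with P_i = |S_i| / |T|."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


# Hypothetical label set: two networks suit CN, one suits RA, one suits HPI.
labels = ["CN", "CN", "RA", "HPI"]
entropy = info(labels)
```

A set whose networks all carry the same label has entropy 0, which is the stopping condition used when branching the tree.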
In a decision tree algorithm, a feature may be selected based on the information gain: the greater the information gain, the better the selectivity of the feature. In probabilistic terms, the information gain is the difference between the entropy of the set to be classified and the conditional entropy given a selected feature. For example, the networks in the training network set $T$ may be divided according to a network structure feature $A$ whose smallest and largest values in $T$ are $A_1$ and $A_J$. With the target division point denoted $a_i$, $T$ is divided into two subsets $T_1$ and $T_2$, where the value $V(A, T_1)$ of feature $A$ for the networks in $T_1$ satisfies $V(A, T_1) \in [A_1, a_i]$ and, likewise, the value for the networks in $T_2$ satisfies $V(A, T_2) \in (a_i, A_J]$. Corresponding to this division, the information gain $\mathrm{Gain}(A)$ of the network structure feature $A$ is:
$$\mathrm{Gain}(A) = \mathrm{Info}(T) - \mathrm{Info}_A(T),$$
where
$$\mathrm{Info}_A(T) = \sum_{j=1}^{2} \frac{|T_j|}{|T|}\,\mathrm{Info}(T_j).$$
Generally, the information gain criterion favors attributes that take many distinct values; that is, when information gain is used as the selection criterion, attributes with more values tend to be selected. However, splitting on an attribute with too many values may render the classification meaningless, so the feature used for classification is generally selected based on the information gain rate. Corresponding to the division point $a_i$, the information gain rate $\mathrm{GainRatio}(A, a_i)$ of $A$ is:
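$$\mathrm{GainRatio}(A, a_i) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(T)},$$
where
$$\mathrm{SplitInfo}_A(T) = -\sum_{j=1}^{2} \frac{|T_j|}{|T|} \log_2 \frac{|T_j|}{|T|}.$$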
For the network structure feature $A$, a plurality of division points $a_i$ can be determined, thereby obtaining a plurality of information gain rates, and the maximum of these information gain rates can be determined as the target information gain rate of the network structure feature $A$. The plurality of division points may be values randomly selected within the value range of $A$, or values determined within the value range according to a preset algorithm.
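The following sketch evaluates the information gain rate of one feature at a given division point and then takes the maximum over several candidate points as the target information gain rate. It assumes the C4.5-style definition, in which the gain is divided by the entropy of the two subset proportions; the names `info` and `gain_ratio`, the feature values and the candidate points are all hypothetical.

```python
from collections import Counter
from math import log2


def info(labels):
    """Entropy of a label list (base-2 logarithm assumed)."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, split):
    """Information gain rate of a binary split `value <= split` vs `> split`."""
    left = [l for v, l in zip(values, labels) if v <= split]
    right = [l for v, l in zip(values, labels) if v > split]
    if not left or not right:
        return 0.0  # assumption: a split that leaves one side empty is useless
    n = len(labels)
    weights = [len(left) / n, len(right) / n]
    info_a = weights[0] * info(left) + weights[1] * info(right)
    gain = info(labels) - info_a
    split_info = -sum(w * log2(w) for w in weights)
    return gain / split_info


# Hypothetical values of one feature (e.g. average clustering coefficient)
# for four community networks, with their algorithm labels.
values = [0.1, 0.2, 0.6, 0.8]
labels = ["CN", "CN", "RA", "RA"]

# Target information gain rate: the maximum over the candidate division points.
candidates = [0.15, 0.4, 0.7]
target_rate, target_split = max((gain_ratio(values, labels, s), s) for s in candidates)
```

Here the division point 0.4 separates the two labels cleanly, so it yields the largest rate and would become the branch threshold if this feature were chosen as the node.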
Then, the network structure feature with the largest target information gain rate may be used as a root node, and a branch threshold corresponding to the root node may be determined. For example, the division point corresponding to the target information gain rate may be determined as the branch threshold corresponding to the root node. And branching the training network set according to the branch threshold value of the root node to obtain a first training network subset and a second training network subset. The value of the network structure feature corresponding to the root node of each community network in the first training network subset may be less than or equal to a branch threshold of the root node. The values of the network structure features corresponding to the root node for each community network in the second subset of training networks may be greater than the branch threshold of the root node.
After the first training network subset and the second training network subset are obtained, the first training network subset and the second training network subset may be branched based on the same method. The first training network subset may be branched until the link prediction algorithm identifiers in the algorithm label subsets corresponding to the training network subset obtained after branching are the same. The second training network subset may be branched until the link prediction algorithm identifiers in the algorithm label subsets corresponding to the training network subset obtained after branching are the same.
Taking the first training network subset as an example, the first algorithm label subset corresponding to the first training network subset may be obtained while obtaining the first training network subset. It may be determined whether the link prediction algorithm identifiers in the first subset of algorithm labels are the same identifier, that is, it is determined whether each community network in the first subset of training networks is suitable for the same link prediction algorithm, and if so, the branching is stopped. Otherwise, the first training network subset is branched according to the same method as the method for branching the training network set, and the difference is that the network structure characteristics corresponding to the root node do not need to be considered.
For example, when the training network set T is branched, the root node is determined to be the network structure feature a, and the branch threshold is a, the training network set T is divided into a first training network subset T1 and a second training network subset T2. The values of the network structure characteristics A in T1 are all less than or equal to a, and the values of the network structure characteristics A in T2 are all greater than a. The first subset of tags corresponding to T1 is L1. In the case that the link prediction algorithm identifiers in L1 are not all the same identifier, the information entropy S1 corresponding to T1 is determined according to the number of each identifier in L1. For T1, a target information gain rate corresponding to each of the network structure features other than a among the plurality of network structure features is determined based on S1, the network structure feature with the largest target information gain rate is taken as a root node of T1, and T1 is branched to obtain training network subsets T11 and T12. And repeating the steps until the algorithm identifications in the algorithm label subset corresponding to the final training network subset belong to the same algorithm identification, thereby obtaining the decision tree model.
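The recursive branching described above can be sketched as follows. This is an illustrative reading of the procedure, not the applicant's implementation: division points are taken as midpoints of adjacent distinct values, a feature already used on the path is excluded from the subsets below it, near-ties between gain rates are resolved in favour of the first candidate found, and a majority-label leaf is assumed when no useful split remains. All names and data are hypothetical.

```python
from collections import Counter
from math import log2


def info(labels):
    """Entropy of a label list (base-2 logarithm assumed)."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def gain_ratio(rows, labels, feature, split):
    """Information gain rate of splitting on `feature` at `split`."""
    left = [l for r, l in zip(rows, labels) if r[feature] <= split]
    right = [l for r, l in zip(rows, labels) if r[feature] > split]
    if not left or not right:
        return 0.0
    n = len(labels)
    parts = (left, right)
    info_a = sum(len(p) / n * info(p) for p in parts)
    split_info = -sum(len(p) / n * log2(len(p) / n) for p in parts)
    return (info(labels) - info_a) / split_info


def best_split(rows, labels, features):
    """Return (rate, feature, threshold) with the largest gain rate, or None."""
    best = None
    for f in features:
        vals = sorted({r[f] for r in rows})
        for lo, hi in zip(vals, vals[1:]):
            t = (lo + hi) / 2
            rate = gain_ratio(rows, labels, f, t)
            # Assumption: ties within 1e-12 keep the earlier candidate.
            if best is None or rate > best[0] + 1e-12:
                best = (rate, f, t)
    return best


def build_tree(rows, labels, features):
    """Branch until every subset carries a single algorithm identifier."""
    if len(set(labels)) == 1:
        return labels[0]
    split = best_split(rows, labels, features)
    if split is None or split[0] <= 0:
        # Assumption: fall back to the majority label when no split helps.
        return Counter(labels).most_common(1)[0][0]
    _, f, t = split
    rest = [g for g in features if g != f]  # the used feature is not reconsidered
    le = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    gt = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "le": build_tree([r for r, _ in le], [l for _, l in le], rest),
        "gt": build_tree([r for r, _ in gt], [l for _, l in gt], rest),
    }


def predict(tree, row):
    """Follow the decision rules down to a leaf (an algorithm identifier)."""
    while isinstance(tree, dict):
        tree = tree["le"] if row[tree["feature"]] <= tree["threshold"] else tree["gt"]
    return tree


# Hypothetical training network set (feature values) and algorithm label set.
rows = [
    {"avg_degree": 2.0, "avg_clustering": 0.10},
    {"avg_degree": 2.5, "avg_clustering": 0.15},
    {"avg_degree": 6.0, "avg_clustering": 0.20},
    {"avg_degree": 6.5, "avg_clustering": 0.70},
]
labels = ["CN", "CN", "RA", "HPI"]
tree = build_tree(rows, labels, ["avg_degree", "avg_clustering"])
target = predict(tree, {"avg_degree": 7.0, "avg_clustering": 0.9})
```

With this toy data the root splits on `avg_degree`, the left subset is already pure, and the right subset is split once more on `avg_clustering`, mirroring the T, T1, T2, T11, T12 walk-through in the text.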
In another embodiment, the attribute with the largest information gain rate is not selected directly as the partition attribute. Instead, attributes whose information gain is below the average level are first removed in a screening pass, and the attribute with the highest information gain rate is then selected from the remaining attributes, which takes both information gain and information gain rate into account.
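This two-stage selection can be sketched as follows, assuming the information gain and information gain rate of each feature have already been computed. The function name `select_feature` and the numbers are purely illustrative and are not results from the application.

```python
def select_feature(stats):
    """Pick a partition feature from {name: (gain, gain_rate)}.

    Stage 1: drop features whose information gain is below the average gain.
    Stage 2: among the rest, return the one with the highest gain rate.
    """
    mean_gain = sum(gain for gain, _ in stats.values()) / len(stats)
    shortlisted = {f: s for f, s in stats.items() if s[0] >= mean_gain}
    return max(shortlisted, key=lambda f: shortlisted[f][1])


# Made-up (information gain, information gain rate) values per feature.
stats = {
    "average_degree": (0.80, 0.55),
    "average_clustering": (0.60, 0.70),
    "average_shortest_path": (0.10, 0.90),
}
chosen = select_feature(stats)
```

In this example `average_shortest_path` has the highest gain rate but a below-average gain, so it is screened out and `average_clustering` is chosen instead.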
In some embodiments of the present application, determining, based on the information entropy, a target information gain rate corresponding to each of the plurality of network structure features may include: performing ascending arrangement on a plurality of values of each network structure characteristic to obtain a characteristic sequence corresponding to each network structure characteristic, wherein each value in the plurality of values corresponds to each community network in the plurality of community networks; dividing a training network set by each division point in a plurality of division points in a characteristic sequence corresponding to each network structure characteristic, and calculating corresponding information gain rates based on information entropy to obtain a plurality of information gain rates corresponding to each network structure characteristic, wherein the plurality of division points in the characteristic sequence corresponding to each network structure characteristic comprise median values of two adjacent values in the characteristic sequence corresponding to each network structure characteristic; determining the maximum information gain rate in a plurality of information gain rates corresponding to each network structure characteristic as a target information gain rate corresponding to each network structure characteristic; correspondingly, determining a branch threshold corresponding to the root node includes: and taking the division point corresponding to the target information gain rate of the root node as a branch threshold corresponding to the root node.
Specifically, each network structure feature has one value for each of the plurality of community networks in the training network set. The values of each network structure feature may be sorted in ascending order to obtain the corresponding feature sequence. The average value or the median value of every two adjacent values in each feature sequence is determined as a division point, giving a plurality of division points for each network structure feature. Each division point is used in turn as a branch threshold to divide the training network set, and the information gain rate corresponding to that division point is calculated, giving a plurality of information gain rates for each network structure feature. The maximum of these information gain rates is determined as the target information gain rate of the network structure feature. The network structure feature with the maximum target information gain rate is determined as the root node, and the division point corresponding to the target information gain rate of the root node is determined as the branch threshold of the root node. The training network set is then divided according to the root node and the corresponding branch threshold.
For example, when the networks in T are divided by the network structure feature A (e.g., the average clustering coefficient), the values of A are arranged in ascending order to obtain the sequence $\{A_1, A_2, \ldots, A_J\}$, where J is the number of values of the network structure feature A. The i-th division point ($1 \le i \le J-1$) is defined as $a_i = (A_i + A_{i+1})/2$. The training network set T can then be divided into 2 subsets $T_1$ and $T_2$, where the value of the network structure feature A of the networks in $T_1$ satisfies $V(A, T_1) \in [A_1, a_i]$ and, similarly, that of the networks in $T_2$ satisfies $V(A, T_2) \in (a_i, A_J]$. Corresponding to this division, the information gain of the network structure feature A is:
$\mathrm{Gain}(A) = \mathrm{Info}(T) - \mathrm{Info}_A(T)$;
wherein,
Figure BDA0003050117460000091
The information gain rate of A corresponding to the division point $a_i$ is:
Figure BDA0003050117460000092
wherein,
Figure BDA0003050117460000093
The information gain rate corresponding to each division point in the feature sequence $\{A_1, A_2, \ldots, A_J\}$ of the network structure feature A can be calculated, and the division point with the maximum gain rate is selected as the optimal branch threshold of the network structure feature A, namely $\mathrm{Threshold}(A) = \arg\max_{a_i,\, 1 \le i \le J-1} \mathrm{Gain\_Ratio}(A, a_i)$. In this way, the division points corresponding to each network structure feature can be conveniently determined and the corresponding information gain rates calculated, so as to obtain the target information gain rate corresponding to each network structure feature.
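The equation images referenced above are not reproduced in this text. Assuming the method follows the standard C4.5 definitions (the description later names C4.5 as its basis), the missing quantities would take the following form. Here $S_k$ denotes the set of networks in T labeled with the k-th link prediction algorithm, $T_1$ and $T_2$ are the two subsets produced by the division point $a_i$, and the name SplitInfo is introduced only for illustration.

```latex
\begin{align}
\mathrm{Info}(T) &= -\sum_{k=1}^{K} \frac{|S_k|}{|T|} \log_2 \frac{|S_k|}{|T|} \\
\mathrm{Info}_A(T) &= \sum_{j=1}^{2} \frac{|T_j|}{|T|}\, \mathrm{Info}(T_j) \\
\mathrm{SplitInfo}(A, a_i) &= -\sum_{j=1}^{2} \frac{|T_j|}{|T|} \log_2 \frac{|T_j|}{|T|} \\
\mathrm{Gain\_Ratio}(A, a_i) &= \frac{\mathrm{Info}(T) - \mathrm{Info}_A(T)}{\mathrm{SplitInfo}(A, a_i)}
\end{align}
```

Under these assumed definitions, a division that separates the classes cleanly gives $\mathrm{Info}_A(T) = 0$ and therefore the largest possible gain for that feature.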
In some embodiments of the present application, the link prediction method may further include: acquiring a training sample set and a classification label set; and constructing a support vector machine model based on the training sample set and the classification label set, so as to determine, by using the support vector machine model, a generative algorithm of the target link prediction algorithm corresponding to the target community network. The training sample set includes a plurality of feature vectors, each of which corresponds to one of a plurality of algorithm pairs and characterizes the difference between the scores that the two link prediction algorithms of that pair assign to the same community network. The classification label set indicates, for each feature vector in the training sample set, whether the two corresponding link prediction algorithms are mutual generative algorithms. Each algorithm pair includes two candidate link prediction algorithms. Two link prediction algorithms are mutual generative algorithms when the accuracy of predicting the community network by combining the two algorithms is greater than the accuracy of predicting the community network by either algorithm alone.
Considering that overestimation or underestimation by different link prediction algorithms degrades the performance of the prediction model, the inventor proposes the concept of a generative algorithm, and performs link prediction on the target community network by using the target link prediction algorithm together with its generative algorithms. When the accuracy of link prediction on a certain community network by combining two link prediction algorithms is greater than the accuracy of prediction on that community network by either of the two algorithms alone, the two link prediction algorithms are mutual generative algorithms with respect to that community network. For example, if, for the community network M, the accuracy of link prediction by combining the link prediction algorithms S1 and S2 is greater than the accuracy of link prediction by using only S1 or only S2, then the algorithms S1 and S2 are mutual generative algorithms for the community network M.
Because a Support Vector Machine (SVM) has particular advantages in small-sample, nonlinear and high-dimensional pattern recognition, an SVM model can be used to screen for generative algorithms whose strengths complement those of the target link prediction algorithm selected by the decision tree model.
In this embodiment, a training sample set and a classification label set may be obtained. The training sample set may include a plurality of feature vectors, each of which characterizes the difference between the scores that the two link prediction algorithms of one algorithm pair assign to the same community network. Various metrics may be selected to characterize the scoring differences between algorithms, and may include at least one of: Euclidean distance, standardized Euclidean distance, Manhattan distance, Chebyshev distance, cosine distance, correlation distance, and Spearman distance. The classification label set can be used to indicate whether the two link prediction algorithms of the algorithm pair corresponding to each feature vector in the training sample set are mutual generative algorithms. Each algorithm pair includes two of the plurality of candidate link prediction algorithms.
After the training sample set and the classification label set are obtained, a support vector machine model can be trained by using them. The support vector machine model can then be used to determine the generative algorithms of the target link prediction algorithm corresponding to the target community network. When link prediction is performed on a target community network, the decision tree model can be used to determine a target link prediction algorithm suitable for the target community network, and the support vector machine model can be used to determine the generative algorithms of that target link prediction algorithm for the target community network. Link prediction can then be performed on the target community network by combining the target link prediction algorithm with its generative algorithms, which can further improve the accuracy of the link prediction.
In some embodiments of the present application, determining, by using the support vector machine model, the generative algorithm of the target link prediction algorithm corresponding to the target community network may include: performing link prediction on the target community network by using the target link prediction algorithm to obtain a target link prediction result; performing link prediction on the target community network by using each of the plurality of candidate link prediction algorithms to obtain a plurality of candidate link prediction results; determining a plurality of feature vectors according to the target link prediction result and the plurality of candidate link prediction results, wherein each of the feature vectors corresponds to one of the candidate link prediction algorithms; and inputting the plurality of feature vectors into the support vector machine model to determine whether each of the candidate link prediction algorithms is a generative algorithm of the target link prediction algorithm.
Specifically, the target link prediction algorithm may be used to perform link prediction on the target community network to obtain a target link prediction result. Each of the plurality of candidate link prediction algorithms may likewise be used to perform link prediction on the target community network to obtain a plurality of candidate link prediction results. A link prediction result may be the scores that a given link prediction algorithm assigns to all pairs of unconnected nodes in the target community network. A plurality of feature vectors may be determined based on the target link prediction result and the plurality of candidate link prediction results. Each feature vector corresponds to one candidate link prediction algorithm and characterizes the difference between the scores that the target link prediction algorithm and that candidate algorithm assign to the target community network. By inputting each feature vector into the support vector machine model, a plurality of classification results can be output. Each classification result indicates whether the candidate link prediction algorithm corresponding to the feature vector is a generative algorithm of the target link prediction algorithm, so that the generative algorithms of the target link prediction algorithm can be obtained. In this way, the generative algorithms of the target link prediction algorithm can be determined for the target community network.
In some embodiments of the present application, performing link prediction on a target community network according to a target link prediction algorithm may include: generating a flexible link prediction model according to the target link prediction algorithm and the generative algorithms of the target link prediction algorithm; and performing link prediction on the target community network based on the flexible link prediction model. For example, the target link prediction algorithm may be combined by addition with one or more of its generative algorithms to generate the flexible link prediction model. Using the flexible link prediction model to perform link prediction on the community network can improve the accuracy of the link prediction.
In some embodiments of the present application, the flexible link prediction model is:
$S = w \cdot (B, E_1, E_2, \ldots, E_i)^{T}$
where S is the flexible link prediction model; w is a vector formed by the weights corresponding to the target link prediction algorithm and its generative algorithms; B is the target link prediction algorithm; $E_1, E_2, \ldots, E_i$ are the generative algorithms of the target link prediction algorithm; i is the number of generative algorithms; and the superscript T denotes matrix transposition. In one embodiment, the weights of the target link prediction algorithm and its generative algorithms may be determined according to the total score that each algorithm assigns to the unconnected node pairs in the target community network: the higher the total score, the greater the weight of the corresponding algorithm. In another embodiment, the average score of each link prediction algorithm is obtained in the process of training the decision tree model, and the average scores of the target link prediction algorithm and its generative algorithms can be used as the weights of the flexible link prediction model. In these embodiments, taking the weight of each algorithm into account can further improve the accuracy of link prediction compared with simply adding the target link prediction algorithm and its generative algorithms.
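The weighted combination can be illustrated with a short Python sketch. It assumes that each algorithm's output is available as a mapping from node pair to score. The function name `combine_scores`, the node names and the weight values are hypothetical.

```python
def combine_scores(score_tables, weights):
    """Evaluate S = w . (B, E_1, ..., E_i)^T for every node pair.

    score_tables[0] holds the scores of the target algorithm B and the
    remaining tables hold the scores of its generative algorithms. Each
    table maps a node pair to a score and must cover the same node pairs.
    """
    if len(score_tables) != len(weights):
        raise ValueError("one weight per algorithm is required")
    return {
        pair: sum(w * table[pair] for w, table in zip(weights, score_tables))
        for pair in score_tables[0]
    }


# Hypothetical scores for two unconnected node pairs.
b_scores = {("u", "v"): 0.5, ("u", "w"): 0.25}
e1_scores = {("u", "v"): 1.0, ("u", "w"): 2.0}

combined = combine_scores([b_scores, e1_scores], [0.5, 0.25])
# Only the ordering of the combined scores matters for link prediction.
ranking = sorted(combined, key=combined.get, reverse=True)
```

Setting every weight to 1 reduces this to the plain sum of the algorithms' scores.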
The above method is described below with reference to a specific example, however, it should be noted that the specific example is only for better describing the present application and is not to be construed as limiting the present application.
In this embodiment, the link prediction algorithm may also be referred to as a scoring link prediction algorithm (SLPA). The main idea of a scoring link prediction algorithm is that the higher the score computed by a formula for a node pair, the higher the probability that a link forms between the pair. A principle common to all scoring link prediction algorithms is that the absolute value of a link score is not important; what matters is the ranking of the link scores. Seven commonly used scoring link prediction algorithms, namely Salton, Sorensen, HPI (hub promoted index), HDI (hub depressed index), LHN, PA and RA, can be selected, as shown in Table 1.
TABLE 1
Figure BDA0003050117460000121
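Table 1 is not reproduced in this text. As a hedged sketch, the Python code below scores one node pair with the definitions of these seven indices that are standard in the link prediction literature; they are assumed, not confirmed, to match Table 1. LHN is taken to be the first Leicht-Holme-Newman index. The function name, the adjacency-set representation and the toy graph are hypothetical.

```python
import math


def classical_scores(adj, x, y):
    """Scores of seven classical similarity indices for the node pair (x, y).

    adj maps each node to the set of its neighbours in an undirected graph.
    Both nodes are assumed to have at least one neighbour.
    """
    common = adj[x] & adj[y]          # shared neighbours of x and y
    kx, ky = len(adj[x]), len(adj[y])  # node degrees
    n = len(common)
    return {
        "Salton": n / math.sqrt(kx * ky),
        "Sorensen": 2 * n / (kx + ky),
        "HPI": n / min(kx, ky),
        "HDI": n / max(kx, ky),
        "LHN": n / (kx * ky),
        "PA": kx * ky,
        "RA": sum(1 / len(adj[z]) for z in common),
    }


# Toy graph: a and b are not connected but share the neighbours c and d.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
}
scores = classical_scores(adj, "a", "b")
```

PA depends only on the two degrees, whereas the other six indices all depend on the shared neighbours.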
Furthermore, in this embodiment, 5 new algorithms are proposed according to the idea of resource allocation. Hereinafter, Γ(x) represents the neighbors of node x, |·| represents the number of elements of ·, |Γ(x) ∩ Γ(y)| represents the number of neighbors common to node x and node y, and k(x) represents the degree of node x. The 5 newly proposed resource allocation scoring algorithms are as follows: a resource secondary allocation algorithm (WA1), a node-based network location allocation algorithm (WA2), a node resource dispersion algorithm (WA3), a neighbor node resource dispersion algorithm (WA4), and a combined algorithm (RWA) of the above four algorithms. The five algorithms are described below.
WA1: the nodes connected to the common neighbors are taken as resource transmitters, and the resource divided among the nodes is expressed as the sum of the reciprocals of the degrees of the nodes connected to the common neighbors, namely:
Figure BDA0003050117460000122
WA2: the clustering coefficient of the common neighbor node is considered; the larger the clustering coefficient, the more central the neighbor node is in the network and the more easily resources are dispersed, that is:
Figure BDA0003050117460000123
where c (z) represents the clustering coefficient, p,
Figure BDA0003050117460000124
representing the algorithm parameters.
WA3: the case in which nodes share resources again after a first round of sharing is considered, namely:
Figure BDA0003050117460000131
WA4: the resource dispersion of the nodes and the energy dispersion of the common neighbor nodes are combined, namely:
Figure BDA0003050117460000132
RWA: the combined algorithm obtained by combining RA, WA2 and WA3, namely:
Figure BDA0003050117460000133
where α, β and γ are, respectively, the weight parameters of
Figure BDA0003050117460000134
In this embodiment, a Flexible Link Prediction Model (FLPM) is provided. The FLPM considers the relation between the joint effect of the various structural features of a network and the prediction performance of an algorithm, and uses a decision tree model to select a suitable link prediction algorithm according to the network features. In addition, because a single link prediction algorithm frequently overestimates or underestimates, and an arbitrary combination of link prediction algorithms cannot guarantee a good result every time, a support vector machine model is introduced to identify generative algorithms, so that an efficiently combined flexible link prediction model is generated. For any scoring algorithm A, a generative algorithm is defined as follows: for all node pairs in the training sample, the scores of the two algorithms are added to give each node pair a new score, and the node pairs are ranked by the new score; if the number of top-ranked node pairs that actually have a connecting edge is greater than that obtained with A alone, the other algorithm is considered a generative algorithm of A.
Referring to fig. 2, a flow chart of a link prediction method in an embodiment of the present application is shown. As shown in fig. 2, a Decision Tree (DT) model is first trained using training network data. The basic idea of DT is to select the SLPA with the highest accuracy according to the structural characteristics of the community network, such as the average shortest path, the average degree, the node betweenness, the link betweenness and the average clustering coefficient. Referring to fig. 3, a schematic diagram of a decision tree model in a link prediction method in an embodiment of the present application is shown.
In order to evaluate the prediction performance of each index, this embodiment uses AUC (area under the receiver operating characteristic curve) as the standard for measuring link prediction accuracy. To compute it, an edge is randomly selected from the test set and its score is compared with that of a randomly selected nonexistent edge. If, in m independent comparisons, the test-set edge has the higher score m1 times, then the AUC value is:
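The definition of a generative algorithm given above can be sketched in Python as follows. The description does not say how many top-ranked node pairs are inspected, so the cutoff `k` is an assumption, as are the function names and the toy scores.

```python
def top_k_hits(scores, true_edges, k):
    """Count how many of the k highest-scored node pairs are real edges."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sum(1 for pair in ranked[:k] if pair in true_edges)


def is_generative(scores_a, scores_e, true_edges, k):
    """Check whether algorithm E improves on algorithm A when scores are added.

    scores_a and scores_e map the same node pairs to the scores of A and E.
    E counts as a generative algorithm of A if the summed scores place more
    real edges among the top k node pairs than the scores of A alone.
    """
    combined = {pair: scores_a[pair] + scores_e[pair] for pair in scores_a}
    return top_k_hits(combined, true_edges, k) > top_k_hits(scores_a, true_edges, k)


# Hypothetical training sample: four node pairs, two of which are real edges.
true_edges = {("n1", "n2"), ("n3", "n4")}
scores_a = {("n1", "n2"): 0.9, ("n3", "n4"): 0.2, ("n1", "n3"): 0.5, ("n2", "n4"): 0.1}
scores_e = {("n1", "n2"): 0.1, ("n3", "n4"): 0.8, ("n1", "n3"): 0.1, ("n2", "n4"): 0.2}

result = is_generative(scores_a, scores_e, true_edges, k=2)
```

In this toy data, A alone ranks only one real edge in its top two, while the summed scores rank both, so E is reported as a generative algorithm of A.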
Figure BDA0003050117460000135
AUC may thus be understood as the probability that the score of a randomly selected edge in the test set is higher than the score of a randomly selected nonexistent edge. When the network is large, estimating the AUC by random sampling reduces the computational complexity and improves the computational efficiency. Clearly, the greater the AUC value, the higher the accuracy of the algorithm.
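A minimal Python sketch of the sampling estimate follows. The AUC equation image is not reproduced in this text, and the description mentions only the count m1 of comparisons won by the test-set edge. Counting a tie as half a win is a common convention in the link prediction literature and is an assumption here, as are the function name and the fixed random seed.

```python
import random


def sampled_auc(test_scores, nonexistent_scores, m, seed=0):
    """Estimate AUC from m independent random comparisons.

    test_scores: scores of edges held out in the test set.
    nonexistent_scores: scores of node pairs that have no edge at all.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    higher = ties = 0
    for _ in range(m):
        s_test = rng.choice(test_scores)
        s_none = rng.choice(nonexistent_scores)
        if s_test > s_none:
            higher += 1
        elif s_test == s_none:
            ties += 1
    # Assumption: a tie contributes half a win.
    return (higher + 0.5 * ties) / m


perfect = sampled_auc([0.9, 0.8], [0.1, 0.2], m=1000)
uninformative = sampled_auc([0.5], [0.5], m=1000)
```

An algorithm that always scores real edges higher yields an estimate of 1, and one that cannot distinguish them at all yields 0.5 under the tie convention assumed above.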
The independent variable of the training sample can be defined as the network structure characteristic, and the dependent variable is the SLPA with the maximum AUC value.
The DT algorithm can be designed according to the C4.5 algorithm concept. The algorithm is described in detail as follows:
Let the whole training network set be T. Each network is assigned, according to its AUC value under each SLPA, to the class corresponding to the SLPA with the maximum AUC. Assuming there are K SLPAs, this gives a partition of T into $\{S_1, S_2, \ldots, S_K\}$. The prior probability is $P_i = |S_i| / |T|$, where $|T|$ denotes the number of networks in the data set. The information entropy required for classifying T is:
Figure BDA0003050117460000141
The networks in T are divided according to the network structure feature A (such as the average clustering coefficient), and the values of the network structure feature A are arranged in ascending order to obtain the sequence $\{A_1, A_2, \ldots, A_J\}$. The i-th division point ($1 \le i \le J-1$) is defined as $a_i = (A_i + A_{i+1})/2$, and T is divided into 2 subsets $T_1$ and $T_2$, where the value of the network structure feature A of the networks in $T_1$ satisfies $V(A, T_1) \in [A_1, a_i]$ and, similarly, $V(A, T_2) \in (a_i, A_J]$. Corresponding to this division, the information gain of the network structure feature A is:
$\mathrm{Gain}(A) = \mathrm{Info}(T) - \mathrm{Info}_A(T)$;
wherein
Figure BDA0003050117460000142
The information gain rate of A corresponding to the division point $a_i$ is:
Figure BDA0003050117460000143
wherein,
Figure BDA0003050117460000144
The information gain rate corresponding to each division point in the feature sequence $\{A_1, A_2, \ldots, A_J\}$ can be calculated, and the division point with the maximum gain rate is selected as the optimal branch threshold of the network structure feature A, namely $\mathrm{Threshold}(A) = \arg\max_{a_i,\, 1 \le i \le J-1} \mathrm{Gain\_Ratio}(A, a_i)$.
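The threshold search just described can be sketched in Python as follows. Because the equation images are not reproduced in this text, the sketch assumes the standard C4.5 forms of the entropy, the split information and the gain ratio. The function names and the toy data are hypothetical, and the split points are taken between distinct values only, so that neither subset is empty.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain_ratio) for one network structure feature.

    values[i] is the feature value of network i and labels[i] is the
    identifier of the best-performing link prediction algorithm for it.
    Returns (None, -1.0) if the feature takes a single value.
    """
    info_t = entropy(labels)
    distinct = sorted(set(values))  # ascending feature sequence A_1..A_J
    best = (None, -1.0)
    for left, right in zip(distinct, distinct[1:]):
        a_i = (left + right) / 2  # midpoint of two adjacent values
        t1 = [lab for v, lab in zip(values, labels) if v <= a_i]
        t2 = [lab for v, lab in zip(values, labels) if v > a_i]
        p1, p2 = len(t1) / len(labels), len(t2) / len(labels)
        info_a = p1 * entropy(t1) + p2 * entropy(t2)
        split_info = -(p1 * math.log2(p1) + p2 * math.log2(p2))
        ratio = (info_t - info_a) / split_info
        if ratio > best[1]:
            best = (a_i, ratio)
    return best


# Toy data: average clustering coefficient of four networks and the
# algorithm that achieved the highest AUC on each of them.
threshold, ratio = best_threshold([0.1, 0.2, 0.8, 0.9], ["RA", "RA", "PA", "PA"])
```

Running the same search for every feature and keeping the feature with the largest returned gain ratio gives the root node and its branch threshold.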
The main steps of the DT model training process proposed in this embodiment are as follows:
Step 1: calculate the structural characteristics of each community network, such as the average shortest path, the average degree, the node betweenness, the link betweenness and the average clustering coefficient; calculate the information gain rate of each feature attribute; select the attribute with the maximum information gain rate as the root node; and branch according to the optimal branch threshold of the root node.
Step 2: recursively build the branches of the tree, using the same method as Step 1, on the data subsets corresponding to the branches of each node attribute, and repeat until the samples in every branch node select the same SLPA.
Step 3: extract the decision rules. A decision rule can be obtained directly from the DT generated in Step 2, namely, the optimal SLPA suitable for a network is selected according to the structural characteristics of the community network, such as the average shortest path, the average degree, the node betweenness, the link betweenness and the average clustering coefficient.
In the process of training the DT model, the average AUC value of each SLPA is obtained, and the average AUC value can be used as the weight of the flexible link prediction model.
With continued reference to fig. 2, the generative algorithms of the best algorithm may be determined using an SVM model. Because the SVM model has particular advantages in small-sample, nonlinear and high-dimensional pattern recognition problems, it is adopted to screen for generative SLPAs that complement the optimal SLPA selected by the DT model. The SVM model judges whether two algorithms are mutual generative algorithms according to the difference between their scores on each node pair. The scoring difference between algorithms can be characterized by 7 indexes, namely the Euclidean distance, standardized Euclidean distance, Manhattan distance, Chebyshev distance, cosine distance, correlation distance and Spearman distance.
Given a training sample set $(x_i, y_i)$, $i = 1, 2, \ldots, l$, the vector x represents the above 7 distances between the scores of the two algorithms, and $y \in \{-1, 1\}$, where 1 indicates that the two algorithms are mutual generative algorithms and -1 indicates that they are not. The hyperplane is denoted as $(w \cdot x) + b = 0$. For the linearly inseparable case, the main idea of the SVM is to map the input vector to a high-dimensional feature space and construct the optimal classification plane in that space. To improve algorithm efficiency, the nonlinear Gaussian radial basis function (RBF) $K(x_i, x) = \exp(-\|x - x_i\|^2 / \delta^2)$ can be selected as the kernel function. For an input vector z, the optimal classification function can be obtained as:
Figure BDA0003050117460000151
where a, b and δ are constants.
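The classification function image is not reproduced in this text. As a hedged sketch, the Python code below assumes the standard kernel SVM decision function, the sign of $\sum_i a_i y_i K(x_i, z) + b$, together with a Gaussian kernel of width δ. The exact placement of δ in the kernel is an assumption. In practice the coefficients would come from training; here they are hand-picked, hypothetical values.

```python
import math


def rbf_kernel(x_i, x, delta):
    """Gaussian kernel exp(-||x - x_i||^2 / delta^2) (assumed form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_i))
    return math.exp(-sq_dist / delta ** 2)


def classify(z, support_vectors, labels, coeffs, b, delta):
    """Return 1 (mutual generative algorithms) or -1 for the feature vector z."""
    total = sum(
        a * y * rbf_kernel(x_i, z, delta)
        for a, y, x_i in zip(coeffs, labels, support_vectors)
    )
    return 1 if total + b >= 0 else -1


# Hypothetical two-dimensional example with one support vector per class.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
labels = [1, -1]
coeffs = [1.0, 1.0]

near_positive = classify([0.1, 0.0], support_vectors, labels, coeffs, b=0.0, delta=1.0)
near_negative = classify([0.9, 1.0], support_vectors, labels, coeffs, b=0.0, delta=1.0)
```

With the real model, z would be the vector of distances between the scores of the target algorithm and those of one candidate algorithm.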
In order to further improve the accuracy of the prediction, FLPM uses a combined prediction model, which mainly comprises the best scoring link prediction algorithm B screened out by the DT and the generative algorithms of B identified by the SVM (namely $E_1, E_2, \ldots, E_i$). The model is specified as follows:
$S = w \cdot (B, E_1, E_2, \ldots, E_i)^{T}$
where S is the flexible link prediction model; w is a vector formed by the weights corresponding to the target link prediction algorithm and its generative algorithms; B is the target link prediction algorithm; $E_1, E_2, \ldots, E_i$ are the generative algorithms of the target link prediction algorithm; i is the number of generative algorithms; and the superscript T denotes matrix transposition.
In the above specific embodiment, the FLPM is provided to implement friend group recommendation, and personalized marketing is further implemented by utilizing the influence of friend groups. The model overcomes the drawback of traditional SLPAs, in which a large amount of expert experience is needed to select a suitable algorithm and the prediction effect is therefore unstable in practice. The DT model adaptively selects the SLPA suitable for a network from the designed algorithm set according to the network structure features rather than subjective experience. In addition, to overcome the performance degradation of a combined prediction model caused by frequent overestimation or underestimation by different SLPAs, an SVM-based method for identifying generative algorithms is provided, so that the SLPAs in the FLPM combined model complement each other's strengths. Thus, FLPM provides strong support for marketers to make the best use of the online community for a certain class of products and improve marketing efficiency.
In order to better understand the present solution and the beneficial effects thereof, an experimental process of performing link prediction by using the link prediction method in the above specific embodiment is given below.
In this experiment, 971 ego-net datasets from Twitter, provided by Stanford University, were chosen to validate the FLPM. An ego-net is a community network composed of a user and the user's followers; the central node of the network can represent the brand enterprise, and the data set divides the community members into circles that like different products. In each experiment, 777 of the 971 networks were randomly selected as the training set, and the other 194 networks were used as the test set. S1-S25 denote the 25 algorithms, with Table 2 giving the algorithms without parameters and Table 3 giving the algorithms with parameters.
TABLE 2
Figure BDA0003050117460000161
TABLE 3
Figure BDA0003050117460000162
In the DT experiment, the 25 algorithms were screened based on the training network set data, and 15 algorithms were selected: S6, S7, S8, S9, S10, S11, S13, S15, S16, S17, S19, S20, S23, S24 and S25. For each selected algorithm, an SVM was trained with the training set to find its generative algorithms, so that the 15 algorithms correspond to 15 SVMs. The FLPM is then generated according to the formula $S = w \cdot (B, E_1, E_2, \ldots, E_i)^{T}$.
To evaluate the predicted performance of the algorithm, 100 experiments were performed on the test set, and the average AUC value was taken as an index for evaluating the computational performance, with the results shown in table 4.
TABLE 4
Figure BDA0003050117460000171
A comparison of the performance of all algorithms is given in Table 4, where FLPM1 represents the FLPM with weighted combination, FLPM2 represents the FLPM without weighting, i.e., with all weights equal to 1, and single DT represents prediction using the algorithm selected directly by the DT model. Fig. 4 shows a comparison of the prediction performance of the various scoring link prediction algorithms, that is, of the classical SLPAs and the five resource-allocation-based SLPAs proposed in this embodiment. Fig. 5 shows a comparison of the prediction performance of the flexible link prediction model, the target link prediction algorithm selected by the decision tree model, and the preferred scoring link prediction algorithms, that is, of FLPM, DT and the preferred SLPAs.
As can be seen from fig. 4, the accuracy of 18 algorithms, i.e., S7, S9, S10, S11, S12, S14, S15, S16, S17, S18, S19, S20, S21, S22, S23, S24, and S25, is significantly higher than that of the other 7 algorithms; except for S7, these are new algorithms proposed by the present application based on the resource allocation concept. As can be seen from fig. 5, FLPM improves significantly on these better-performing algorithms. It can also be observed from fig. 5 that the weighted combination FLPM1 is better than the unweighted combination FLPM2, which indicates that setting weights based on AUC, as proposed in this embodiment, is effective. The performance of FLPM1 and FLPM2 is better than that of DT and the other combined algorithms, which indicates that the mechanism proposed in this embodiment for selecting generative algorithms by using the SVM model is effective. Finally, it can be seen from Table 4 that the accuracy of DT is higher than that of the other non-combined algorithms, which shows that the mechanism proposed in this embodiment for selecting a suitable link prediction algorithm based on the network structure features is effective. These results indicate that FLPM can provide reliable and accurate friend recommendation predictions for users across the friend groups of different products in a brand community.
Further, the accuracy of FLPM on the friend circle networks formed in the brand community around different classes of products can be analyzed. Networks with 2 to 19 product classes were selected according to network size, and the accuracy with which FLPM recommends connections between different product circles was analyzed. The prediction accuracy is given in Table 5. It can be seen from Table 5 that FLPM has high recommendation accuracy, with an average value of 0.9003, in networks with either more or fewer product circles. Referring to fig. 6, the relationship between the number of nodes in a circle and the AUC is shown. It can be seen from fig. 6 that the accuracy of FLPM increases as the number of nodes in the circle increases, which shows that FLPM friend group recommendation performs better for brand community networks in which more overlapping nodes may exist.
TABLE 5
Figure BDA0003050117460000181
Based on the same inventive concept, the embodiment of the present application further provides a link prediction apparatus, as described in the following embodiments. Because the principle by which the link prediction apparatus solves the problem is similar to that of the link prediction method, the implementation of the link prediction apparatus can refer to the implementation of the link prediction method, and repeated details are omitted. As used hereinafter, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated. Fig. 7 is a block diagram of the structure of a link prediction apparatus according to an embodiment of the present application. As shown in fig. 7, the apparatus includes an obtaining module 701 and a generating module 702, the structure of which is described below.
The obtaining module 701 is configured to obtain a training network set and an algorithm label set, where the training network set includes a plurality of network structure features of each of a plurality of community networks, and the algorithm label set includes a link prediction algorithm identifier corresponding to each of the plurality of community networks in the training network set.
The generating module 702 is configured to generate a decision tree model based on the training network set and the algorithm label set, so as to determine a target link prediction algorithm corresponding to the target community network by using the decision tree model and a plurality of network structure features of the target community network, and to perform link prediction on the target community network according to the target link prediction algorithm.
From the above description, it can be seen that the embodiments of the present application achieve the following technical effects: the decision tree model is trained according to the network structure characteristics of the community network and the corresponding scoring link prediction algorithm, so that the scoring link prediction algorithm suitable for the community network can be determined according to the network structure characteristics of the target community network by using the decision tree model, the scoring link prediction algorithm suitable for various community networks can be quickly selected without subjective experience, the reliability is high, and the labor cost and the time cost can be saved. And then, the selected scoring link prediction algorithm can be used for carrying out link prediction on the community network, the prediction accuracy is high, and the prediction result can be applied to the prediction of friend links between friend group node pairs of the community network, so that the influence of groups on individuals can be utilized, and the recommendation efficiency of a certain product or service can be improved by best utilizing the group effect.
An embodiment of the present application further provides a computer device. Fig. 8 is a schematic structural diagram of a computer device based on the link prediction method provided in the embodiments of the present application. The computer device may specifically include an input device 81, a processor 82, and a memory 83. The memory 83 is configured to store processor-executable instructions. The processor 82, when executing the instructions, performs the steps of the link prediction method described in any of the above embodiments.
In this embodiment, the input device may be one of the main apparatuses for exchanging information between a user and a computer system. The input device may include a keyboard, a mouse, a camera, a scanner, a light pen, a handwriting input board, a voice input device, and the like; it is used to input raw data, and programs for processing the data, into the computer. The input device can also acquire and receive data transmitted by other modules, units, and devices.
The processor may be implemented in any suitable way. For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium that stores computer-readable program code (e.g., software or firmware) executable by the (micro)processor, or the form of logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth.
The memory may in particular be a memory device used in modern information technology for storing information. The memory may comprise multiple levels. In a digital system, anything that can store binary data may serve as the memory; in an integrated circuit, a circuit that has a storage function but no standalone physical form is also called a memory, such as a RAM or a FIFO; in a system, a storage device in physical form is also called a memory, such as a memory module or a TF card.
In this embodiment, for the functions and effects specifically implemented by the computer device, reference may be made to the other embodiments; details are not repeated here.
An embodiment of the present application further provides a computer storage medium based on the link prediction method. The computer storage medium stores computer program instructions which, when executed, implement the steps of the link prediction method in any of the above embodiments.
In this embodiment, the storage medium includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Cache (Cache), a Hard Disk Drive (HDD), or a Memory Card (Memory Card). The memory may be used to store computer program instructions. The network communication unit may be an interface for performing network connection communication, which is set in accordance with a standard prescribed by a communication protocol.
In this embodiment, for the functions and effects specifically realized by the program instructions stored in the computer storage medium, reference may be made to the other embodiments; details are not repeated here.
It will be apparent to those skilled in the art that the modules or steps of the embodiments of the present application described above may be implemented by a general-purpose computing device. They may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that described herein. They may also be fabricated separately as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
It is to be understood that the above description is intended to be illustrative, and not restrictive. Many embodiments and many applications other than the examples provided will be apparent to those of skill in the art upon reading the above description. The scope of the application should, therefore, be determined not with reference to the above description, but instead should be determined with reference to the pending claims along with the full scope of equivalents to which such claims are entitled.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and it will be apparent to those skilled in the art that various modifications and variations can be made in the embodiment of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (11)

1. A method of link prediction, comprising:
acquiring a training network set and an algorithm label set, wherein the training network set comprises a plurality of network structure characteristics of each community network in a plurality of community networks, and the algorithm label set comprises link prediction algorithm identifications corresponding to each community network in the plurality of community networks in the training network set;
generating a decision tree model based on the training network set and the algorithm label set, so as to determine a target link prediction algorithm corresponding to the target community network by using the decision tree model and a plurality of network structure characteristics of the target community network, and performing link prediction on the target community network according to the target link prediction algorithm.
2. The method of claim 1, wherein acquiring the training network set and the algorithm label set comprises:
acquiring a training network set, wherein the training network set further comprises a node set and an edge set of each community network in a plurality of community networks;
determining unconnected node pairs in each community network according to the node set and the edge set of each community network;
assigning score values to the unconnected node pairs in each community network by using each candidate link prediction algorithm in a plurality of candidate link prediction algorithms to obtain a plurality of total score values corresponding to each community network, wherein each total score value in the plurality of total score values corresponding to each community network corresponds to one of the candidate link prediction algorithms;
and determining the identifier of the candidate link prediction algorithm corresponding to the maximum total score value among the plurality of total score values corresponding to each community network as the algorithm label corresponding to that community network, to obtain the algorithm label set.
3. The method of claim 1, wherein generating a decision tree model based on the training network set and the algorithm label set comprises:
calculating, according to the algorithm label set, the information entropy of the training network set when the training network set is classified by the link prediction algorithm corresponding to each community network;
determining a target information gain rate corresponding to each network structure feature in the plurality of network structure features based on the information entropy;
taking the network structure characteristic with the maximum target information gain rate as a root node, and determining a branch threshold corresponding to the root node;
branching the training network set according to the branching threshold value of the root node to obtain a first training network subset and a second training network subset;
branching the first training network subset until the link prediction algorithm identifiers in the algorithm label subsets corresponding to the training network subsets obtained after branching are the same; and branching the second training network subset until the link prediction algorithm identifiers in the algorithm label subsets corresponding to the training network subsets obtained after branching are the same.
4. The method of claim 3, wherein determining a target information gain rate for each of the plurality of network structure features based on the information entropy comprises:
sorting the plurality of values of each network structure feature in ascending order to obtain a feature sequence corresponding to each network structure feature, wherein each value in the plurality of values corresponds to one community network in the plurality of community networks;
dividing the training network set at each division point in a plurality of division points in the feature sequence corresponding to each network structure feature, and calculating the corresponding information gain rate based on the information entropy, to obtain a plurality of information gain rates corresponding to each network structure feature, wherein the plurality of division points in the feature sequence corresponding to each network structure feature comprise the midpoints of each two adjacent values in that feature sequence;
determining the maximum information gain rate among the plurality of information gain rates corresponding to each network structure feature as the target information gain rate corresponding to that network structure feature;
correspondingly, determining the branch threshold corresponding to the root node includes:
and taking the division point corresponding to the target information gain rate of the root node as a branch threshold corresponding to the root node.
5. The method of claim 1, further comprising:
acquiring a training sample set and a classification label set;
constructing a support vector machine model based on the training sample set and the classification label set, and determining a generative algorithm of the target link prediction algorithm corresponding to the target community network by using the support vector machine model;
wherein the training sample set comprises a plurality of feature vectors, each feature vector in the plurality of feature vectors represents the difference between the scores assigned to the same community network by the two link prediction algorithms of one algorithm pair in a plurality of algorithm pairs; the classification label set indicates whether the two link prediction algorithms corresponding to each feature vector in the training sample set are mutually generative algorithms; each algorithm pair comprises two link prediction algorithms among a plurality of candidate link prediction algorithms; and two link prediction algorithms are mutually generative algorithms when the accuracy of predicting the community network by combining the two link prediction algorithms is greater than the accuracy of predicting the community network by either of the two link prediction algorithms alone.
6. The method of claim 5, wherein determining a generative algorithm of a target link prediction algorithm corresponding to the target community network using the support vector machine model comprises:
performing link prediction on the target community network by using the target link prediction algorithm to obtain a target link prediction result;
performing link prediction on the target community network by using each algorithm in the multiple candidate link prediction algorithms to obtain multiple candidate link prediction results;
determining a plurality of feature vectors according to the target link prediction result and the plurality of candidate link prediction results, wherein each feature vector in the plurality of feature vectors corresponds to one algorithm in the plurality of candidate link prediction algorithms;
and inputting the plurality of feature vectors into the support vector machine model, and determining whether each algorithm in the plurality of candidate link prediction algorithms is a generative algorithm of the target link prediction algorithm.
7. The method of claim 5, wherein performing link prediction for the target community network according to the target link prediction algorithm comprises:
generating a flexible link prediction model according to the target link prediction algorithm and the generative algorithm of the target link prediction algorithm;
and performing link prediction on the target community network based on the flexible link prediction model.
8. The method of claim 7, wherein the flexible link prediction model is:
$S = w \cdot (B, E_1, E_2, \ldots, E_i)^{T}$
wherein S is the flexible link prediction model; w is a vector formed by the weights corresponding to the target link prediction algorithm and to the generative algorithms of the target link prediction algorithm; B is the target link prediction algorithm; $E_1, E_2, \ldots, E_i$ are the generative algorithms of the target link prediction algorithm; i is the number of generative algorithms of the target link prediction algorithm; and the superscript T denotes matrix transposition.
9. A link prediction apparatus, comprising:
an acquisition module, configured to acquire a training network set and an algorithm label set, wherein the training network set comprises a plurality of network structure features of each community network in a plurality of community networks, and the algorithm label set comprises a link prediction algorithm identifier corresponding to each community network in the plurality of community networks in the training network set;
and a generating module, configured to generate a decision tree model based on the training network set and the algorithm label set, so as to determine a target link prediction algorithm corresponding to a target community network by using the decision tree model and a plurality of network structure features of the target community network, and to perform link prediction on the target community network according to the target link prediction algorithm.
10. A computer device comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement the steps of the method of any one of claims 1 to 8.
11. A computer-readable storage medium having computer instructions stored thereon which, when executed, implement the steps of the method of any one of claims 1 to 8.
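
The weighted combination recited in claim 8 can be read as a per-node-pair weighted sum of the scores produced by the target link prediction algorithm B and its generative algorithms E1 to Ei. The following minimal sketch illustrates that reading. The function name, the dictionary-based score representation, and the example weights are assumptions made for illustration; this section does not state how the weight vector w is obtained.

```python
def flexible_score(weights, target_scores, generative_scores):
    """Combine per-pair scores as S = w . (B, E1, ..., Ei)^T.

    weights           -- [w_B, w_E1, ..., w_Ei], one weight per algorithm
    target_scores     -- dict mapping a node pair to the score from B
    generative_scores -- list of dicts, one per generative algorithm E1..Ei
    """
    components = [target_scores] + list(generative_scores)
    if len(weights) != len(components):
        raise ValueError("need exactly one weight per algorithm")
    # A pair missing from a generative algorithm's scores contributes 0.
    return {
        pair: sum(w * comp.get(pair, 0.0) for w, comp in zip(weights, components))
        for pair in target_scores
    }

# Usage with hypothetical scores for two unconnected node pairs.
B = {(1, 3): 1.0, (2, 4): 2.0}
E1 = {(1, 3): 4.0, (2, 4): 0.0}
E2 = {(1, 3): 2.0, (2, 4): 2.0}
S = flexible_score([0.5, 0.25, 0.25], B, [E1, E2])
```

Node pairs with the highest combined score would then be the predicted links.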
CN202110485583.1A 2021-04-30 2021-04-30 Link prediction method and device Pending CN113033709A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110485583.1A CN113033709A (en) 2021-04-30 2021-04-30 Link prediction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110485583.1A CN113033709A (en) 2021-04-30 2021-04-30 Link prediction method and device

Publications (1)

Publication Number Publication Date
CN113033709A (en) 2021-06-25

Family

ID=76454893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110485583.1A Pending CN113033709A (en) 2021-04-30 2021-04-30 Link prediction method and device

Country Status (1)

Country Link
CN (1) CN113033709A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115278811A (en) * 2022-07-28 2022-11-01 中国科学院计算技术研究所 MPTCP connection path selection method based on decision tree model
CN117151279A (en) * 2023-08-15 2023-12-01 哈尔滨工业大学 Isomorphic network link prediction method and system based on line graph neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination