CN118055030A - Propagation network reconstruction method, system, storage medium and equipment

Info

Publication number: CN118055030A
Application number: CN202410439289.0A
Authority: CN
Inventors: 黄浩
Original assignee: Jiangxi Qiushi Higher Research Institute
Current assignee: Jiangxi Qiushi Higher Research Institute
Priority date: 2024-04-12
Filing date: 2024-04-12
Publication date: 2024-05-17

Abstract

The application provides a propagation network reconstruction method, system, storage medium and device. The method comprises: obtaining historical observation data of network nodes to obtain a historical observation data set; calculating a scoring function for every two nodes in the network according to the historical observation data set to obtain a scoring set, and calculating candidate parent nodes of each node from the scoring set to obtain a candidate parent node set; calculating the final parent nodes of each node by combining the candidate parent nodes, and obtaining a final parent node set from the final parent nodes of each node; and generating directed edges from the parent nodes to each node according to the node and its parent nodes in the final parent node set, thereby obtaining the propagation network topology and reconstructing the propagation network structure. The application can reconstruct the network from the node states at the final observation alone, without requiring the exact time at which each node was infected or the infection state of the nodes during each diffusion process; it also requires no probability estimation, so that more accurate results can be obtained from a small number of data samples.

Description

Propagation network reconstruction method, system, storage medium and equipment

Technical Field

The present invention relates to the field of information propagation technologies, and in particular, to a propagation network reconstruction method, system, storage medium, and device.

Background

A propagation network is a mathematical model for studying how substances and information spread. The spread of opinions, rumors, and diseases is typically modeled as a probabilistic process over a propagation network. In the network, directed edges represent parent-child relationships, in which a parent node can influence a child node with a certain probability. Propagation network structure reconstruction aims to infer the propagation network structure (i.e., the topology of the influence relationships) from observed data. In most cases, this influence is not directly visible, and only a limited number of historical propagation processes can be observed. How to recover an accurate propagation network structure from a limited amount of historical propagation data is of considerable interest in fields such as social networks, viral marketing, and epidemic prevention, because the reconstructed structure intuitively reveals potential interactions between nodes, is crucial for formulating strategies to control future propagation processes, and can help researchers better predict, promote, or organize future propagation of substances and information.

However, conventional propagation network reconstruction has two limitations. On the one hand, existing reconstruction methods assume that the observed data contain the exact time at which each node was infected and the infection state of the nodes during each diffusion process; in less ideal, more realistic settings, the infection times are unknown. On the other hand, existing methods are often probability-based: probabilities must be estimated, the result is exactly accurate only as the amount of known data tends to infinity, and an accurate result can be obtained only from a large amount of sample data, so the outcome depends strongly on the sample size.

Disclosure of Invention

Based on this, the present invention aims to provide a propagation network reconstruction method, system, storage medium and device, to solve the technical problem in the prior art that propagation network reconstruction relies on the exact infection time of each node and yields a relatively accurate result only with a large amount of sample data.

In one aspect, the present invention provides a method for reconstructing a propagation network, including:

Acquiring historical observation data of a network node to obtain a historical observation data set;

Calculating a scoring function f(v_i, v_y) of every two nodes in the network according to the historical observation dataset to obtain a scoring set, and calculating candidate parent nodes of each node from the scoring set to obtain a candidate parent node set C, wherein v_i and v_y are nodes;

Calculating the final parent nodes of each node by combining the candidate parent nodes, and obtaining a final parent node set F according to the final parent nodes of each node;

Generating, for each node v_i, a directed edge from each parent node in the final parent node set F corresponding to v_i to the node v_i, thereby obtaining a propagation network topology to reconstruct the propagation network structure;

Wherein the step of calculating a final parent node of each node in combination with the candidate parent nodes includes:

Obtaining a preset queue Q, a preset set T and a current score upper limit g_max, wherein the set T is initially an empty set, the queue Q is initialized as Q = { … }, and the current score upper limit g_max is preset to negative infinity (-∞);

Judging whether the queue Q is empty;

If not, taking an element E from the queue Q, and recording the sequence number j of the node corresponding to the element E;

Taking the (j+1)-th node from the candidate parent node set of node v_i, denoting it as node v_p, and adding it to the set T, wherein the final parent node set F_i of node v_i is preset to an empty set;

Calculating the scoring function f(T, v_i);

Judging whether the scoring function f(T, v_i) is greater than the current score upper limit g_max;

If it is greater than the current score upper limit g_max, updating the score upper limit g_max according to the scoring function f(T, v_i), updating the final parent node set F_i according to the set T, and calculating the score upper limit function g(C_i, T, v_i) according to the updated final parent node set F_i;

Judging whether the score upper limit function g(C_i, T, v_i) is greater than the score upper limit g_max;

If it is greater than the score upper limit g_max, adding the node v_p to the queue Q, and returning to the step of judging whether the queue Q is empty, until the number of candidate parent nodes of v_i is reached, so as to obtain the final parent node set F of any node v_i.
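The queue-based search above can be read as a branch-and-bound over subsets of the candidate parent nodes. The following is a minimal sketch of that reading, not the exact control flow of the method: the function name `search_final_parents`, the callables `score` and `upper_bound`, the initial queue content (an empty subset) and the toy weights are all assumptions made for illustration, and the bound test is applied to every trial subset rather than only to subsets that improved the score.

```python
from collections import deque

def search_final_parents(candidates, score, upper_bound):
    """Breadth-first branch-and-bound over subsets of `candidates`.

    score(T)       -> assumed stand-in for f(T, v_i)
    upper_bound(T) -> assumed stand-in for g(C_i, T, v_i): no superset of T scores higher
    """
    # Each queue element is (subset T, index of the next candidate to try).
    queue = deque([((), 0)])
    best_set, g_max = (), float("-inf")
    while queue:
        subset, start = queue.popleft()
        for p in range(start, len(candidates)):
            trial = subset + (candidates[p],)
            s = score(trial)
            if s > g_max:
                # A better parent set was found: raise the running limit.
                g_max, best_set = s, trial
            if upper_bound(trial) > g_max:
                # Extensions of `trial` might still beat g_max, so keep exploring.
                queue.append((trial, p + 1))
    return set(best_set), g_max

# Toy additive score (hypothetical): the best subset is {"a", "c"}.
weights = {"a": 2, "b": -1, "c": 3}
score = lambda T: sum(weights[v] for v in T)
upper_bound = lambda T: sum(w for w in weights.values() if w > 0)
parents, best = search_final_parents(["a", "b", "c"], score, upper_bound)
```

Once the running limit equals the bound, no further subsets are enqueued, which is the pruning effect the score upper limit function is meant to provide.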

According to the above propagation network reconstruction method, the network can be reconstructed from the node states at the final observation alone, without requiring the exact infection time of each node or the infection state of the nodes during each diffusion process. Furthermore, the technical solution of the application does not need to estimate probabilities; it converts the probability problem into a sampling problem and removes the dependence on the number of samples, so that more accurate results can be obtained from a small number of data samples. This solves the technical problem in the prior art that propagation network reconstruction relies on the exact infection time of each node and yields relatively accurate results only with a large amount of sample data.

In addition, the propagation network reconstruction method according to the present invention may further have the following additional technical features:

Further, the calculation formula of the scoring function f(C, v_i) is:

[Formula image not reproduced in the source text]

Where Tr(·) represents the trace of a matrix; J is an n×n matrix, where n is the number of records in the propagation network node historical observation dataset, J = I - 1/n, and I represents the n×n identity matrix; K_C and K_Vi are n×n symmetric matrices, the element in row i, column j of K_C is the inner product of the i-th record and the j-th record of the nodes represented by the set C, and the element in row i, column j of K_Vi is the inner product of the i-th record and the j-th record of node v_i.

Further, the calculation formula of the scoring upper limit function is:

[Formula image not reproduced in the source text]

Where n is the number of records; Δ_n represents the set of reorderings of the n records, of which there are n! in total; δ is one of the reordering schemes; and (v_i)_δ represents the n records of node v_i reordered according to δ.

Further, the step of calculating a scoring function f(v_i, v_y) of every two nodes in the network according to the historical observation dataset to obtain a scoring set includes: calculating the scoring function f(v_i, v_y) of every two nodes in the network to obtain n(n-1)/2 scores, which form the scoring set, denoted {f_{1,2}, f_{1,3}, …, f_{i,j}, …, f_{n-1,n}}.
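As an illustration of the pairwise step, the sketch below scores every unordered pair of nodes exactly once, which yields k(k-1)/2 scores for k nodes. The helper names `pairwise_scores` and `co_infection` and the toy dataset are hypothetical; `co_infection` is only a stand-in so the example runs and is not the scoring function of the method.

```python
from itertools import combinations

def pairwise_scores(D, score_fn):
    """Score every unordered node pair once; returns {(i, y): score}."""
    num_nodes = len(D[0])
    return {(i, y): score_fn(D, i, y)
            for i, y in combinations(range(num_nodes), 2)}

def co_infection(D, i, y):
    """Hypothetical stand-in score: number of records in which both nodes are infected."""
    return sum(rec[i] * rec[y] for rec in D)

# Three observation records over four nodes (toy data).
D = [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 0, 1]]
scores = pairwise_scores(D, co_infection)
```

With four nodes the dictionary holds 4·3/2 = 6 entries, keyed by node-index pairs (i, y) with i < y.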

Further, the step of calculating a candidate parent node for each node in the score set includes:

Applying K-Means clustering to the scoring set, wherein K is set to 2 and one cluster center is fixed at 0;

Selecting, from the clustered scoring set, the class whose cluster center is not 0, and for each score f(v_i, v_y) in that class, recording that node v_y has candidate parent node v_i.
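The clustering step can be sketched in one dimension as follows. This is a minimal illustration under stated assumptions: the free cluster center is initialized at the largest score, ties go to the zero cluster, and the names `nonzero_cluster` and `candidate_parents` are hypothetical. Only the direction stated in the text (v_i as a candidate parent of v_y) is recorded.

```python
def nonzero_cluster(scores, max_iter=100):
    """2-means on scalar scores with one center pinned at 0.

    scores: dict {(i, y): value}. Returns the pairs assigned to the free center.
    """
    values = list(scores.values())
    center = max(values)  # assumed initialization of the free center
    for _ in range(max_iter):
        # A score joins the free cluster if it is closer to `center` than to 0.
        members = [s for s in values if abs(s - center) < abs(s)]
        if not members:
            break
        new_center = sum(members) / len(members)
        if new_center == center:
            break
        center = new_center
    return {pair for pair, s in scores.items() if abs(s - center) < abs(s)}

def candidate_parents(scores):
    """Map each node y to the list of nodes i with (i, y) in the non-zero cluster."""
    candidates = {}
    for i, y in sorted(nonzero_cluster(scores)):
        candidates.setdefault(y, []).append(i)
    return candidates

scores = {(0, 1): 0.9, (0, 2): 0.05, (1, 2): 0.8, (0, 3): 0.0}
C = candidate_parents(scores)
```

Pinning one center at 0 means that weak, near-zero scores are always absorbed by that cluster, so only clearly dependent pairs survive as candidates.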

Another aspect of the invention provides a propagation network reconstruction system, the system comprising:

The acquisition module is used for acquiring historical observation data of the network node to acquire a historical observation data set;

The candidate parent node calculation module is used for calculating a scoring function f(v_i, v_y) of every two nodes in the network according to the historical observation data set to obtain a scoring set, and calculating candidate parent nodes of each node from the scoring set to obtain a candidate parent node set C, wherein v_i and v_y are nodes;

The final parent node calculation module is used for calculating the final parent nodes of each node by combining the candidate parent nodes and obtaining a final parent node set F according to the final parent nodes of each node;

A reconstruction module, configured to generate, for each node v_i, a directed edge from each parent node in the final parent node set F corresponding to v_i to the node v_i, so as to obtain a propagation network topology to reconstruct the propagation network structure;

The final parent node calculation module comprises:

Judging whether the queue Q is an empty set or not;

Calculating the scoring function f(T, v_i);

Another aspect of the invention provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements a propagation network reconstruction method as described above.

In another aspect, the present invention also provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing a propagation network reconstruction method as described above when executing the program.

Drawings

FIG. 1 is a flow chart of a method of propagation network reconstruction in an embodiment of the present invention;

The invention will be further described in the following detailed description in conjunction with the above-described figures.

Detailed Description

In order that the invention may be readily understood, a more complete description of the invention will be rendered by reference to the appended drawings. Several embodiments of the invention are presented in the figures. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.

In order to solve the technical problem in the prior art that propagation network reconstruction relies on the exact infection time of each node and yields a relatively accurate result only with a large amount of sample data, the application provides a propagation network reconstruction method, system, storage medium and device. With this approach, the network can be reconstructed from the node states at the final observation alone, without requiring the exact infection time of each node or the infection state of the nodes during each diffusion process. Furthermore, the technical solution of the application does not need to estimate probabilities; it converts the probability problem into a sampling problem and removes the dependence on the number of samples, so that more accurate results can be obtained from a small number of data samples.

The propagation network is a directed graph G = {V, E}, where V represents the set of vertices (nodes) and E represents the set of directed edges. Some nodes are initially "infected" and carry a state flag of 1, while uninfected nodes carry a state flag of 0; at regular intervals, infected nodes infect the neighboring nodes pointed to by their outgoing directed edges with a certain probability p. This is called the IC (independent cascade) propagation model of the propagation network. At the end of a period of time, the infection states of all nodes in the network are recorded, which is called a historical infection (observation) record.

The propagation network node historical observation dataset D = {D_1, D_2, …, D_n} has n records in total, each representing the final diffusion (infection) result of one propagation process on the network, wherein each record D_i = (d_{i,1}, d_{i,2}, …, d_{i,m}) is an m-dimensional vector and each component d_{i,j} ∈ {0, 1}: 0 indicates that node v_j was not infected in this diffusion, and 1 indicates that it was infected.
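To make the data format concrete, the following sketch simulates an independent cascade and keeps only the final 0/1 state of each node, producing records of the kind described above. It assumes the common one-shot variant of the model (each newly infected node gets a single chance to infect each child); the function names, the tiny edge list and the parameter values are hypothetical.

```python
import random

def simulate_ic(edges, num_nodes, seed_ratio, p, rng):
    """Run one cascade and return the final 0/1 state of every node."""
    children = {v: [] for v in range(num_nodes)}
    for parent, child in edges:
        children[parent].append(child)
    num_seeds = max(1, round(seed_ratio * num_nodes))
    infected = set(rng.sample(range(num_nodes), num_seeds))
    frontier = list(infected)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in children[u]:
                # One infection attempt along each outgoing edge, success probability p.
                if v not in infected and rng.random() < p:
                    infected.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return [1 if v in infected else 0 for v in range(num_nodes)]

def build_dataset(edges, num_nodes, seed_ratio, p, n_records, seed=0):
    """Collect n_records final-state vectors; infection times are discarded."""
    rng = random.Random(seed)
    return [simulate_ic(edges, num_nodes, seed_ratio, p, rng)
            for _ in range(n_records)]

edges = [(0, 1), (1, 2), (0, 3)]  # toy network: 0 -> 1 -> 2 and 0 -> 3
D = build_dataset(edges, num_nodes=4, seed_ratio=0.25, p=0.5, n_records=5)
```

Each element of `D` is one m-dimensional record; neither the order of infection nor the seed nodes are stored, which matches the setting in which only final states are observed.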

In the propagation network, if there is a directed edge from node v_i to node v_j, node v_i is referred to as a parent node of node v_j. The invention defines a scoring function f(C, v_i), wherein v_i represents the i-th node of the propagation network and C represents the candidate parent node set of v_i; f(C, v_i) measures how well the node set C matches as the parent node set of node v_i, and the greater f(C, v_i), the better the match.

Wherein, the calculation formula of the scoring function f(C, v_i) is:

[Formula image not reproduced in the source text]

Where Tr(·) represents the trace of a matrix; J is an n×n matrix, where n is the number of records in the propagation network node historical observation dataset, J = I - 1/n, and I represents the n×n identity matrix; K_C and K_Vi are n×n symmetric matrices, and the element in row i, column j of K_C is the inner product of the i-th record and the j-th record of the nodes represented by the set C. For example, if the set C has three nodes {v_2, v_11, v_17} whose states in the 5th record are (1, 0, 1), written as a vector, and whose states in the 7th record are (1, 1, 1), also written as a vector, the inner product of the two vectors is 2, so the element in row 5, column 7 of the matrix K_C is 2. K_Vi is defined in the same way: the element in row i, column j of K_Vi is the inner product, i.e., the product, of the i-th record and the j-th record of node v_i.
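The matrices defined above can be computed directly from the records. The sketch below builds the inner-product matrices and then combines them in a trace of the form Tr(K_C J K_Vi J). That combination is an assumption: the formula itself is not available in this text, so the trace form, the reading of J as the identity minus an all-(1/n) matrix, the absence of any normalization factor, and the names `gram_matrix`, `center` and `score` are all illustrative.

```python
def gram_matrix(D, nodes):
    """K[a][b] = inner product of records a and b restricted to `nodes`."""
    rows = [[rec[v] for v in nodes] for rec in D]
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in rows] for ra in rows]

def center(K):
    """Return J K J for J = I - (1/n) * ones, via double centering."""
    n = len(K)
    row_mean = [sum(r) / n for r in K]
    col_mean = [sum(K[a][b] for a in range(n)) / n for b in range(n)]
    total = sum(row_mean) / n
    return [[K[a][b] - row_mean[a] - col_mean[b] + total for b in range(n)]
            for a in range(n)]

def score(D, C, i):
    """Assumed trace-form score Tr(K_C J K_Vi J), without normalization."""
    Kc = center(gram_matrix(D, C))   # J K_C J
    Kv = gram_matrix(D, [i])         # K_Vi
    n = len(D)
    # Tr(A B) = sum over a, b of A[a][b] * B[b][a]
    return sum(Kc[a][b] * Kv[b][a] for a in range(n) for b in range(n))

# Inner-product example: state vectors (1, 0, 1) and (1, 1, 1) give 2.
K = gram_matrix([[1, 0, 1], [1, 1, 1]], [0, 1, 2])

# Toy data: node 1 copies node 0, node 2 is always infected.
D = [[1, 1, 1], [0, 0, 1], [1, 1, 1], [0, 0, 1]]
dependent = score(D, [0], 1)
constant = score(D, [0], 2)
```

Under these assumptions the score is positive for the pair of nodes whose states move together and zero for the node whose state never varies, which is the behavior expected of a dependence measure.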

It should be further noted that the parameters of the scoring function are sets: a single node is regarded as a set with only one element, and multiple nodes as a set with multiple elements, so both the scoring function f(v_i, v_y) and the scoring function f(T, v_i) can be calculated by applying the calculation formula of the scoring function f(C, v_i).

The invention also defines a score upper limit function g(C, C', v_i), which means that for node v_i, whose candidate parent node set is C and whose determined parent node set is C' ⊆ C, no matter which new parent nodes are added to C', the score f(C', v_i) cannot exceed g(C, C', v_i). The calculation formula of the score upper limit function is as follows:

[Formula image not reproduced in the source text]

In order to facilitate an understanding of the invention, several embodiments of the invention will be presented below. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.

Example 1

Referring to FIG. 1, a propagation network reconstruction method according to a first embodiment of the present invention is shown; the method includes steps S101 to S104:

S101, acquiring historical observation data of network nodes to obtain a historical observation data set.

S102, calculating a scoring function f(v_i, v_y) of every two nodes in the network according to the historical observation data set to obtain a scoring set, and calculating candidate parent nodes of each node from the scoring set to obtain a candidate parent node set C, wherein v_i and v_y are nodes.

As a specific example:

The scoring function f(v_i, v_y) of every two nodes in the network is calculated to obtain n(n-1)/2 scores, which form the scoring set, denoted {f_{1,2}, f_{1,3}, …, f_{i,j}, …, f_{n-1,n}}.

S103, calculating the final parent node of each node by combining the candidate parent nodes, and obtaining a final parent node set F according to the final parent node of each node.

In this embodiment, the step of calculating the final parent node of each node in combination with the candidate parent node includes:

Judging whether the queue Q is an empty set or not;

Calculating the scoring function f(T, v_i);

S104, generating, for each node v_i, a directed edge from each parent node in the final parent node set F corresponding to v_i to the node v_i, thereby obtaining a propagation network topology to reconstruct the propagation network structure.
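Step S104 amounts to turning the per-node parent sets into an edge list. A minimal sketch, assuming the final parent node sets are held in a dictionary keyed by node index (the name `build_edges` and the sample sets are hypothetical):

```python
def build_edges(final_parents):
    """final_parents: {node: set of parent nodes}. Returns sorted (parent, node) edges."""
    return sorted((parent, node)
                  for node, parents in final_parents.items()
                  for parent in parents)

# Node 2 has parents {0, 1}, node 1 has parent {0}, node 0 has none.
F = {0: set(), 1: {0}, 2: {0, 1}}
edges = build_edges(F)
```

Each tuple is a directed edge pointing from a parent to its child, so the list as a whole is the reconstructed topology.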

Taking the DUNF dataset as an example, DUNF is a real-world blog network dataset containing 750 users, which represent nodes, and 2974 follow relationships, which represent directed edges between nodes. The IC propagation model is applied to the network, different initial infection node proportions (0.05, 0.1, 0.15, 0.2, 0.25) are selected, and 5×200 historical observation records are obtained through simulation.

The propagation network reconstruction method of the present invention and the prior-art method TENDS are applied to this dataset, and after all steps are completed, recall and precision are recorded. Recall is the proportion of the real edges of the propagation network that the method correctly finds, and precision is the proportion of the edges found by the method that are correct. The harmonic mean of recall and precision gives the F-score, which reflects the combined performance of the method on recall and precision; its range is [0, 1], and the higher the value, the better the performance. The results are shown in Table 1:
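The three metrics can be computed from the true and reconstructed edge sets as sketched below. The function name `evaluate` and the sample edges are hypothetical, and the numbers are illustrative only, not results from the experiments.

```python
def evaluate(true_edges, found_edges):
    """Return (precision, recall, F-score) for directed edge sets."""
    true_edges, found_edges = set(true_edges), set(found_edges)
    correct = len(true_edges & found_edges)
    precision = correct / len(found_edges) if found_edges else 0.0
    recall = correct / len(true_edges) if true_edges else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

true_edges = [(0, 1), (1, 2), (0, 3), (2, 3)]
found_edges = [(0, 1), (1, 2), (3, 0)]  # (3, 0) has the wrong direction
precision, recall, f_score = evaluate(true_edges, found_edges)
```

Because edges are directed, an edge found with the wrong orientation counts as an error for both precision and recall.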

Table 1:

Taking the DPU dataset as an example, DPU is a larger blog network dataset containing 1038 users and 11385 follow relationships. The IC propagation model is applied to the network, different initial infection node proportions (0.05, 0.1, 0.15, 0.2, 0.25) are selected, and 5×200 historical observation records are obtained through simulation; the propagation network reconstruction method of the present invention and the prior-art method TENDS are applied to the dataset, and recall and precision are recorded after all steps are completed. The results are shown in Table 2:

Table 2:

In summary, with the propagation network reconstruction method of the above embodiment of the present application, the network can be reconstructed from the node states at the final observation alone, without requiring the exact infection time of each node or the infection state of the nodes during each diffusion process. Furthermore, the technical solution of the application does not need to estimate probabilities; it converts the probability problem into a sampling problem and removes the dependence on the number of samples, so that more accurate results can be obtained from a small number of data samples. This solves the technical problem in the prior art that propagation network reconstruction relies on the exact infection time of each node and yields relatively accurate results only with a large amount of sample data.

Example 2

The propagation network reconstruction system provided in the second embodiment of the present invention includes:

A reconstruction module, configured to generate, for each node v_i, a directed edge from each parent node in the final parent node set F corresponding to v_i to the node v_i, so as to obtain a propagation network topology to reconstruct the propagation network structure;

the final parent node calculation module comprises:

Judging whether the queue Q is an empty set or not;

Calculating the scoring function f(T, v_i);

In summary, with the propagation network reconstruction system of the above embodiment of the present application, the network can be reconstructed from the node states at the final observation alone, without requiring the exact infection time of each node or the infection state of the nodes during each diffusion process. Furthermore, the technical solution of the application does not need to estimate probabilities; it converts the probability problem into a sampling problem and removes the dependence on the number of samples, so that more accurate results can be obtained from a small number of data samples. This solves the technical problem in the prior art that propagation network reconstruction relies on the exact infection time of each node and yields relatively accurate results only with a large amount of sample data.

Furthermore, an embodiment of the present invention proposes a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, implements the steps of the method in the above-mentioned embodiment.

Furthermore, an embodiment of the present invention also proposes a data processing apparatus including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method in the above embodiment when executing the program.

Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium may even be paper or other suitable medium upon which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.

It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques, which are well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, Programmable Gate Arrays (PGAs), Field Programmable Gate Arrays (FPGAs), and the like.

In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A method of propagation network reconstruction, comprising:

Judging whether the queue Q is an empty set or not;

Calculating the scoring function f(T, v_i);

2. The propagation network reconstruction method according to claim 1, wherein the calculation formula of the scoring function f(C, v_i) is:

[Formula image not reproduced in the source text]

3. The propagation network reconstruction method according to claim 2, wherein the calculation formula of the score upper limit function is:

[Formula image not reproduced in the source text]

4. The propagation network reconstruction method according to claim 1, wherein the step of calculating a scoring function f(v_i, v_y) of every two nodes in the network according to the historical observation dataset to obtain a scoring set comprises: calculating the scoring function f(v_i, v_y) of every two nodes in the network to obtain n(n-1)/2 scores, which form the scoring set, denoted {f_{1,2}, f_{1,3}, …, f_{i,j}, …, f_{n-1,n}}.

5. The method of propagation network reconstruction of claim 4, wherein the step of calculating candidate parent nodes for each node in the score set comprises:

6. A propagation network reconstruction system, the system comprising:

the final parent node calculation module comprises:

Judging whether the queue Q is an empty set or not;

Calculating the scoring function f(T, v_i);

7. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements a propagation network reconstruction method as claimed in any one of claims 1-5.

8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the propagation network reconstruction method according to any one of claims 1-5 when the program is executed by the processor.