CN112347373A

CN112347373A - Role recommendation method based on open source software mail network

Info

Publication number: CN112347373A
Application number: CN202011265544.2A
Authority: CN
Inventors: 宣琦; 谢昀苡; 张剑
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2020-11-13
Filing date: 2020-11-13
Publication date: 2021-02-09
Anticipated expiration: 2040-11-13
Also published as: CN112347373B

Abstract

The invention provides a role recommendation method based on an open source software mail network, comprising the following steps: S1: constructing an undirected weighted network from the mail data of an open source software project; S2: randomly deleting a portion of the edges of the network constructed in S1 to serve as test samples, using the edges remaining in the network as training samples, and constructing a dynamic sequence slicing network from them; S3: generating the features of each node by applying a time sequence biased walking algorithm to the dynamic sequence slicing network, and then obtaining the features of each edge by averaging the features of its two end nodes; S4: learning from the training samples with a logistic regression classifier and predicting the test samples. The invention can effectively recommend roles in an open source software project, and its recommendation accuracy is significantly higher than that of algorithms that do not consider the time sequence information and role information of the mails in the open source software project.

Description

Role recommendation method based on open source software mail network

Technical Field

The invention relates to the field of link prediction in a complex network, in particular to a role recommendation method based on an open source software mail network.

Background

Open source software has developed rapidly over the past few years and has attracted a large number of users to open source software communities. Active participation by developers and users is critical to the success of an open source software project. To promote the sustainable development of open source software projects, developers need to maintain the project code; it is equally vital to motivate, attract and retain users and developers. However, most previous research has focused on project code maintenance and has neglected the importance of users in the development of open source software projects. To preserve the quality of project code, many code repository-based methods generate ranked lists of developers and recommend the top-ranked developers to help perform code changes. It is not difficult to imagine that the recommended developers can maintain the stability of the project code. Developers contribute to the sustainable development of the project, but attention must also be paid to the users of the software, because users provide feedback to developers, sustain the development of open source software projects, and are also potential developers, meaning that they may one day contribute to open source software by submitting code.

Users and developers who participate in open source software projects must overcome a number of obstacles that hinder their further contribution. Since the mailing list is a public communication channel in the open source software community, users and developers often interact through it: people who lack understanding or guidance post questions, request help, or try to resolve their confusion using existing information in the mailing list. However, because of the large amount of information, finding what they need is not easy, and replies that offer no guidance, or questions that go unanswered, may leave them without useful assistance. These obstacles can cause users and developers to give up further contributions to the open source software project. Recommending experienced people to the developers and users who need help can therefore avoid this.

The method for recommending reviewers of Pull Requests in open source software development disclosed in Chinese patent application No. CN202010338549.7 considers four factors, namely the interest correlation between the reviewer and the content of the Pull Request, the reviewer's activeness, social relationship influence, and file path correlation, and weights the four factors individually by a Bayesian personalized ranking method, so that suitable code reviewers are recommended for the Pull Request; that method relies on manual feature extraction for developers in open source software. The present invention focuses on the mail information of the open source software project rather than the code repository and has a wider scope: it considers not only the developers in open source software but also the users of the software. In addition, the present method models the mail data of the open source software project at the network level and considers the embedding of nodes in the network, so that more of the important interactions between users and developers in the open source software project can be found, which facilitates role recommendation for participants who need help.

Very little literature addresses role recommendation specific to open source software. Canfora et al. propose an unsupervised approach that mines data from the mailing lists and code repositories of open source software projects to make role recommendations. They focus on the code repository of the open source software project and calculate a score between developers and users, so that appropriate people can be recommended to help users and developers. However, this is merely an empirical study and is not a universally applicable approach.

The currently popular approach is to model the data as a network, convert the nodes in the network into low-dimensional vector representations (the vectors represent the features of the network nodes) by a graph embedding method, and thereby convert the role recommendation problem into a link prediction task in machine learning. The Node2vec method proposed by Grover et al. is an easy-to-apply walk-based method that combines depth-first and breadth-first walks and represents the nodes of a network as low-dimensional vectors, so that the structural features of the nodes are extracted and role recommendation can be performed more accurately.

Disclosure of Invention

The invention aims to solve the problems in the prior art and provides a role recommendation method based on an open source software mail network, which helps project participants (users and developers) who need help by recommending other participants of the open source software project to them, thereby benefiting the sustainable development of the open source software project.

The invention studies the recommendation of developers and users to participants who need help in an open source software community. These recommendations can support participants when they encounter difficulties, which is critical to the sustainable development of the open source software project. Further, the invention models the mail data of open source software as a dynamic sequence slicing network, a new temporal network that captures the evolution of the interaction between users and developers. In addition, an interaction-based time sequence biased walking algorithm is provided; the algorithm integrates the temporal information, the structural information and the participant identity information of the open source software mail network, and uses an embedding algorithm to represent developers and users effectively for role recommendation.

In order to achieve the purpose, the invention provides the following scheme:

The invention provides a role recommendation method based on an open source software mail network, characterized by comprising the following steps:

S1: constructing an undirected weighted network from the mail data of an open source software project;

S2: randomly deleting a portion of the edges of the network constructed in S1 to serve as test samples, using the edges remaining in the network as training samples, and constructing a dynamic sequence slicing network G' from them;

S3: generating the features of each node on the dynamic sequence slicing network G' with a time sequence biased walking algorithm, and then obtaining the features of each edge by averaging the features of its two end nodes;

S4: learning from the training samples with a logistic regression classifier, and predicting the test samples.
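The random hold-out in step S2 can be sketched as follows. This is a minimal illustration, assuming edges are stored as (u, v, weight) tuples; the function name `split_edges` and the 20% test ratio are hypothetical choices, since the source does not state the proportion of deleted edges.

```python
import random

def split_edges(edges, test_ratio=0.2, seed=0):
    """Randomly hold out a share of the edges as test samples (step S2).

    test_ratio and seed are illustrative assumptions, not values from the source.
    """
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    # The first n_test shuffled edges become test samples, the rest training samples.
    return shuffled[n_test:], shuffled[:n_test]

edges = [("a", "b", 3), ("b", "c", 1), ("a", "c", 2), ("c", "d", 5), ("d", "e", 1)]
train_edges, test_edges = split_edges(edges)
```

The training edges would then be used to build the dynamic sequence slicing network G', while the test edges stay hidden until prediction.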

Preferably, in the undirected weighted network constructed in step S1:

the roles in the mail data are represented as nodes in the network, the mail interactions between roles as edges between two nodes, and the number of mail interactions as the weight of the edge;

the undirected weighted network is represented by G = (V, E, W), where V denotes the n nodes in the network, E denotes the edge set, W is the weight matrix of the edges, and W_ij is an element of the matrix W that represents the weight between node i and node j, i.e., the number of mail exchanges between the two nodes.
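A minimal sketch of building G = (V, E, W) from mail records, assuming each record is a (sender, receiver) pair. The function name `build_weighted_network` and the decision to skip self-addressed mail are assumptions made for illustration.

```python
from collections import Counter

def build_weighted_network(mails):
    """Count mail exchanges per unordered pair of participants.

    Returns (nodes, weights), where weights[(i, j)] plays the role of W_ij
    and keys are stored with i < j because the network is undirected.
    """
    weights = Counter()
    for sender, receiver in mails:
        if sender != receiver:  # skipping self-addressed mail is an assumption
            weights[tuple(sorted((sender, receiver)))] += 1
    nodes = {name for pair in weights for name in pair}
    return nodes, weights

mails = [("alice", "bob"), ("bob", "alice"), ("alice", "carol")]
V, W = build_weighted_network(mails)
```

Here the two messages exchanged between alice and bob collapse into one undirected edge of weight 2.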

Preferably, the specific steps of constructing the dynamic sequence slice network G' in step S2 are as follows:

the undirected weighted network G is divided according to a given time interval: taking one month as the time interval, G is divided into a number of subgraphs {G₁, G₂, G₃, ..., G_i, ...}; the subgraphs are numbered and arranged in ascending order of time, and the same nodes in adjacent subgraphs are connected in sequence.

Further, each edge in the dynamic sequence slicing network G' in S2 is represented by e = (u, v, w, t), where u is the start node of the edge, i.e., src(e) = u; v is the end node of the edge, i.e., dst(e) = v; w is the weight of the edge, i.e., W(e) = w; and t is the temporal reachability of the edge, i.e., T(e) = t.
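One possible way to assemble the dynamic sequence slicing network is sketched below. It assumes mail records carry a month index starting at 1, represents each node inside slice i as the pair (name, i), and gives the cross-slice links a weight of 1; the link weight and the name `build_slicing_network` are assumptions, as the source does not specify them.

```python
from collections import defaultdict

def build_slicing_network(timed_mails):
    """Build edge tuples (u, v, w, t) from (sender, receiver, month_index) records."""
    counts = defaultdict(int)
    for a, b, month in timed_mails:
        counts[(month, *sorted((a, b)))] += 1

    edges, present = [], defaultdict(set)
    for (month, a, b), w in counts.items():
        u, v = (a, month), (b, month)
        # An undirected edge inside one slice is stored in both directions, t = 0.
        edges += [(u, v, w, 0), (v, u, w, 0)]
        present[a].add(month)
        present[b].add(month)

    for name, months in present.items():
        for m in months:
            if m + 1 in months:  # same node appears in two adjacent slices
                edges.append(((name, m), (name, m + 1), 1, 1))   # weight 1 is assumed
                edges.append(((name, m + 1), (name, m), 1, -1))
    return edges

g_prime = build_slicing_network([("a", "b", 1), ("a", "b", 1), ("a", "c", 2)])
```

In this toy input, node a is active in months 1 and 2, so its two slice-local copies are linked forward (t = 1) and backward (t = -1).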

Preferably, the time sequence biased walking algorithm in S3 is a second-order neighbor sampling strategy for selecting reachable edges to generate an edge sequence; the strategy includes static edge weight information, the structure transition probability P_S, the timing transition probability P_T and the role-based transition probability P_R. The time sequence biased walking algorithm specifically comprises the following steps:

step 1, setting the maximum number of walks and the walk length;

step 2, randomly selecting a node in the dynamic sequence slicing network G' as the initial node;

step 3, walking according to the calculated transition probability P(e), thereby obtaining a series of walk sequences;

step 4, applying the Skip-Gram model from natural language processing to the walk sequences to obtain node features;

step 5, obtaining the features of each edge by averaging the features of its two end nodes.
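Step 5 amounts to an element-wise mean of two node vectors. The sketch below assumes the node features are already available as plain lists of floats; the toy vectors are made up for illustration and are not Skip-Gram output.

```python
def edge_feature(node_vectors, u, v):
    """Return the element-wise mean of the embeddings of the two end nodes."""
    return [(a + b) / 2 for a, b in zip(node_vectors[u], node_vectors[v])]

# Toy embeddings chosen for readability, not produced by any model.
node_vectors = {"u": [1.0, 3.0], "v": [3.0, 5.0]}
feat = edge_feature(node_vectors, "u", "v")
```

Because the mean is symmetric, the edge (u, v) and the edge (v, u) receive the same feature vector, which suits an undirected network.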

Further, the reachable edge is defined as follows:

for a node u in subgraph G_i, define η(u) = i; the temporal reachability of an edge can then be defined as T(e) = η(v) − η(u) ∈ {−1, 0, 1}, where u is the start node of the edge and v is the end node of the edge; for the dynamic sequence slicing network G', the reachable edge set of a node v is defined as L_t(v) = {e | src(e) = v, T(e) ≥ 0}, that is, the start node of the edge is v and the temporal reachability of the edge is required to be greater than or equal to 0.
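The definition of L_t(v) translates almost directly into a filter over the edge list. The sketch assumes nodes are written as (name, slice_index) pairs so that η can be read off the node itself; that representation is an assumption for illustration.

```python
def eta(node):
    """Slice index of a slice-local node written as (name, i)."""
    return node[1]

def reachable_edges(edges, v):
    """L_t(v) = {e | src(e) = v, T(e) >= 0}, with T(e) = eta(dst(e)) - eta(src(e))."""
    return [e for e in edges if e[0] == v and eta(e[1]) - eta(e[0]) >= 0]

edges = [
    (("a", 2), ("b", 2), 3),  # same slice, T = 0
    (("a", 2), ("a", 3), 1),  # next slice, T = 1
    (("a", 2), ("a", 1), 1),  # previous slice, T = -1, not reachable
]
L_t = reachable_edges(edges, ("a", 2))
```

The backward link into slice 1 is filtered out, so a walk can only stay in its current month or move forward in time.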

Further, the structure transition probability P_S is calculated as follows:

if the walk currently stays at node c and the previously visited node is t, then for any reachable edge e ∈ L_t(c) with dst(e) = x, the structure transition probability P_S is:

P_S(e)＝ψ_S(e)·W(e)

where d_tx ∈ {0, 1, 2} denotes the shortest distance between node t and node x, ψ_S(e) is the structure search bias of edge e, r is the return parameter and q is the in-out parameter; the parameters q and r jointly determine the search direction of the walk and also control how fast the walk explores and leaves the neighborhood of the starting vertex, and W(e) is the weight of edge e.
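The piecewise form of ψ_S(e) is not reproduced in this text. Since Example 1 states that the structure transition probability is consistent with Node2vec, the sketch below assumes the Node2vec search bias, with r in the place of Node2vec's return parameter: 1/r when d_tx = 0, 1 when d_tx = 1, and 1/q when d_tx = 2. The function names are hypothetical.

```python
def psi_s(d_tx, r, q):
    """Structure search bias, assumed to follow the Node2vec form."""
    if d_tx == 0:
        return 1.0 / r   # stepping back to the previous node t
    if d_tx == 1:
        return 1.0       # x is also a neighbor of t
    if d_tx == 2:
        return 1.0 / q   # moving away from t
    raise ValueError("d_tx must be 0, 1 or 2")

def p_s(d_tx, weight, r=1.0, q=1.0):
    """Unnormalized P_S(e) = psi_S(e) * W(e)."""
    return psi_s(d_tx, r, q) * weight

back = p_s(0, 4, r=2.0, q=0.5)
away = p_s(2, 4, r=2.0, q=0.5)
```

With r = 2 and q = 0.5 the walk is discouraged from returning and encouraged to move outward, which corresponds to a more depth-first exploration.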

Further, the timing transition probability P_T is calculated as follows:

where ψ_T(e) is the timing search bias of edge e and α is the timing bias parameter; the parameter α, with 0.1 ≤ α ≤ 0.9, determines whether the walk stays in the current subgraph: when α is smaller, the walk is more inclined to stay in the current subgraph; when α is larger, the walk is more inclined to move to the next subgraph; e' is an edge in the reachable edge set L_t(c) of node c, and ψ_T(e') denotes the timing search bias of edge e'.
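The formula for P_T is not reproduced in this text. The sketch below is one reading that matches the described behavior of α and the normalization over e' ∈ L_t(c): it assumes ψ_T(e) = α when T(e) = 1 and ψ_T(e) = 1 − α when T(e) = 0, and P_T(e) = ψ_T(e) / Σ ψ_T(e'). Both the piecewise form and the function names are assumptions.

```python
def psi_t(t_e, alpha):
    """Assumed timing search bias: alpha for a cross-slice edge, 1 - alpha otherwise."""
    return alpha if t_e == 1 else 1.0 - alpha

def p_t(reachable_t, alpha):
    """Normalize the timing bias over the reachable edges of the current node.

    reachable_t lists T(e) for every edge e in L_t(c), so each value is 0 or 1.
    """
    biases = [psi_t(t, alpha) for t in reachable_t]
    total = sum(biases)
    return [b / total for b in biases]

stay_vs_move = p_t([0, 1], alpha=0.9)
```

With α = 0.9 the cross-slice edge receives the larger share of probability, so the walk tends to move on to the next month.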

Further, the role-based transition probability P_R is divided into unbiased transition and biased transition. If the walk currently stays at node c and the previously visited node is t, then for any reachable edge e ∈ L_t(c) with dst(e) = x, the role-based transition probability P_R is:

the specific calculation method is as follows:

1) unbiased transition:

unbiased transition means that each reachable edge has an equal probability of being selected;

2) biased transition:

where ω(x) denotes the true identity of node x, e.g. user or developer; ψ_R(e) is the role search bias of edge e; β is the role bias parameter, with 0.1 ≤ β ≤ 0.9, which determines whether the walk tends toward nodes of the same type or of different types, i.e. β controls the communication tendency of the nodes: when β is larger, the walk is more inclined to move toward nodes of the same type; when β is smaller, the walk is more inclined to move toward nodes of different types; e denotes the edge of the next transition, and ψ_R(e) denotes its role search bias; e' is an edge in the reachable edge set L_t(c) of node c, and ψ_R(e') denotes the role search bias of edge e'.
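The formulas for P_R are not reproduced in this text. The sketch below assumes that the unbiased case is uniform over L_t(c), and that in the biased case ψ_R(e) = β when the target node x has the same role as the current node c and 1 − β otherwise, normalized over the reachable edges. Comparing against the current node rather than some other reference node is an assumption, as is the function name `p_r`.

```python
def p_r(current_role, target_roles, beta=None):
    """Role-based transition probabilities over the reachable edges.

    target_roles lists omega(x) for the end node of every edge in L_t(c).
    beta=None selects the unbiased transition (roles unknown).
    """
    if beta is None:
        biases = [1.0] * len(target_roles)
    else:
        # Assumed bias: beta for a same-role target, 1 - beta for a different role.
        biases = [beta if role == current_role else 1.0 - beta
                  for role in target_roles]
    total = sum(biases)
    return [b / total for b in biases]

unbiased = p_r("user", ["user", "developer"])
biased = p_r("user", ["user", "developer"], beta=0.75)
```

With β = 0.75 a user node is three times as likely to step to another user as to a developer, all else being equal.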

Further, the transition probability P(e) is calculated as follows:

the structure transition probability P_S, the timing transition probability P_T and the role-based transition probability P_R above are each normalized, and the final transition probability is obtained as:

P(e)＝P_S(e)P_T(e)P_R(e)
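Combining the three factors and drawing the next edge can be sketched as follows. The sketch assumes the three inputs are aligned lists over the same reachable edge set; `combined_probability` and `sample_edge` are hypothetical names, and sampling with `random.choices` is an illustrative choice that only needs relative weights.

```python
import random

def normalize(values):
    """Scale a list of non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def combined_probability(p_s, p_t, p_r):
    """P(e) = P_S(e) * P_T(e) * P_R(e), each factor normalized separately first."""
    return [s * t * r
            for s, t, r in zip(normalize(p_s), normalize(p_t), normalize(p_r))]

def sample_edge(edges, probs, rng):
    """Draw the next edge of the walk in proportion to P(e)."""
    return rng.choices(edges, weights=probs, k=1)[0]

probs = combined_probability([1, 1], [1, 3], [1, 1])
next_edge = sample_edge(["e1", "e2"], probs, random.Random(0))
```

In this example only the timing factor differs between the two edges, so the second edge is three times as likely to be chosen as the first.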

The advantages of the invention are as follows. First, the time sequence information of the mail data in the open source software project is fully utilized, and the mail data is modeled as a dynamic sequence slicing network. The dynamic sequence slicing network can reflect the evolution of the network structure and is better suited to dynamic data sets than an ordinary static network. Second, on the basis of the dynamic sequence slicing network, a time sequence biased walking algorithm is provided, which makes full use of the topological characteristics and time sequence information of the mail network and the identity information of the project participants. Compared with the prior art, the role recommendation method can effectively recommend roles in the open source software project, and its recommendation accuracy is significantly higher than that of algorithms that do not consider the time sequence information and role information of the mails in the open source software project.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a dynamic sequence slicing network G' of the present invention;

FIG. 2 is a flow chart of the present invention.

Detailed Description

Reference will now be made in detail to various exemplary embodiments of the invention. The detailed description should not be construed as limiting the invention, but as a more detailed description of certain aspects, features and embodiments of the invention.

It is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. Further, for numerical ranges in this disclosure, it is understood that each intervening value, between the upper and lower limit of that range, is also specifically disclosed. Every smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in a stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although only preferred methods and materials are described herein, any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention. All documents mentioned in this specification are incorporated by reference herein for the purpose of disclosing and describing the methods and/or materials associated with the documents. In case of conflict with any incorporated document, the present specification will control.

It will be apparent to those skilled in the art that various modifications and variations can be made in the specific embodiments of the present disclosure without departing from the scope or spirit of the disclosure. Other embodiments will be apparent to those skilled in the art from consideration of the specification. The specification and examples are exemplary only.

As used herein, the terms "comprising," "including," "having," "containing," and the like are open-ended terms that mean including, but not limited to.

The "parts" in the present invention are all parts by mass unless otherwise specified.

Example 1

This technical scheme provides, specifically for open source software projects, the definition of the dynamic sequence slicing network, the definition of reachable edges, and the time sequence biased walking algorithm. The structure transition probability is consistent with the Node2vec algorithm; the main innovations of the algorithm are the timing transition probability and the role-based transition probability.

The invention provides a role recommendation method based on an open source software mail network, which comprises the following steps:

Further, in step S1, the roles in the mail data are represented as nodes in the network, the mail interactions between roles as edges between two nodes, and the number of mail interactions as the weight of the edge.

The undirected weighted network is represented by G = (V, E, W), where V denotes the n nodes in the network, E denotes the edge set, W is the weight matrix of the edges, and W_ij is an element of the matrix W that represents the weight between node i and node j, i.e., the number of mail exchanges between the two nodes.

In step S2, a certain proportion of the edges in the original network are hidden as test samples, the edges in the remaining network are used as training samples, and the edges in the remaining network are built into a dynamic sequence slicing network G'. The mail data in the open source software project carries time information, so the undirected weighted network G can be divided according to a given time interval: taking one month as the time interval, G is divided into a number of subgraphs {G₁, G₂, G₃, ..., G_i, ...}; the subgraphs are numbered and arranged in ascending order of time, and the same nodes in adjacent subgraphs are connected in sequence. FIG. 1 is an example of a dynamic sequence slicing network G'.

Further, in step S2, each edge in the dynamic sequence slicing network G' is represented by e = (u, v, w, t), where u is the start node of the edge, i.e., src(e) = u; v is the end node of the edge, i.e., dst(e) = v; w is the weight of the edge, i.e., W(e) = w; and t is the temporal reachability of the edge, i.e., T(e) = t.

Further, in step S3, the time sequence biased walking algorithm is designed on the basis of the above definitions. The time sequence biased walking algorithm is a second-order neighbor sampling strategy used to select reachable edges and thereby generate an edge sequence. The strategy includes static edge weight information, the structure transition probability P_S, the timing transition probability P_T and the role-based transition probability P_R. The specific steps of the time sequence biased walking algorithm are as follows:

step 1, setting the maximum number of walks and the walk length;

Further, the reachable edge is defined as follows:

for a node u in subgraph G_i, define η(u) = i. The temporal reachability of an edge can then be defined as T(e) = η(v) − η(u) ∈ {−1, 0, 1}, where u is the start node of the edge and v is the end node of the edge. Further, for the dynamic sequence slicing network G', the reachable edge set of a node v may be defined as L_t(v) = {e | src(e) = v, T(e) ≥ 0}, where v is the start node of the edge and the temporal reachability of the edge is required to be greater than or equal to 0.

If the walk currently stays at node c and the previously visited node is t, then for any reachable edge e ∈ L_t(c) with dst(e) = x, the structure transition probability P_S is:

P_S(e)＝ψ_S(e)·W(e)

where ψ_T(e) is the timing search bias of edge e and α is the timing bias parameter, with 0.1 ≤ α ≤ 0.9, which determines the temporal search direction: whether to stay in the current subgraph or move to the next subgraph. If α is small, the walk is more inclined to stay in the current subgraph; otherwise the walk is more inclined toward edges appearing in future subgraphs. e' is an edge in the reachable edge set L_t(c) of node c, and ψ_T(e') denotes the timing search bias of edge e'. The timing transition probability helps to explore how node interactions change over different time periods as the network evolves.

Further, the role-based transition probability P_R can be divided into unbiased transition and biased transition. There are two types of roles in open source software: users and developers. Unbiased transition is used when the true identity of a role is unknown, and biased transition is used when the true identity of a role is known. Experimental results with biased transition tend to be better than results with unbiased transition.

The unbiased transition is:

unbiased transition means that every reachable edge has an equal probability of being selected, i.e. each edge e in L_t(c) has the same probability of being sampled.

The biased transition is:

for each edge e in L_t(c), the information about dst(e) = x, that is, the true identity of node x, needs to be considered, where ω(x) denotes the true identity of node x (e.g., user or developer). ψ_R(e) is the role search bias of edge e; β is the role bias parameter, with 0.1 ≤ β ≤ 0.9, which determines whether the walk tends toward nodes of the same type or of different types, i.e. β controls the communication tendency of the nodes: if β is larger, the walk is more likely to traverse nodes of the same type as the initial node; otherwise the walk is encouraged to explore nodes of different types. e denotes the edge of the next transition, e' is an edge in the reachable edge set L_t(c) of node c, and ψ_R(e') denotes the role search bias of edge e'.

Further, the transition probabilities are finally normalized respectively, and the final transition probabilities are obtained as follows:

P(e)＝P_S(e)P_T(e)P_R(e)

Further, in step S4, the logistic regression classifier is used to learn from the training samples and then to predict the test samples. FIG. 2 gives the overall flow chart.

The invention uses mail data from open source software projects for role recommendation. Table 1 lists the main data of the open source software projects collected for the test, including Project, Users, Developers, Email exchanges and Timespan (month).

TABLE 1

Experiments are carried out with four algorithms: LINE, DeepWalk, Node2vec and the time sequence biased walk. AUC is used as the evaluation index for the recommendation results of each algorithm; an algorithm with a better recommendation effect has a larger AUC value. As can be seen from Table 2, the AUC value of the algorithm of the invention is the best.
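For reference, AUC can be computed as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties counted as one half. The sketch below assumes the positives are the held-out test edges and the negatives are node pairs without an edge; how negatives are drawn is not stated in the source, and the scores shown are invented for illustration.

```python
def auc(pos_scores, neg_scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))

# Invented classifier scores, not experimental results.
score = auc([0.9, 0.8, 0.4], [0.5, 0.3])
```

Five of the six positive-negative pairs are ordered correctly here, so the value is 5/6; a classifier that ranks at random would score about 0.5.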

TABLE 2

The above-described embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solutions of the present invention can be made by those skilled in the art without departing from the spirit of the present invention, and the technical solutions of the present invention are within the scope of the present invention defined by the claims.

Claims

1. A role recommendation method based on an open source software mail network is characterized in that: the method comprises the following steps:

2. The role recommendation method based on the open source software mail network according to claim 1, characterized in that: in the undirected weighted network constructed in step S1:

3. The role recommendation method based on the open source software mail network according to claim 1, characterized in that: the specific steps of constructing the dynamic sequence slice network G' in step S2 are as follows:

4. The role recommendation method based on the open source software mail network according to claim 3, characterized in that: each edge in the dynamic sequence slicing network G' is represented by e = (u, v, w, t), where u is the start node of the edge, i.e., src(e) = u; v is the end node of the edge, i.e., dst(e) = v; w is the weight of the edge, i.e., W(e) = w; and t denotes the temporal reachability of the edge, i.e., T(e) = t.

5. The role recommendation method based on the open source software mail network according to claim 1, characterized in that: the time sequence biased walking algorithm in S3 is a second-order neighbor sampling strategy for selecting reachable edges to generate an edge sequence, the strategy including static edge weight information, the structure transition probability P_S, the timing transition probability P_T and the role-based transition probability P_R; the time sequence biased walking algorithm specifically comprises the following steps:

step 1, setting the maximum number of walks and the walk length;

6. The role recommendation method based on the open source software mail network according to claim 5, wherein the reachable edges are defined as follows:

for a node u in subgraph G_i, define η(u) = i; the temporal reachability of an edge can then be defined as T(e) = η(v) − η(u) ∈ {−1, 0, 1}, where u is the start node of the edge and v is the end node of the edge; for the dynamic sequence slicing network G', the reachable edge set of a node v is defined as L_t(v) = {e | src(e) = v, T(e) ≥ 0}, that is, the start node of the edge is v and the temporal reachability of the edge is required to be greater than or equal to 0.

7. The role recommendation method based on the open source software mail network according to claim 5, characterized in that: the structure transition probability P_S is calculated as follows:

P_S(e)＝ψ_S(e)·W(e)

8. The role recommendation method based on the open source software mail network according to claim 5, characterized in that: the timing transition probability P_T is calculated as follows:

where ψ_T(e) is the timing search bias of edge e and α is the timing bias parameter; the parameter α, with 0.1 ≤ α ≤ 0.9, determines whether the walk stays in the current subgraph: when α is smaller, the walk is more inclined to stay in the current subgraph; when α is larger, the walk is more inclined to move to the next subgraph; e' is an edge in the reachable edge set L_t(c) of node c, and ψ_T(e') denotes the timing search bias of edge e'.

9. The role recommendation method based on the open source software mail network according to claim 5, characterized in that: the role-based transition probability P_R is divided into unbiased transition and biased transition; if the walk currently stays at node c and the previously visited node is t, then for any reachable edge e ∈ L_t(c) with dst(e) = x, the role-based transition probability P_R is:

the specific calculation method is as follows:

1) unbiased transition:

2) biased transition:

where ω(x) denotes the true identity of node x, e.g. user or developer; ψ_R(e) is the role search bias of edge e; β is the role bias parameter, with 0.1 ≤ β ≤ 0.9, which determines whether the walk tends toward nodes of the same type or of different types, i.e. β controls the communication tendency of the nodes: when β is larger, the walk is more inclined to move toward nodes of the same type; when β is smaller, the walk is more inclined to move toward nodes of different types; e denotes the edge of the next transition, and ψ_R(e) denotes its role search bias; e' is an edge in the reachable edge set L_t(c) of node c, and ψ_R(e') denotes the role search bias of edge e'.

10. The role recommendation method based on the open source software mail network as claimed in claim 5, wherein the transition probability P (e) is calculated by:

P(e)＝P_S(e)P_T(e)P_R(e)。