CN109474756B

CN109474756B - Telecommunication anomaly detection method based on collaborative network representation learning

Info

Publication number: CN109474756B
Application number: CN201811367747.5A
Authority: CN
Inventors: 张震; 刘冠男; 张亮; 林浩; 吴俊杰; 马欢; 缪亚男
Original assignee: National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2018-11-16
Filing date: 2018-11-16
Publication date: 2020-09-22
Anticipated expiration: 2038-11-16
Also published as: CN109474756A

Abstract

The invention discloses a telecommunication anomaly detection method based on collaborative network representation learning, belonging to the field of data mining and machine learning. First, an xgboost classifier is trained and used to estimate the fraud-class probability of each CDR record, forming a signaling data set to be detected. A bipartite graph P of calling and called users is then extracted, suspected fraudulent calling nodes are selected from the signaling data set according to their scores to generate a seed node set Z, and any two calling parties that share a common called neighbor are added to a collaborative network G. A candidate called node set B is expanded on the communication bipartite graph P, called users that do not meet the conditions are removed, and the retained called nodes are updated into a set B'. The seed node set is then expanded and updated into Z', the collaborative network is deduplicated and updated into G', dimensionality reduction yields embedding vectors for modeling and prediction, and the N nodes with the highest anomaly scores are output as the detection result. The invention ensures the quality of the generated collaborative network, improves computation speed, and can adapt to different data characteristics.

Description

Telecommunication anomaly detection method based on collaborative network representation learning

Technical Field

The invention belongs to the fields of data mining, machine learning, business intelligence and the like, relates to the problem of anomaly detection in large-scale and sparse data, and particularly relates to a telecommunication anomaly detection method based on collaborative network representation learning.

Background

In recent years, telecommunication fraud cases have occurred frequently in China, seriously threatening people's property and social stability. Accurately identifying the perpetrators of telecom fraud with data mining methods such as classification and anomaly detection is therefore of great practical significance to regulators and to society as a whole.

Traditional methods mainly model the behavior of a single fraud object, for example by classifying signaling data, and this local view severely limits the representational power of the model. Telecommunication fraud is strongly group-based and cooperative: several fraudulent callers often call the same called party, whereas normal callers share few common neighbors. This cooperativity distinguishes normal callers from fraudulent callers well and can markedly improve fraud recognition, yet previous methods have rarely taken it into account.

In fact, this phenomenon is widespread in practical application scenarios such as network intrusion detection and the identification of paid posters (the so-called "water army") in social networks. There is therefore a need for an anomaly detection method that fuses individual behavior and group cooperativity.

Disclosure of Invention

The invention provides a telecommunication anomaly detection method based on collaborative network representation learning, which addresses the difficulty of detecting telecommunication fraud in large-scale sparse data while taking into account the group nature and cooperativity of fraudulent behavior.

The method comprises the following specific steps:

Step one, collecting CDR data of a telecommunication network over a period of time, manually selecting a portion of the CDR data carrying fraud labels, and training an xgboost classifier that predicts fraud at the signaling level from the static attributes of the calling-side call records.

The static attributes include features such as call duration, called location, calling location and call waiting duration;

Step two, inputting each CDR record into the xgboost classifier to obtain the probability that it belongs to the fraud class, taking this probability as the fraud score attribute of the CDR record, and forming the signaling data set to be detected from all CDR records carrying the score attribute;

Step three, extracting the calling and called users of each call record in the signaling data set to be detected, generating an undirected edge between the corresponding calling and called nodes, and forming the communication bipartite graph P from the edges between all calling and called users;

Step four, selecting from the signaling data set, according to the fraud scores produced by the xgboost classifier, the K calling parties with the highest average scores as suspected fraudulent calling nodes, generating the seed node set Z, and initializing an empty set as the calling collaborative network G;
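Steps three and four can be illustrated with a minimal sketch. It assumes each scored CDR record is reduced to a hypothetical `(caller, callee, fraud_score)` tuple; the function names and the toy records are illustrative and do not come from the patent.

```python
from collections import defaultdict


def build_bipartite_graph(cdrs):
    """Return the bipartite graph P as two adjacency maps.

    cdrs: iterable of (caller, callee, fraud_score) tuples (assumed layout).
    """
    caller_to_callees = defaultdict(set)
    callee_to_callers = defaultdict(set)
    for caller, callee, _score in cdrs:
        # One undirected caller-callee edge per distinct pair.
        caller_to_callees[caller].add(callee)
        callee_to_callers[callee].add(caller)
    return caller_to_callees, callee_to_callers


def select_seed_callers(cdrs, k):
    """Pick the k callers with the highest average CDR fraud score."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for caller, _callee, score in cdrs:
        totals[caller] += score
        counts[caller] += 1
    average = {c: totals[c] / counts[c] for c in totals}
    # Ties are broken by caller id so the result is deterministic.
    ranked = sorted(average, key=lambda c: (-average[c], c))
    return set(ranked[:k])


# Toy records: a1 and a2 look suspicious, a3 does not.
cdrs = [
    ("a1", "b1", 0.9), ("a1", "b2", 0.7),
    ("a2", "b1", 0.8),
    ("a3", "b3", 0.1), ("a3", "b2", 0.2),
]
caller_to_callees, callee_to_callers = build_bipartite_graph(cdrs)
seeds = select_seed_callers(cdrs, k=2)
```

Keeping both adjacency maps makes the later expansion steps (caller to callee, then callee back to caller) simple dictionary lookups.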

Step five, analyzing the calling nodes in the current calling collaborative network G and the current seed node set Z; if any two calling users have both dialed the same called user, that is, they share a common called neighbor on the bipartite graph, establishing an undirected edge between the two calling nodes, taking the number of common neighbors as the edge weight, and updating the collaborative network G;

Updating the collaborative network G means adding the two calling nodes that share a common called neighbor, together with their edge and its weight, to the current calling collaborative network G.

The initial value of the current calling collaborative network G is empty, and the initial value of the current seed node set Z is the K calling parties with the highest average scores selected in step four.
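The edge construction of step five can be sketched as follows. This is an illustrative sketch, not the patented implementation: it assumes the bipartite graph is available as a hypothetical `caller_to_callees` mapping and represents G as a dictionary from caller pairs to weights.

```python
from collections import defaultdict
from itertools import combinations


def build_collaborative_network(seed_callers, caller_to_callees):
    """Return {(caller_u, caller_v): number of common called neighbours}."""
    # Invert the graph, restricted to the seed callers.
    callee_to_seeds = defaultdict(set)
    for caller in seed_callers:
        for callee in caller_to_callees.get(caller, ()):
            callee_to_seeds[callee].add(caller)

    # Every callee shared by two callers adds 1 to that pair's edge weight.
    edges = defaultdict(int)
    for callers in callee_to_seeds.values():
        for u, v in combinations(sorted(callers), 2):
            edges[(u, v)] += 1
    return dict(edges)


caller_to_callees = {
    "a1": {"b1", "b2"},
    "a2": {"b1", "b2", "b3"},
    "a3": {"b3"},
    "a4": {"b9"},
}
G = build_collaborative_network({"a1", "a2", "a3", "a4"}, caller_to_callees)
# a1 and a2 share b1 and b2 (weight 2); a2 and a3 share b3 (weight 1);
# a4 shares no callee with anyone and therefore gets no edge.
```

Grouping callers by callee avoids comparing every pair of callers directly, although a callee dialed by very many callers still produces a quadratic number of pairs, which is one reason the pruning in step seven matters.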

Step six, starting from the current seed node set Z, searching the communication bipartite graph P for the called nodes that have a call relationship with a calling party in Z, and expanding them into the candidate called node set B;

Step seven, for each called party in the candidate called node set B, removing it from B if the number of calls it has received is less than J1 or its average CDR fraud score is less than J2, and updating the retained called nodes into the set B';

J1 is the first percentile of the number of calls received by all called users;

J2 is the first percentile of the average CDR fraud scores of all called users;
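The percentile filter of step seven can be sketched as below. The patent does not say which percentile definition is used, so the nearest-rank definition here is an assumption, as are the function names and the dictionary inputs `calls_received` and `avg_score` (statistics over all called users).

```python
import math


def nearest_rank_percentile(values, p):
    """Nearest-rank percentile (0 < p <= 100) of a non-empty collection."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def prune_callees(candidates, calls_received, avg_score, p=1):
    """Keep candidates with call count >= J1 and average score >= J2.

    J1 and J2 are the p-th percentiles over ALL called users; p=1
    corresponds to the first percentile named in the text.
    """
    j1 = nearest_rank_percentile(calls_received.values(), p)
    j2 = nearest_rank_percentile(avg_score.values(), p)
    return {
        b for b in candidates
        if calls_received[b] >= j1 and avg_score[b] >= j2
    }


calls_received = {"b1": 10, "b2": 1, "b3": 5, "b4": 7}
avg_score = {"b1": 0.9, "b2": 0.8, "b3": 0.1, "b4": 0.6}
candidates = {"b1", "b2", "b3", "b4"}

# With only four called users the first percentile equals the minimum,
# so nothing is removed; p=50 is used here purely to make the toy
# example show an effect.
kept_first_percentile = prune_callees(candidates, calls_received, avg_score)
kept_median = prune_callees(candidates, calls_received, avg_score, p=50)
```

On real data with many called users, the first percentile removes only the extreme low tail of call counts and scores, which keeps the expansion conservative.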

Step eight, starting from the updated called node set B', searching the communication bipartite graph P for the calling nodes that have a call relationship with a called party in B', treating them as new suspected fraudulent calling parties, and updating them into the seed node set Z';

Step nine, returning to step five, deduplicating the calling nodes in the current calling collaborative network G against the updated seed node set Z', and updating the collaborative network G into the collaborative network G';

Step ten, calculating, for the current iteration, the ratio of the calling parties newly added in the updated collaborative network G' to the calling parties in the original collaborative network G;

Step eleven, judging whether the ratio is smaller than a manually set threshold; if so, the collaborative network is considered to have converged and step twelve is executed; otherwise, the updated collaborative network G' and the updated seed node set Z' are taken as the current network and seed set, and execution returns to step six;
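The loop formed by steps five to eleven can be sketched as follows, under several simplifying assumptions: the network is rebuilt from scratch in every round instead of being updated incrementally, the pruning rule is passed in as a hypothetical `keep_callee` predicate, and `max_rounds` is an added safety bound that the text does not mention.

```python
from collections import defaultdict
from itertools import combinations


def co_call_edges(callers, caller_to_callees):
    """Weighted caller-caller edges from common called neighbours."""
    index = defaultdict(set)
    for a in callers:
        for b in caller_to_callees.get(a, ()):
            index[b].add(a)
    edges = defaultdict(int)
    for group in index.values():
        for u, v in combinations(sorted(group), 2):
            edges[(u, v)] += 1
    return dict(edges)


def grow_network(seeds, caller_to_callees, callee_to_callers,
                 keep_callee, threshold, max_rounds=50):
    z = set(seeds)
    edges = co_call_edges(z, caller_to_callees)                 # step five
    for _ in range(max_rounds):
        old_nodes = {n for edge in edges for n in edge}
        # Step six: callees reachable from the current seed callers.
        b = {c for a in z for c in caller_to_callees.get(a, ())}
        # Step seven: prune the candidate callees.
        b_kept = {c for c in b if keep_callee(c)}
        # Step eight: callers of the retained callees join the seed set.
        z = z | {a for c in b_kept for a in callee_to_callers.get(c, ())}
        # Step nine: rebuild the network; set union already deduplicates.
        edges = co_call_edges(z, caller_to_callees)
        # Steps ten and eleven: stop once relative growth is small.
        new_nodes = {n for edge in edges for n in edge} - old_nodes
        if len(new_nodes) / max(1, len(old_nodes)) < threshold:
            break
    return z, edges


caller_to_callees = {"a1": {"b1"}, "a2": {"b1", "b2"},
                     "a3": {"b2"}, "a4": {"b3"}}
callee_to_callers = {"b1": {"a1", "a2"}, "b2": {"a2", "a3"}, "b3": {"a4"}}
callers, network = grow_network({"a1"}, caller_to_callees,
                                callee_to_callers,
                                keep_callee=lambda c: True, threshold=0.01)
```

In the toy data the expansion reaches a2 through b1 and a3 through b2, then stops because a third round adds no callers; a4 is never reached because it shares no callee with the seed's neighbourhood.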

Step twelve, applying a network representation learning method to the converged calling collaborative network for dimensionality reduction, so that each calling node obtains a D-dimensional embedding vector;

Step thirteen, using the unsupervised anomaly detection model One-Class SVM to model and predict on the embedding vector of each calling node, obtaining an anomaly score for each node, sorting the anomaly scores in descending order, and outputting the N nodes with the highest anomaly scores as the final detection result of the model.
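Random-walk based representation learning methods typically start by sampling node sequences from the graph and then train a skip-gram style model on them. The sketch below covers only the sequence-sampling stage, on a network given as a hypothetical `{(u, v): weight}` dictionary. Biasing each step by edge weight is an assumption made here so that the common-neighbor counts influence the walks; the embedding training and the One-Class SVM are not shown.

```python
import random
from collections import defaultdict


def to_adjacency(edges):
    """Turn {(u, v): w} into a symmetric adjacency map {u: {v: w}}."""
    adj = defaultdict(dict)
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w
    return adj


def random_walks(edges, walks_per_node=2, walk_length=5, seed=0):
    """Sample fixed-length walks; next hop is chosen in proportion to weight."""
    rng = random.Random(seed)
    adj = to_adjacency(edges)
    walks = []
    for _ in range(walks_per_node):
        nodes = sorted(adj)
        rng.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adj[walk[-1]]
                # Every node comes from an edge, so it has a neighbour.
                step = rng.choices(list(neighbours),
                                   weights=list(neighbours.values()))[0]
                walk.append(step)
            walks.append(walk)
    return walks


edges = {("a1", "a2"): 2, ("a2", "a3"): 1}
walks = random_walks(edges)
```

Each walk plays the role of a sentence whose words are calling nodes; callers that co-occur in many walks end up with nearby D-dimensional vectors once an embedding model is trained on the walks.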

The invention has the advantages that:

1) The telecommunication fraud subject detection method based on collaborative network representation learning uses the signaling-level anomaly scores of the classification model as the recall strategy for the seed node set, and thereby retains the advantages of traditional individual behavior modeling. At the same time, the credibility of the seed nodes is improved and the quality of the generated collaborative network is ensured.

2) The telecommunication fraud subject detection method based on collaborative network representation learning filters the called user set B with a customizable pruning strategy, such as a percentile statistics strategy. This network generation process not only reduces memory usage and improves computation speed, but can also adapt to different data characteristics: different pruning strategies can be defined to control the size and final form of the network.

3) The telecommunication fraud subject detection method based on collaborative network representation learning applies a network representation learning method to the calling collaborative network for dimensionality reduction and anomaly detection, and uses embedding vectors to characterize the degree of anomaly of a fraudulent caller from the perspective of group behavior and cooperativity.

Drawings

FIG. 1 is a schematic diagram of a telecommunication anomaly detection method based on collaborative network representation learning according to the present invention;

FIG. 2 is a flow chart of a telecommunication anomaly detection method based on collaborative network representation learning according to an embodiment of the present invention;

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples.

The invention relates to a telecommunication anomaly detection method based on collaborative network representation learning. As shown in FIG. 1, raw streaming signaling data is input and a signaling-level xgboost fraud classifier is trained. The signaling data is scored with the xgboost classifier and stored in a signaling database containing the anomaly scores, and the K calling nodes most suspected of fraud are screened out according to the signaling scores. These K calling nodes form the seed node set Z, and an empty set is initialized as the calling collaborative network G. If two calling parties share a common called neighbor, an edge is established between the two calling nodes and the calling collaborative network G is updated. Starting from the seed node set Z, called nodes are expanded on the communication bipartite graph P and filtered with user-defined pruning strategies such as black-and-white-list screening, statistical rule screening and signaling-level fraud score screening. Starting from the retained core called nodes, new suspected fraudulent calling nodes are expanded on the communication bipartite graph. After the iteration converges, a network representation learning method reduces the dimensionality of the converged calling collaborative network, the anomaly detection model One-Class SVM is fitted to the embeddings of the calling nodes, and the anomaly score of each node is output.

As shown in fig. 2, the specific steps are as follows:

Step one, collecting CDR data of a telecommunication network over a period of time, such as one week, manually selecting a portion of the CDR data carrying fraud labels, and training an xgboost classifier that predicts fraud at the signaling level from the static attributes of the calling-side call records.

Step two, inputting each CDR record into the xgboost classifier to obtain the probability that it belongs to the fraud class, taking this probability as the fraud score attribute of the CDR record, forming the signaling data set to be detected from all CDR records carrying the score attribute, and storing the signaling data set in an Elasticsearch database;

Step three, extracting the communication bipartite graph P between calling and called parties for the same day from the Elasticsearch database using networkx;

Step four, setting the number of seed nodes to 10000, recalling from the signaling data set, according to a customized recall strategy, the 10000 suspected fraudulent calling users with the highest signaling classification scores, generating the seed node set Z, and initializing the calling collaborative network G.

Step seven, for each called party in the candidate called node set B, removing it from B if the number of calls it has received is less than J1 or its average CDR fraud score is less than J2, and updating the retained called nodes into the set B';

J1 is the first percentile of the number of calls received by all called users;

In this embodiment, the candidate called node set B is filtered with a user-defined pruning strategy and the called node set is updated to B';

Specifically, the candidate called node set B is pruned by removing the called nodes on the white list, the called nodes that received fewer than 3 calls, and the called nodes whose signaling classification score is lower than 0.4, and the result is updated into the called node set B';
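The three pruning rules of this embodiment can be written as a single filter. This is a minimal sketch: the function name and the dictionary inputs are hypothetical, and the per-callee score is assumed to be the average signaling classification score of the calls that callee received.

```python
def prune_called_nodes(candidates, calls_received, avg_score,
                       whitelist, min_calls=3, min_score=0.4):
    """Drop whitelisted callees, callees with fewer than min_calls calls,
    and callees whose score is below min_score."""
    return {
        b for b in candidates
        if b not in whitelist
        and calls_received.get(b, 0) >= min_calls
        and avg_score.get(b, 0.0) >= min_score
    }


candidates = {"b1", "b2", "b3", "b4"}
calls_received = {"b1": 5, "b2": 2, "b3": 10, "b4": 3}
avg_score = {"b1": 0.9, "b2": 0.95, "b3": 0.2, "b4": 0.4}
whitelist = {"b1"}

b_prime = prune_called_nodes(candidates, calls_received, avg_score, whitelist)
# b1 is whitelisted, b2 received only 2 calls, b3 scores 0.2: only b4 stays.
```

Exposing the thresholds as parameters reflects the point made in the text that the pruning strategy can be customized to the data at hand.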

Step eight, starting from the updated called node set B', searching the communication bipartite graph P for the calling nodes that have a call relationship with a called party in B', treating them as new suspected fraudulent calling parties, and updating them into the seed node set Z';

In this embodiment the threshold is set to 0.0001. If the ratio of the number of newly added nodes to the total number of nodes in the collaborative network G is less than 0.001, the collaborative network G is considered to have converged.

In this embodiment, DeepWalk is used to reduce the dimensionality of the calling collaborative network, and each calling node obtains a 128-dimensional embedding vector;

Step thirteen, using the unsupervised anomaly detection model One-Class SVM to model and predict on the embedding vector of each calling node, obtaining an anomaly score for each node, sorting the anomaly scores in descending order, and outputting the N nodes with the highest anomaly scores as the final detection result of the model.

This embodiment selects the top 5000 callers as the final detection result of the model.
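The final ranking can be sketched with the standard library, assuming the anomaly scores are already available as a hypothetical `{caller: score}` dictionary in which a larger value means more anomalous.

```python
import heapq


def top_n_callers(anomaly_scores, n):
    """Return the n (caller, score) pairs with the highest anomaly scores,
    ordered from most to least anomalous."""
    return heapq.nlargest(n, anomaly_scores.items(), key=lambda kv: kv[1])


anomaly_scores = {"a1": 0.2, "a2": 0.9, "a3": 0.5, "a4": 0.7}
result = top_n_callers(anomaly_scores, n=2)
```

If the scores come from a model whose decision value is larger for inliers (the `decision_function` of scikit-learn's `OneClassSVM` is signed that way), the value would need to be negated before ranking. For this embodiment `n` would be 5000.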

By classifying the raw CDR data at the signaling level, the invention characterizes the degree of anomaly of a fraudulent caller from the perspective of individual behavior. The signaling-level anomaly scores also serve as the recall strategy that forms the candidate set of suspected fraudulent callers. When two calling parties share a common called neighbor, an edge is established between the calling nodes with the number of common neighbors as its weight, which produces the collaborative network of suspected fraudulent callers. To ease the problem of training on large-scale data, the collaborative network is grown by iterative updating. Specifically, a seed node set of suspected fraudulent callers is initialized according to the recall strategy, and the corresponding called nodes are expanded from the seed nodes on the communication bipartite graph. Useless called nodes are filtered out according to a customized pruning strategy to produce a core called node set, from which a new set of suspected fraudulent callers is expanded on the communication bipartite graph. This growth process is repeated until the collaborative network converges. The network generation process therefore not only reduces memory usage and improves computation speed, but also allows different pruning strategies to be defined according to the data characteristics so as to control the size and final form of the network. After the collaborative network is generated, a network representation learning method is applied to it for dimensionality reduction and anomaly detection, and embedding vectors are used to characterize the degree of anomaly of a fraudulent caller from the perspective of group behavior and cooperativity.

While embodiments of the invention have been described above, it is not limited to the applications set forth in the description and the embodiments, which are fully applicable in various fields of endeavor to which the invention pertains, and further modifications may readily be made by those skilled in the art, it being understood that the invention is not limited to the details shown and described herein without departing from the general concept defined by the appended claims and their equivalents.

Claims

1. A telecommunication anomaly detection method based on collaborative network representation learning is characterized by comprising the following specific steps:

step one, collecting CDR data of a telecommunication network in a period of time, manually selecting partial CDR data with a fraud tag, and training an xgboost classifier for predicting fraud at a signaling level by combining a static attribute of a calling end call record;

step five, analyzing the calling nodes in the current calling collaborative network G and the current seed node set Z; if any two calling users have both dialed the same called user, that is, they share a common called neighbor on the bipartite graph, establishing an undirected edge between the two calling nodes, taking the number of common neighbors as the edge weight, and updating the collaborative network G; updating the collaborative network G means adding the two calling nodes that share a common called neighbor, together with their edge and its weight, to the current calling collaborative network G; the initial value of the current calling collaborative network G is empty, and the initial value of the current seed node set Z is the K calling parties with the highest average scores selected in step four;

J1 is the first percentile of the number of calls received by all called users;

2. The method as claimed in claim 1, wherein the static attributes of step one include call duration, called location, calling location and call waiting duration.