CN114331473A

CN114331473A - Method and device for identifying telecommunication fraud event and computer-readable storage medium

Info

Publication number: CN114331473A
Application number: CN202111645526.1A
Authority: CN
Inventors: 柳清译; 孙从阳; 徐德华; 胡炳慈; 邹璐; 王艳
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2021-12-29
Filing date: 2021-12-29
Publication date: 2022-04-12

Abstract

The invention discloses a method and a device for identifying telecommunication fraud events, and a computer-readable storage medium. The method comprises: obtaining target telecommunication data; analyzing the target telecommunication data to obtain a target feature sequence in the target telecommunication data; determining, through an information prediction model, a telecommunication data identification result corresponding to the target feature sequence, wherein the information prediction model is obtained through machine learning training using multiple groups of training data, each group of training data comprising a historical feature sequence and the historical telecommunication data identification result corresponding to that historical feature sequence, the historical feature sequence being obtained by analyzing historical telecommunication data; and determining, based on the telecommunication data identification result, whether the target telecommunication data presents a fraud risk. The invention solves the technical problems that, in the related art, means for identifying telecommunication fraud are insufficiently intelligent, have limited coverage, and rely on identification models of low accuracy.

Description

Method and device for identifying telecommunication fraud event and computer-readable storage medium

Technical Field

The invention relates to the field of information identification, in particular to a method and a device for identifying a telecommunication fraud event and a computer-readable storage medium.

Background

With the rapid development of internet technology, people's lives are changing day by day. The continuing spread of telecommunication services brings new conveniences such as online shopping, money transfers and social networking. At the same time, fraudsters use network communication to carry out targeted, remote, non-contact fraud: victims are tricked into transferring money or disclosing personal information, resulting in financial loss or leakage of personal privacy.

Existing telecommunication fraud identification technologies fall broadly into three types: manual labeling, expert rules, and machine learning models. In manual labeling, labels are defined at the discretion of users or of mobile phone security software, so the results are uncertain and unstable. Expert rules require staff with extensive business experience and have limited coverage; because they are essentially summaries and inductions of historical fraud events, they cannot anticipate new fraud types. Machine learning models are widely used: they take call or location data and apply clustering or classification models for identification. However, their accuracy drops when they face multiple types of fraud patterns, and a single model has a low identification rate because of its limited capability.

Therefore, manual labeling and expert rules in the related art suffer from small coverage and high labor cost and cannot meet the requirement of flow clustering, while a single machine learning model suffers from low accuracy when faced with multiple types of fraud patterns.

In view of the above problems, no effective solution has been proposed.

Disclosure of Invention

The embodiment of the invention provides a method and a device for identifying telecommunication fraud events and a computer-readable storage medium, which are used for at least solving the technical problems of low intellectualization, small coverage and low accuracy of identification models of means for identifying telecommunication fraud in the related art.

According to an aspect of an embodiment of the present invention, there is provided a method for identifying a telecommunication fraud event, including: obtaining target telecommunication data, wherein the target telecommunication data includes at least one of: call behavior data, short message behavior data and internet traffic data; analyzing the target telecommunication data to obtain a target feature sequence in the target telecommunication data; determining, through an information prediction model, a telecommunication data identification result corresponding to the target feature sequence, wherein the information prediction model is obtained through machine learning training using multiple groups of training data, and each group of training data comprises: a historical feature sequence and the historical telecommunication data identification result corresponding to the historical feature sequence, wherein the historical feature sequence is obtained by analyzing historical telecommunication data; and determining whether the target telecommunication data presents a fraud risk based on the telecommunication data identification result.

Optionally, analyzing the target telecommunication data to obtain a target feature sequence in the target telecommunication data, including: carrying out feature summarization on the target telecommunication data according to a preset dimension to obtain feature data of the target telecommunication data; and filtering the characteristic data of the target telecommunication data to obtain the target characteristic sequence.

Optionally, the call behavior data and the short message behavior data in the target telecommunication data each include at least one of the following: time, number of times, call type, roaming type, peer number and home location; the internet traffic data in the target telecommunication data includes at least one of the following: internet access time, internet access duration, number of internet accesses and internet traffic bytes.

Optionally, before determining, by the information prediction model, the telecommunication data recognition result corresponding to the target feature sequence, the method further includes: acquiring historical telecommunication data in a historical time period; performing feature summarization on the historical telecommunication data according to the preset dimension to obtain historical feature data of the historical telecommunication data; integrating the historical characteristic data according to a preset time unit to obtain a plurality of groups of historical characteristic data sequences; filtering each group of the multiple groups of historical characteristic data sequences to obtain multiple historical characteristic sequences; determining historical telecommunication data identification results corresponding to the plurality of historical characteristic sequences to obtain the plurality of groups of training data; and training a preset network model by using the multiple groups of training data to obtain the information prediction model.

Optionally, the performing feature summarization on the historical telecommunication data according to the predetermined dimension to obtain historical feature data of the historical telecommunication data includes: clustering a plurality of historical data collection objects based on the historical telecommunication data to obtain a group fraud mode; and performing feature summarization on the historical telecommunication data based on the group fraud mode and the predetermined dimensionality to obtain historical feature data of the historical telecommunication data.

Optionally, clustering historical data collection objects based on the historical telecommunication data to obtain a group fraud pattern, comprising: determining a target historical data acquisition object of the plurality of historical data acquisition objects; searching all adjacent historical data acquisition objects of the target historical data acquisition object; traversing the adjacent historical data acquisition objects to obtain a historical data acquisition object group; acquiring preset characteristic information of the historical data acquisition object group; clustering the historical data collection objects based on the predetermined characteristic information to obtain the group fraud mode.

Optionally, determining whether the target telecommunications data is at risk of fraud based on the telecommunications data identification result, comprising: generating a prompt message upon determining that the target telecommunications data is at a fraud risk based on the telecommunications data identification result.

According to another aspect of the embodiment of the present invention, there is also provided an apparatus for identifying a telecommunication fraud event, including: a first obtaining module, configured to obtain target telecommunication data, wherein the target telecommunication data includes at least one of: call behavior data, short message behavior data and internet traffic data; an analysis module, configured to analyze the target telecommunication data to obtain a target feature sequence in the target telecommunication data; a first determining module, configured to determine, through an information prediction model, a telecommunication data identification result corresponding to the target feature sequence, wherein the information prediction model is obtained through machine learning training using multiple groups of training data, and each group of training data comprises: a historical feature sequence and the historical telecommunication data identification result corresponding to the historical feature sequence, wherein the historical feature sequence is obtained by analyzing historical telecommunication data; and a second determining module, configured to determine whether the target telecommunication data presents a fraud risk based on the telecommunication data identification result.

Optionally, the analysis module comprises: the first characteristic summarizing unit is used for summarizing the characteristics of the target telecommunication data according to a preset dimension to obtain the characteristic data of the target telecommunication data; and the filtering unit is used for filtering the characteristic data of the target telecommunication data to obtain the target characteristic sequence.

Optionally, the apparatus further comprises: the second acquisition module is used for acquiring historical telecommunication data in a historical time period before a telecommunication data identification result corresponding to the target characteristic sequence is determined through an information prediction model; the characteristic summarizing module is used for summarizing the characteristics of the historical telecommunication data according to the preset dimensionality to obtain historical characteristic data of the historical telecommunication data; the integration module is used for integrating the historical characteristic data according to a preset time unit to obtain a plurality of groups of historical characteristic data sequences; the filtering module is used for filtering each group of the multiple groups of historical characteristic data sequences to obtain multiple historical characteristic sequences; a third determining module, configured to determine historical telecommunication data recognition results corresponding to the multiple historical feature sequences to obtain the multiple sets of training data; and the training module is used for training a predetermined network model by using the plurality of groups of training data to obtain the information prediction model.

Optionally, the feature summarizing module includes: the clustering unit is used for clustering a plurality of historical data acquisition objects based on the historical telecommunication data to obtain a group fraud mode; a second feature summarizing unit, configured to perform feature summarization on the historical telecommunication data based on the group fraud mode and the predetermined dimension, so as to obtain historical feature data of the historical telecommunication data.

Optionally, the clustering unit includes: a determination subunit configured to determine a target historical data acquisition object of the plurality of historical data acquisition objects; a searching subunit, configured to search all neighboring historical data acquisition objects of the target historical data acquisition object; the traversing subunit is used for traversing the adjacent historical data acquisition objects to obtain a historical data acquisition object group; the acquisition subunit is used for acquiring preset characteristic information of the historical data acquisition object group; and the clustering subunit is used for clustering the historical data acquisition objects based on the predetermined characteristic information to obtain the group fraud mode.

Optionally, the second determining module includes: a generating unit, configured to generate prompt information when it is determined that the target telecommunication data has a fraud risk based on the telecommunication data identification result.

According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, comprising a stored computer program, wherein when the computer program is executed by a processor, the computer program controls a device in which the computer-readable storage medium is located to execute the method for identifying a telecommunication fraud event.

According to another aspect of the embodiments of the present invention, there is also provided a processor for running a computer program, wherein the computer program, when run, performs the method for identifying a telecommunication fraud event described in any one of the above.

In an embodiment of the present invention, target telecommunication data is obtained, wherein the target telecommunication data includes at least one of: call behavior data, short message behavior data and internet traffic data; the target telecommunication data is analyzed to obtain a target feature sequence in the target telecommunication data; a telecommunication data identification result corresponding to the target feature sequence is determined through an information prediction model, wherein the information prediction model is obtained through machine learning training using multiple groups of training data, and each group of training data comprises: a historical feature sequence and the historical telecommunication data identification result corresponding to the historical feature sequence, the historical feature sequence being obtained by analyzing historical telecommunication data; and whether the target telecommunication data presents a fraud risk is determined based on the telecommunication data identification result. With this method for identifying telecommunication fraud events, whether a telecommunication fraud risk exists is determined from the target telecommunication data and the information prediction model, which makes telecommunication fraud identification more intelligent and thereby solves the technical problems that, in the related art, means for identifying telecommunication fraud are insufficiently intelligent, have limited coverage, and rely on identification models of low accuracy.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flow chart of an identification method of a telecommunication fraud event according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a converged deep learning network framework, according to an embodiment of the invention;

FIG. 3 is a schematic diagram of neural network and integrated model joint training in accordance with an embodiment of the present invention;

FIG. 4 is a flow chart of a preferred method of identifying telecommunication fraud events according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of an identification apparatus of a telecommunication fraud event according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Example 1

According to an embodiment of the present invention, a method embodiment of a method for identifying a telecommunication fraud event is provided. It should be noted that the steps illustrated in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be executed in an order different from the one given here.

FIG. 1 is a flow chart of an identification method of a telecommunication fraud event according to an embodiment of the present invention, as shown in FIG. 1, the method comprises the following steps:

Step S102, obtaining target telecommunication data, wherein the target telecommunication data comprises at least one of the following: call behavior data, short message behavior data and internet traffic data.

Optionally, in the above step, the call and short message behavior data include but are not limited to: call time, number of calls/short messages, call type, call roaming type, call/short message peer number, call/short message home location, and the like; the internet traffic data include but are not limited to: internet access time, internet access duration, number of internet accesses, internet traffic bytes, and the like.

Step S104, analyzing the target telecommunication data to obtain a target feature sequence in the target telecommunication data.

Optionally, in the above step, the target feature sequence is a feature sequence obtained by clustering the users by using a predetermined algorithm.

Step S106, determining, through an information prediction model, a telecommunication data identification result corresponding to the target feature sequence, wherein the information prediction model is obtained through machine learning training using multiple groups of training data, and each group of training data comprises: a historical feature sequence and the historical telecommunication data identification result corresponding to the historical feature sequence, wherein the historical feature sequence is obtained by analyzing historical telecommunication data.

Optionally, the information prediction model is obtained by machine learning training using multiple sets of training data, where the multiple sets of training data include but are not limited to: historical signature sequences and historical telecommunication data identification results corresponding to the historical signature sequences.

Step S108, determining whether the target telecommunication data presents a fraud risk based on the telecommunication data identification result.

As can be seen from the above, in the embodiment of the present invention, target telecommunication data may first be obtained, wherein the target telecommunication data includes at least one of the following: call behavior data, short message behavior data and internet traffic data. The target telecommunication data may then be analyzed to obtain a target feature sequence in the target telecommunication data. Next, the telecommunication data identification result corresponding to the target feature sequence may be determined through an information prediction model, wherein the information prediction model is obtained through machine learning training using multiple groups of training data, and each group of training data comprises: a historical feature sequence and the historical telecommunication data identification result corresponding to the historical feature sequence, the historical feature sequence being obtained by analyzing historical telecommunication data. Finally, whether the target telecommunication data presents a fraud risk may be determined based on the telecommunication data identification result. With this method for identifying telecommunication fraud events, whether a telecommunication fraud risk exists is determined from the target telecommunication data and the information prediction model, which makes telecommunication fraud identification more intelligent and thereby solves the technical problems that, in the related art, means for identifying telecommunication fraud are insufficiently intelligent, have limited coverage, and rely on identification models of low accuracy.
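Steps S102 to S108 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the record schema, the `RecognitionResult` type, the `identify_fraud` function, the toy analysis and model functions, and the 0.5 decision threshold are all hypothetical and are not specified by the embodiment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Record = Dict[str, object]            # one raw telecom record (hypothetical schema)
FeatureSequence = List[List[float]]   # one feature vector per time unit


@dataclass
class RecognitionResult:
    # ASSUMPTION: the identification result is modeled as a score in [0, 1].
    fraud_score: float


def identify_fraud(
    records: Sequence[Record],                              # S102: obtained data
    analyze: Callable[[Sequence[Record]], FeatureSequence],
    model: Callable[[FeatureSequence], RecognitionResult],
    threshold: float = 0.5,                                 # ASSUMPTION: cut-off
) -> bool:
    sequence = analyze(records)              # S104: target feature sequence
    result = model(sequence)                 # S106: identification result
    return result.fraud_score >= threshold   # S108: fraud-risk decision


def toy_analyze(records: Sequence[Record]) -> FeatureSequence:
    # Single-day sequence with one feature: the number of call records.
    return [[float(sum(1 for r in records if r.get("kind") == "call"))]]


def toy_model(sequence: FeatureSequence) -> RecognitionResult:
    # Stand-in for the trained information prediction model.
    return RecognitionResult(fraud_score=min(sequence[-1][0] / 100.0, 1.0))
```

In this sketch the trained model and the analysis step are passed in as functions, so either can be replaced without touching the decision logic of step S108.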

As an alternative embodiment, analyzing the target telecommunication data to obtain a target feature sequence in the target telecommunication data includes: carrying out feature summarization on the target telecommunication data according to a preset dimension to obtain feature data of the target telecommunication data; and filtering the characteristic data of the target telecommunication data to obtain a target characteristic sequence.

In the above optional embodiment, feature summarization is first performed on the acquired target telecommunication data based on a set threshold (i.e. a predetermined dimension), and invalid interfering information is then filtered out to obtain the target feature sequence.

Further, the daily telecommunication data is summarized and processed, for example by counting the number of calls a user makes as the calling party in one day, to obtain the user's daily summarized feature sequence data. Features with poor data quality are then removed, for example features with a high rate of missing values or unique values, or features that are unstable and fluctuate strongly over time, and the usable feature information is finally determined.
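The summarization and filtering described above might look like the following Python sketch. The row schema, the function names `summarize_daily` and `keep_feature`, and all three thresholds are assumptions made for illustration. In particular, the "unique value" criterion is interpreted here as one value dominating the feature, and instability is approximated by the coefficient of variation; the embodiment does not define either measure.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Optional, Tuple

# (user_id, day, feature_name, value); hypothetical row schema.
Row = Tuple[str, int, str, float]


def summarize_daily(rows: Iterable[Row]) -> Dict[Tuple[str, int, str], float]:
    """Sum each feature per user and per day, e.g. outgoing calls per day."""
    totals: Dict[Tuple[str, int, str], float] = defaultdict(float)
    for user, day, feature, value in rows:
        totals[(user, day, feature)] += value
    return dict(totals)


def keep_feature(
    values: List[Optional[float]],
    max_missing: float = 0.5,    # ASSUMPTION: tolerated share of missing values
    max_dominant: float = 0.95,  # ASSUMPTION: tolerated share of a single value
    max_cv: float = 3.0,         # ASSUMPTION: tolerated coefficient of variation
) -> bool:
    """Return False for a feature whose data quality is considered poor."""
    present = [v for v in values if v is not None]
    if not present or 1 - len(present) / len(values) > max_missing:
        return False  # too many missing values
    dominant_share = Counter(present).most_common(1)[0][1] / len(present)
    if dominant_share > max_dominant:
        return False  # almost always the same value
    avg = mean(present)
    if avg != 0 and pstdev(present) / abs(avg) > max_cv:
        return False  # fluctuates too strongly
    return True
```

A feature is kept only if it passes all three checks; the thresholds would in practice have to be tuned on real data.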

As an optional embodiment, the call behavior data and the short message behavior data in the target telecommunication data each include at least one of the following: time, number of times, call type, roaming type, peer number and home location; the internet traffic data in the target telecommunication data includes at least one of the following: internet access time, internet access duration, number of internet accesses and internet traffic bytes.

In the above optional embodiment, telecommunication data integrating three dimensions, namely the user's daily call behavior, short message behavior and internet traffic, is obtained. The call and short message data include: time, number of times, call type, roaming type, peer number, home location, etc.; the internet traffic data include: time, duration, number of times, traffic bytes, etc.

As an alternative embodiment, before determining the telecommunication data identification result corresponding to the target feature sequence through the information prediction model, the identification method of telecommunication fraud events further comprises: acquiring historical telecommunication data in a historical time period; performing feature summarization on historical telecommunication data according to a preset dimension to obtain historical feature data of the historical telecommunication data; integrating the historical characteristic data according to a preset time unit to obtain a plurality of groups of historical characteristic data sequences; filtering each group of the multiple groups of historical characteristic data sequences to obtain multiple historical characteristic sequences; determining historical telecommunication data recognition results corresponding to the plurality of historical characteristic sequences to obtain a plurality of groups of training data; and training the predetermined network model by using a plurality of groups of training data to obtain an information prediction model.
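As a rough illustration of how the historical feature data could be integrated by time unit and paired with identification results, consider the sketch below. The nested dictionary layout, the name `build_training_groups`, the choice of one day as the time unit, the fixed window length and the zero-filling of absent features are all assumptions, not details given by the embodiment.

```python
from typing import Dict, List, Tuple

DailyFeatures = Dict[int, Dict[str, float]]    # day index -> {feature: value}
TrainingGroup = Tuple[List[List[float]], int]  # (feature sequence, label)


def build_training_groups(
    daily: Dict[str, DailyFeatures],
    labels: Dict[str, int],      # 1 = fraud sample, 0 = normal sample
    feature_names: List[str],
    window: int,                 # number of days per sequence
) -> List[TrainingGroup]:
    """Turn per-user daily features into (sequence, label) training groups."""
    groups: List[TrainingGroup] = []
    for user, by_day in daily.items():
        if user not in labels:
            continue  # no identification result available for this user
        days = sorted(by_day)[-window:]  # the most recent `window` days
        if len(days) < window:
            continue  # not enough history to form a full sequence
        # ASSUMPTION: a feature absent on a given day is filled with 0.0.
        sequence = [
            [by_day[day].get(name, 0.0) for name in feature_names]
            for day in days
        ]
        groups.append((sequence, labels[user]))
    return groups
```

Each resulting group corresponds to one set of training data: a historical feature sequence together with its historical identification result.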

FIG. 2 is a schematic diagram of a converged deep learning network framework according to an embodiment of the present invention. As shown in FIG. 2, the steps of building the converged deep learning network framework are described in detail as follows:

Step 1) Initializing a mapping matrix. For the user group $S$, let the number of user sample features be $m$ and the number of mapped features be $n$; the mapping matrix is then $W \in \mathbb{R}^{m \times n}$. By default, the matrix elements are random values drawn from a normal distribution with mean 0 and a standard deviation whose expression is not reproduced in this text.
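A minimal sketch of such an initialization is given below. Because the standard deviation expression is missing from the text, the sketch takes it as an explicit parameter `std` rather than guessing a value; the function name `init_mapping_matrix` and the optional `seed` argument are likewise illustrative assumptions.

```python
import random
from typing import List, Optional


def init_mapping_matrix(
    m: int, n: int, std: float, seed: Optional[int] = None
) -> List[List[float]]:
    """Return an m x n matrix with entries drawn from N(0, std^2).

    ASSUMPTION: `std` must be supplied by the caller, since the source
    does not reproduce the standard deviation it uses.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(n)] for _ in range(m)]
```

For example, `init_mapping_matrix(8, 4, std=0.1, seed=0)` yields a reproducible 8 x 4 matrix for experimentation.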

Step 2) Constructing the fusion deep learning network framework to generate a feature representation vector. Because the users' daily feature data has a time-series nature, a deep learning network can mine deep features from the shallow features and extract more of the user sample information. Two deep learning networks, LSTM and CNN, are therefore selected, and they share an embedding layer (the mapping matrix W).

It should be noted that the LSTM (long short-term memory neural network) builds on the recurrent neural network and uses a gate mechanism and memory cells to control the flow of information, which addresses the long-term dependence problem; it shares parameters along the time dimension and serves to memorize and integrate information. The CNN (convolutional neural network) is formed by stacking and interleaving convolutional layers, pooling layers and fully connected layers; it is characterized by local connections and weight sharing, and serves to extract local information and derive global information from it.

The specific execution steps for obtaining the target data feature vector are as follows:

1. For user sample $x$, the user's daily sequence feature data $\{p_{x,t},\ t = w, \ldots, 2, 1\}$ is mapped through the embedding layer into a dense vector sequence $\{q_{x,t},\ t = w, \ldots, 2, 1\}$.

2. The dense vector sequence $\{q_{x,t},\ t = w, \ldots, 2, 1\}$ is input to the LSTM, and the information is encoded to obtain a sequence of hidden layer units $\{h_{x,t},\ t = w, \ldots, 2, 1\}$, where the hidden layer unit $h_{x,t}$ depends not only on the input $q_{x,t}$ but also on the hidden layer unit $h_{x,t-1}$ and the memory cell $c_{x,t-1}$ of the previous moment, i.e. $h_{x,t} = f(q_{x,t}, h_{x,t-1}, c_{x,t-1}, U, V, b)$, where $f$ is the process function, $U$ and $V$ are the weight matrices, and $b$ is the bias term.

3. The dense vector sequence $\{q_{x,t},\ t = w, \ldots, 2, 1\}$ is used as the input of a one-dimensional CNN. The convolution kernels slide and convolve only along the date direction, and the vector $s_x$ is obtained through pooling and padding (zero padding), i.e. $s_x = g(\{q_{x,t}\}, \{M_i\}, b)$, where $g$ is the process function, $\{M_i\}$ is the sequence of convolution kernels, $i$ is the convolution kernel number, and $b$ is the bias term.

4. The vectors $h_{x,1}$ and $s_x$ are concatenated to obtain the feature representation vector of user sample $x$: $o_x = [h_{x,1}, s_x]$.
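The four steps above can be sketched in plain Python as follows. This is a simplified illustration and not the embodiment's implementation: the standard LSTM gate equations, the stacking of the gate parameters inside `U`, `V` and `b`, the use of max pooling, a single shared convolution bias, and all function names are assumptions. The sequence is processed in the order given, and the last hidden state plays the role of the hidden unit that is concatenated with the CNN output.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(mat: Matrix, vec: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


def embed(p: Vector, W: Matrix) -> Vector:
    """Step 1: map an m-dimensional daily vector to n dimensions, q = p W."""
    return [sum(p[i] * W[i][j] for i in range(len(p))) for j in range(len(W[0]))]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(q: Vector, h_prev: Vector, c_prev: Vector,
              U: Matrix, V: Matrix, b: Vector) -> Tuple[Vector, Vector]:
    """Step 2: one LSTM update; U, V, b stack the input, forget, output
    and candidate parameters (4 * hidden rows in total)."""
    n = len(h_prev)
    z = [u + v + k for u, v, k in zip(matvec(U, q), matvec(V, h_prev), b)]
    i_gate = [sigmoid(x) for x in z[:n]]
    f_gate = [sigmoid(x) for x in z[n:2 * n]]
    o_gate = [sigmoid(x) for x in z[2 * n:3 * n]]
    cand = [math.tanh(x) for x in z[3 * n:]]
    c = [f * cp + i * g for f, cp, i, g in zip(f_gate, c_prev, i_gate, cand)]
    h = [o * math.tanh(ck) for o, ck in zip(o_gate, c)]
    return h, c


def conv1d_max(qs: List[Vector], kernels: List[Matrix], bias: float) -> Vector:
    """Step 3: slide each kernel along the day axis only (zero padding at
    the end), then max-pool over days. One output value per kernel."""
    width, dim = len(kernels[0]), len(qs[0])
    padded = qs + [[0.0] * dim for _ in range(width - 1)]
    pooled = []
    for kernel in kernels:  # each kernel has shape width x dim
        acts = [
            bias + sum(k * x
                       for k_row, x_row in zip(kernel, padded[s:s + width])
                       for k, x in zip(k_row, x_row))
            for s in range(len(qs))
        ]
        pooled.append(max(acts))
    return pooled


def feature_vector(ps: List[Vector], W: Matrix, U: Matrix, V: Matrix,
                   b: Vector, kernels: List[Matrix], conv_bias: float) -> Vector:
    hidden = len(b) // 4
    qs = [embed(p, W) for p in ps]                 # shared embedding layer
    h, c = [0.0] * hidden, [0.0] * hidden
    for q in qs:
        h, c = lstm_step(q, h, c, U, V, b)
    return h + conv1d_max(qs, kernels, conv_bias)  # step 4: concatenation
```

The resulting vector has one entry per LSTM hidden unit followed by one entry per convolution kernel, which matches the concatenated form of the feature representation vector.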

As an alternative embodiment, the performing feature summarization on historical telecommunication data according to a predetermined dimension to obtain historical feature data of the historical telecommunication data includes: clustering a plurality of historical data acquisition objects based on historical telecommunication data to obtain a group fraud mode; and carrying out feature summarization on the historical telecommunication data based on the group fraud mode and the predetermined dimensionality to obtain historical feature data of the historical telecommunication data.

In the above optional embodiment, the source objects of the historical telecommunication data (the historical data acquisition objects) are clustered on the basis of that data to obtain categorized historical telecommunication fraud data, i.e. group fraud patterns; the historical telecommunication data is then summarized by feature based on the group fraud patterns and the predetermined dimension to obtain the historical feature data of the historical telecommunication data.

It should be noted that, in the clustering, a data set is divided into different classes or clusters according to a certain specific criterion (e.g., distance), so that the similarity of data objects in the same cluster is as large as possible, and the difference of data objects not in the same cluster is also as large as possible, that is, the clustered data of the same class are gathered together as much as possible, and the data of different classes are separated as much as possible.

As an alternative embodiment, clustering historical data collection objects based on historical telecommunication data to obtain group fraud patterns includes: determining a target historical data acquisition object of a plurality of historical data acquisition objects; searching all adjacent historical data acquisition objects of the target historical data acquisition object; traversing adjacent historical data acquisition objects to obtain a historical data acquisition object group; acquiring preset characteristic information of a historical data acquisition object group; and clustering the historical data acquisition objects based on the preset characteristic information to obtain a group fraud mode.
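The neighbor search and traversal described above resemble a graph traversal, and can be sketched as follows. The embodiment does not say how adjacency between data acquisition objects is defined or which clustering algorithm is used, so this sketch assumes a symmetric adjacency map (for example, users who contacted each other) and uses a breadth-first traversal; `expand_group` and `find_groups` are hypothetical names. The final clustering on the groups' characteristic information is not shown.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def expand_group(target: str, neighbors: Dict[str, Iterable[str]]) -> Set[str]:
    """Traverse all objects reachable from the target via adjacent objects."""
    seen = {target}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def find_groups(neighbors: Dict[str, Iterable[str]]) -> List[Set[str]]:
    """Partition all objects into groups; ASSUMES symmetric adjacency."""
    groups: List[Set[str]] = []
    assigned: Set[str] = set()
    for node in sorted(neighbors):
        if node not in assigned:
            group = expand_group(node, neighbors)
            assigned |= group
            groups.append(group)
    return groups
```

Characteristic information such as group size or averaged features could then be computed per group and used as the input of a subsequent clustering step.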

FIG. 3 is a schematic diagram of the joint training of a neural network and an integrated model according to an embodiment of the present invention. As shown in FIG. 3, for a user group $S$, the feature representation vectors $o_x$, $x \in S$ (i.e., the target feature sequences) are obtained through the fused deep learning network. To learn the feature representation and train the models, a neural network $NN_S$ and an integrated model $F_S$ are trained jointly. Let the losses of the two models on a user sample $x \in S$ be $l_{NN,x}$ and $l_{F,x}$ respectively; the final loss of the sample can then be set to $l_x = \alpha l_{NN,x} + (1-\alpha) l_{F,x}$, and the overall loss is formed from the sample losses $l_x$ over all $x \in S$.

The feature representation and the two models are learned by minimizing the overall loss.

As an alternative embodiment, determining whether the target telecommunication data has a fraud risk based on the telecommunication data identification result includes: generating prompt information when it is determined, based on the telecommunication data identification result, that the target telecommunication data has a fraud risk.

FIG. 4 is a flowchart of a preferred telecommunication fraud event identification method according to an embodiment of the present invention, and as shown in FIG. 4, the embodiment of the present invention provides a telecommunication fraud identification method based on a clustering user grouping and deep learning fusion model, and the specific steps thereof are described in detail below:

Step one, data preparation and preprocessing:

1) determining fraud user samples and normal user samples according to existing fraud report leads;

2) acquiring telecommunication data that integrates three dimensions of each user's daily behavior over the most recent w days: call behavior, short message behavior, and internet traffic. The call and short message data include: time, number of times, call type, roaming type, peer number, home location, etc.; the internet traffic data include: time, duration, number of times, traffic bytes, etc.;

3) summarizing and processing the daily telecommunication data, for example counting the number of calls a user makes as the calling party in one day, to obtain each user's daily summarized feature sequence data over the most recent w days; removing features with poor data quality, for example deleting features with a high proportion of missing values or unique values, or features that are unstable and fluctuate strongly over time; and finally determining m available features.
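The daily summarization and feature filtering in item 3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the record layout `(user_id, day_index, feature_name, value)`, the function names, and the `max_missing_rate` threshold are hypothetical, and only the missing-value criterion is shown (the unique-value and stability criteria are omitted).

```python
from collections import defaultdict


def daily_feature_sequences(records, feature_names, num_days):
    """Sum raw per-event values into one feature vector per user per day.

    records: iterable of (user_id, day_index, feature_name, value), where
    day_index runs from 1 (most recent day) to num_days (oldest day).
    Returns {user_id: rows}, rows ordered from day num_days down to day 1.
    A feature with no record on a given day is stored as None (missing).
    """
    totals = defaultdict(lambda: defaultdict(dict))
    for user, day, name, value in records:
        day_totals = totals[user][day]
        day_totals[name] = day_totals.get(name, 0) + value
    return {
        user: [
            [days[d].get(name) if d in days else None for name in feature_names]
            for d in range(num_days, 0, -1)
        ]
        for user, days in totals.items()
    }


def drop_poor_features(sequences, feature_names, max_missing_rate=0.5):
    """Keep only features whose share of missing values is acceptable.

    max_missing_rate is a hypothetical threshold chosen for illustration.
    """
    missing = [0] * len(feature_names)
    total_rows = 0
    for rows in sequences.values():
        for row in rows:
            total_rows += 1
            for j, value in enumerate(row):
                if value is None:
                    missing[j] += 1
    keep = [
        j for j in range(len(feature_names))
        if total_rows and missing[j] / total_rows <= max_missing_rate
    ]
    kept_names = [feature_names[j] for j in keep]
    filtered = {
        user: [[row[j] for j in keep] for row in rows]
        for user, rows in sequences.items()
    }
    return kept_names, filtered
```

The number of names returned by `drop_poor_features` plays the role of m, the count of available features.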

Step two, clustering users, determining a group fraud mode:

1) Clustering users with the DBSCAN algorithm. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based spatial clustering algorithm. It divides regions of sufficient density into clusters, can find clusters of arbitrary shape in a spatial database with noise, and defines a cluster as the largest set of density-connected points. The specific execution steps are as follows:

Step 1. Define the sample distance $d_{i,j}$ between two users $i$ and $j$; the distance can be the Euclidean distance, the Mahalanobis distance, or the like.

Step 2. Randomly select a user sample $x$, and find the set $S_{x,\varepsilon}$ of all user neighbors whose distance from $x$ is less than or equal to $\varepsilon$. If the number of users in $S_{x,\varepsilon}$ is less than $u_{min}$, the user sample is marked as noise; otherwise it is marked as a core sample and assigned a new cluster label.

Step 3. Visit the neighbors $S_{x,\varepsilon}$ of this user. If a neighbor has not yet been assigned to a cluster, assign it the newly created cluster label. If a neighbor is a core sample, visit its neighbors in turn, and so on. The cluster grows gradually until there are no more core samples within distance $\varepsilon$ of the cluster.

Step 4. Select another user sample that has not yet been visited and repeat the process of Steps 1 to 3 until all samples have been visited, finally obtaining the user clusters (groups).
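The four steps above can be sketched in plain Python as follows. This is an illustrative sketch under assumptions rather than the embodiment's actual code: samples are visited in index order instead of randomly, a sample counts as its own neighbor when the neighborhood size is compared with `u_min`, and the function names are hypothetical.

```python
import math

NOISE = -1  # label used for noise samples


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dbscan(samples, eps, u_min, distance=euclidean):
    """Return one cluster label per sample; NOISE marks noise samples."""
    labels = [None] * len(samples)

    def neighbours(i):
        # All samples within eps of sample i (including i itself).
        return [j for j in range(len(samples))
                if distance(samples[i], samples[j]) <= eps]

    cluster = -1
    for i in range(len(samples)):
        if labels[i] is not None:
            continue  # already visited
        neigh = neighbours(i)
        if len(neigh) < u_min:
            labels[i] = NOISE
            continue
        cluster += 1  # sample i is a core sample: open a new cluster
        labels[i] = cluster
        queue = [j for j in neigh if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border sample reached from a core sample
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neigh = neighbours(j)
            if len(j_neigh) >= u_min:
                queue.extend(j_neigh)  # j is also a core sample: keep growing
    return labels
```

With the empirical values given later in the description, a call would look like `dbscan(samples, eps=200, u_min=10)`.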

2) Determining the user groups according to the black sample concentration. If a user group obtained in the previous step contains only a few fraudulent users, subsequent model training on it is very difficult and is a poor choice in terms of training cost and effect, so such user groups need to be merged. The specific steps are as follows:

Step 1. Let the black sample concentration of each user group be $\rho$, i.e., the proportion of fraud samples in the user group:

$$\rho = \frac{|N_{bad}|}{|N|}$$

where $|N_{bad}|$ is the number of fraud samples in the user group and $|N|$ is the number of samples in the user group.

Step 2. Set thresholds $\gamma_1$ and $\gamma_2$, where $\gamma_1 < \gamma_2$. User groups with $\rho < \gamma_1$ are merged into a white user group, user groups with $\gamma_1 < \rho < \gamma_2$ are merged into a low-concentration fraud user group, and the remaining user groups are left unchanged.
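A minimal sketch of the merging rule follows. The input layout (one list of 0/1 labels per user group, with 1 marking a fraud sample) and the function names are assumptions made for illustration. The comparisons are kept strict, as in the text, so a group whose concentration equals a threshold exactly is left unchanged.

```python
def black_concentration(labels):
    """rho = |N_bad| / |N| for one user group (labels: 1 = fraud, 0 = normal)."""
    return sum(labels) / len(labels)


def merge_by_black_concentration(groups, gamma1, gamma2):
    """Split group ids into (white, low_concentration, unchanged).

    groups: {group_id: list of 0/1 labels}. Groups listed in `white` are to be
    merged into one white user group, groups in `low_concentration` into one
    low-concentration fraud user group; `unchanged` groups stay as they are.
    """
    white, low_concentration, unchanged = [], [], []
    for group_id, labels in groups.items():
        rho = black_concentration(labels)
        if rho < gamma1:
            white.append(group_id)
        elif gamma1 < rho < gamma2:
            low_concentration.append(group_id)
        else:
            unchanged.append(group_id)
    return white, low_concentration, unchanged
```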

3) Determining the aggregation reason of each user group using the FP-growth algorithm. The FP-growth algorithm adopts a divide-and-conquer strategy to mine frequent items and their association information. The specific execution steps are as follows:

First, scan the feature information in the user group once, find the frequent 1-itemsets, and arrange them in descending order of support count.

Second, based on the frequent 1-itemsets, scan the group user information again, construct an FP tree representing the associations among the group's itemsets, and recursively find all frequent itemsets on the FP tree.

Finally, generate strong association rules from all frequent itemsets; these rules are the association information of the user features, i.e., the user aggregation pattern.
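The first two stages (support counting, FP-tree construction, and recursive mining of frequent itemsets) can be sketched as below. This is a compact illustration, not the embodiment's code: each user is assumed to be represented as a list of discrete feature items (for example `"night_calls=high"`), the names are hypothetical, and the final generation of strong association rules from the frequent itemsets is not shown.

```python
from collections import Counter, defaultdict


class _Node:
    """One FP-tree node: an item, its count, and links to parent and children."""

    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def _build_tree(weighted_transactions, min_support):
    """Count supports, then insert each transaction in descending support order."""
    support = Counter()
    for items, count in weighted_transactions:
        for item in set(items):
            support[item] += count
    frequent = {i: s for i, s in support.items() if s >= min_support}
    root = _Node(None, None)
    header = defaultdict(list)  # item -> all tree nodes carrying that item
    for items, count in weighted_transactions:
        ordered = sorted((i for i in set(items) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = _Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += count
            node = child
    return header, frequent


def _fp_growth(weighted_transactions, min_support, suffix):
    header, frequent = _build_tree(weighted_transactions, min_support)
    result = {}
    for item, item_support in frequent.items():
        itemset = suffix | {item}
        result[itemset] = item_support
        # Conditional pattern base: prefix paths leading to each node of `item`.
        conditional = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.append((path, node.count))
        result.update(_fp_growth(conditional, min_support, itemset))
    return result


def frequent_itemsets(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    weighted = [(list(t), 1) for t in transactions]
    return _fp_growth(weighted, min_support, frozenset())
```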

Step three, fusing deep learning to generate feature representation:

1) Initializing the mapping matrix. For the user group $S$, let the number of user sample features be $m$ and the number of mapped features be $n$; the mapping matrix is $W \in \mathbb{R}^{m \times n}$, and by default its elements are random values drawn from a normal distribution with mean 0 and a specified standard deviation;

2) Constructing the fused deep learning network framework and generating the feature representation vector. Because a user's daily feature data form a time series, a deep learning network can mine deep features from the shallow features and extract more of the user sample information. Two deep learning networks, LSTM and CNN, are therefore selected, and they share an embedding layer (the mapping matrix $W$).

The LSTM (long short-term memory) network builds on the recurrent neural network and uses a gate mechanism and memory cells to control the flow of information, which addresses the long-term dependence problem; it shares parameters along the time dimension and can memorize and integrate information. The CNN (convolutional neural network) is formed by stacking and interleaving convolutional layers, pooling layers, and fully connected layers; it features local connectivity and weight sharing, and can extract local information and derive global information from it. The specific execution steps are as follows:

Step 1. For a user sample $x$, the user's daily sequence feature data $\{p_{x,t},\ t = w, \dots, 2, 1\}$ are mapped through the embedding layer into a dense vector sequence $\{q_{x,t},\ t = w, \dots, 2, 1\}$.

Step 2. The dense vector sequence $\{q_{x,t},\ t = w, \dots, 2, 1\}$ is input to the LSTM, which encodes the information to obtain a sequence of hidden layer units $\{h_{x,t},\ t = w, \dots, 2, 1\}$. The hidden layer unit $h_{x,t}$ depends not only on the input $q_{x,t}$ but also on the hidden layer unit $h_{x,t-1}$ and the memory cell $c_{x,t-1}$ of the previous moment, i.e., $h_{x,t} = f(q_{x,t}, h_{x,t-1}, c_{x,t-1}, U, V, b)$, where $f$ is the process function, $U$ and $V$ are the weight matrices, and $b$ is the bias term.

Step 3. The dense vector sequence $\{q_{x,t},\ t = w, \dots, 2, 1\}$ is also the input of the one-dimensional CNN. The convolution kernels slide and convolve only along the date direction, and the vector $s_x$ is obtained after pooling and zero padding, i.e., $s_x = g(\{q_{x,t}\}, \{M_i\}, b)$, where $g$ is the process function, $\{M_i\}$ is the sequence of convolution kernels, $i$ is the number of convolution kernels, and $b$ is the bias term.

Step 4. The vectors $h_{x,1}$ and $s_x$ are concatenated to obtain the feature representation vector of user sample $x$: $o_x = [h_{x,1}, s_x]$.
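The forward pass of Steps 1 to 4 can be sketched in plain Python as below. Several choices are assumptions made only for illustration: the standard LSTM gate equations stand in for the process function f, max pooling over days with zero padding at the end of the sequence stands in for g, the standard deviation used to initialize W is left as a caller-supplied parameter, and all function names are hypothetical. No training is shown.

```python
import math
import random


def rand_matrix(rows, cols, std, rng):
    """Matrix of normal random values with mean 0 and the given std."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def embed(p_seq, W):
    """Shared embedding layer: q = p W, with W of shape m x n."""
    n = len(W[0])
    return [[sum(p[i] * W[i][j] for i in range(len(p))) for j in range(n)]
            for p in p_seq]


def lstm_last_hidden(q_seq, U, V, b, hidden):
    """Run a standard LSTM over q_seq and return the final hidden state.

    U[k], V[k], b[k] are the input weights, recurrent weights and bias of
    gate k, for k in "i" (input), "f" (forget), "o" (output), "g" (candidate).
    """
    h, c = [0.0] * hidden, [0.0] * hidden
    for q in q_seq:
        pre = {k: [x + y + z for x, y, z in
                   zip(matvec(U[k], q), matvec(V[k], h), b[k])]
               for k in "ifog"}
        gate_i = [sigmoid(z) for z in pre["i"]]
        gate_f = [sigmoid(z) for z in pre["f"]]
        gate_o = [sigmoid(z) for z in pre["o"]]
        cand = [math.tanh(z) for z in pre["g"]]
        c = [f * ck + i * g for f, ck, i, g in zip(gate_f, c, gate_i, cand)]
        h = [o * math.tanh(ck) for o, ck in zip(gate_o, c)]
    return h


def conv1d_max_pool(q_seq, kernels, bias):
    """Slide each kernel (k x n) along the day axis only, then max-pool."""
    n = len(q_seq[0])
    out = []
    for M, b in zip(kernels, bias):
        k = len(M)
        padded = q_seq + [[0.0] * n] * (k - 1)  # zero padding at the end
        acts = [sum(M[r][j] * padded[t + r][j]
                    for r in range(k) for j in range(n)) + b
                for t in range(len(q_seq))]
        out.append(max(acts))
    return out


def feature_vector(p_seq, W, lstm_params, kernels, conv_bias, hidden):
    """o_x = [last LSTM hidden state, pooled CNN vector s_x]."""
    U, V, b = lstm_params
    q_seq = embed(p_seq, W)  # p_seq is ordered t = w, ..., 2, 1
    return lstm_last_hidden(q_seq, U, V, b, hidden) + conv1d_max_pool(
        q_seq, kernels, conv_bias)
```

The length of the returned vector is the LSTM hidden size plus the number of convolution kernels, because each kernel contributes one pooled value in this sketch.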

Step four, jointly training the neural network and the integrated model:

For a user group $S$, the feature representation vectors $o_x$, $x \in S$ are obtained through the fused deep learning network. To learn the feature representation and train the models, a neural network $NN_S$ and an integrated model $F_S$ are trained jointly. Let the losses of the two models on a user sample $x \in S$ be $l_{NN,x}$ and $l_{F,x}$ respectively; the final loss of the sample can then be set to $l_x = \alpha l_{NN,x} + (1-\alpha) l_{F,x}$, and the overall loss is formed from the sample losses $l_x$ over all $x \in S$.
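The weighted sample loss can be illustrated with the short sketch below. It assumes binary cross entropy as the per-model loss (the description names cross entropy as the empirical choice) and, as an assumption of this sketch only, averages the sample losses to form the overall loss of a group; the function names and the probability inputs are hypothetical.

```python
import math


def cross_entropy(y_true, p_pred, eps=1e-12):
    """Binary cross entropy for one sample; y_true is 1 (fraud) or 0 (normal)."""
    p = min(max(p_pred, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def sample_loss(y_true, p_nn, p_f, alpha=0.5):
    """l_x = alpha * l_NN,x + (1 - alpha) * l_F,x."""
    return (alpha * cross_entropy(y_true, p_nn)
            + (1.0 - alpha) * cross_entropy(y_true, p_f))


def group_loss(labels, nn_probs, f_probs, alpha=0.5):
    """Overall loss of a user group.

    Assumption: the sample losses are averaged; the aggregation could
    equally be a plain sum.
    """
    losses = [sample_loss(y, a, b, alpha)
              for y, a, b in zip(labels, nn_probs, f_probs)]
    return sum(losses) / len(losses)
```

With `alpha = 0.5` both models contribute equally, so a sample that one model scores confidently and the other scores poorly still receives a substantial loss.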

Step five, joint model prediction:

1) calculating a center point for each user group;

2) for the user sample x, calculating the distance between the user sample x and the central point of each user group, and taking the user group with the minimum distance as the home group of the sample x;

3) obtaining the feature representation of the sample according to the fused deep learning network framework of its home user group, and obtaining the prediction result of the sample according to the neural network and the integrated model of that group.
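The group assignment and prediction steps can be sketched as follows. The sketch assumes that the center point of a group is the arithmetic mean of its samples and that the distance is Euclidean; `models` is a hypothetical mapping from a group id to a callable standing in for that group's feature network, neural network, and integrated model, and `threshold` is likewise hypothetical.

```python
import math


def center(samples):
    """Center point of a user group, taken here as the per-feature mean."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def home_group(x, centers):
    """Return the id of the group whose center is nearest to sample x."""
    return min(centers, key=lambda g: math.dist(x, centers[g]))


def predict(x, centers, models, threshold=0.5):
    """Route x to its home group and score it with that group's model.

    Returns (group id, fraud score, whether the score reaches the threshold).
    """
    group = home_group(x, centers)
    score = models[group](x)
    return group, score, score >= threshold
```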

Further, the specific settings of the preferred embodiment provided by the present invention may be:

1. The number of days of user telecommunication data acquired is empirically set to $w = 90$.

2. The available user features include the number of outgoing calls, the outgoing roaming call time, the number of outgoing call peers, the number of short messages received, the internet traffic, the number of internet accesses, and so on, 254 features in total.

3. The sample distance between users is the Euclidean distance.

4. The sample distance threshold $\varepsilon$ and the user number threshold $u_{min}$ are set empirically to $\varepsilon = 200$ and $u_{min} = 10$.

5. The black sample concentration thresholds are set empirically to $\gamma_1 = 0.01$ and $\gamma_2 = 0.05$.

6. The number of features after mapping is set empirically to $n = 100$.

7. The number of convolution kernels is set empirically to $i = 5$.

8. The integrated model $F_S$ is empirically chosen to be LightGBM.

9. The parameter $\alpha$ is set empirically to 0.5.

10. The sample loss is empirically defined as the cross entropy.
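For reference, the empirical settings listed above can be gathered into a single configuration object. The class and field names below are hypothetical; only the default values come from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FraudModelConfig:
    window_days: int = 90            # w: days of telecommunication data
    num_features: int = 254          # m: available user features
    distance: str = "euclidean"      # sample distance between users
    eps: float = 200.0               # DBSCAN distance threshold
    u_min: int = 10                  # DBSCAN user number threshold
    gamma1: float = 0.01             # lower black sample concentration threshold
    gamma2: float = 0.05             # upper black sample concentration threshold
    mapped_features: int = 100       # n: features after mapping
    num_kernels: int = 5             # i: number of convolution kernels
    integrated_model: str = "lightgbm"
    alpha: float = 0.5               # weight of the neural network loss
    loss: str = "cross_entropy"
```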

As can be seen from the above, the telecommunication fraud identification method of the embodiment of the present invention, based on clustering-based user grouping and a fused deep learning model, combines telecommunication data such as users' calls, short messages, and internet traffic. User groups are determined by the clustering method and the black sample concentration; for each user group, the feature representation is obtained through fused deep learning; the neural network and the integrated model are trained jointly; and the sample loss is redefined. This effectively addresses the problems of manual operation, small coverage, and low accuracy, enables timely tracking and locating of fraud numbers, and improves the efficiency and performance of telecommunication fraud identification. In addition, the beneficial effects of the embodiment of the present invention are as follows:

1. the concept and method of user grouping are introduced into the field of telecommunication fraud identification; user groups are obtained using a clustering method and the black sample concentration, which effectively addresses the problem of identifying multiple fraud modes and makes the modeling refined and process-oriented;

2. LSTM and CNN are fused in the feature representation learning process to obtain richer and more complete sample information, and the feature representation is applied to model training, which effectively improves the training effect of the model;

3. the sample loss is newly defined and the neural network and the integrated model are trained jointly, which effectively addresses the low prediction accuracy of a single model and avoids the problems of manual work and poor timeliness.

Example 2

According to another aspect of the embodiment of the present invention, there is also provided an apparatus for identifying a telecommunication fraud event. FIG. 5 is a schematic diagram of the apparatus for identifying a telecommunication fraud event according to the embodiment of the present invention. As shown in FIG. 5, the apparatus includes: a first obtaining module 51, an analysis module 53, a first determining module 55, and a second determining module 57. The apparatus is described below.

a first obtaining module 51, configured to obtain target telecommunication data, where the target telecommunication data includes at least one of: call behavior data, short message behavior data, and internet traffic data;

an analysis module 53, configured to analyze the target telecommunication data to obtain a target feature sequence in the target telecommunication data;

a first determining module 55, configured to determine, through an information prediction model, a telecommunication data identification result corresponding to the target feature sequence, where the information prediction model is obtained by machine learning training using multiple sets of training data, and each set of training data includes: a historical feature sequence and the historical telecommunication data identification result corresponding to the historical feature sequence, the historical feature sequence being obtained by analyzing historical telecommunication data;

a second determining module 57, configured to determine whether the target telecommunication data has a fraud risk based on the telecommunication data identification result.

It should be noted here that the first obtaining module 51, the analysis module 53, the first determining module 55, and the second determining module 57 correspond to steps S102 to S108 in embodiment 1; the examples and application scenarios implemented by the modules are the same as those of the corresponding steps, but are not limited to the disclosure of embodiment 1. It should also be noted that the above modules, as part of the apparatus, may be executed in a computer system, for example as a set of computer-executable instructions.

As can be seen from the above, in the embodiment of the present invention, the first obtaining module 51 first obtains target telecommunication data, where the target telecommunication data includes at least one of: call behavior data, short message behavior data, and internet traffic data. The analysis module 53 then analyzes the target telecommunication data to obtain a target feature sequence in the target telecommunication data. The first determining module 55 then determines, through an information prediction model, the telecommunication data identification result corresponding to the target feature sequence, where the information prediction model is obtained by machine learning training using multiple sets of training data, and each set of training data includes: a historical feature sequence and the historical telecommunication data identification result corresponding to the historical feature sequence, the historical feature sequence being obtained by analyzing historical telecommunication data. Finally, the second determining module 57 determines whether the target telecommunication data has a fraud risk based on the telecommunication data identification result. The apparatus for identifying a telecommunication fraud event provided by the embodiment of the present invention thus determines whether a telecommunication fraud risk exists based on the target telecommunication data and the information prediction model, which makes telecommunication fraud identification more intelligent and addresses the technical problems in the related art that the means of identifying telecommunication fraud have a low degree of intelligence and small coverage and that the identification models have low accuracy.
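The cooperation of the four modules can be pictured as a simple pipeline. The sketch below is only an illustration of that data flow: the class name, field names, and callable signatures are hypothetical, and each field stands in for one of the modules 51, 53, 55, and 57.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

FeatureSequence = Sequence[Sequence[float]]


@dataclass
class FraudIdentificationDevice:
    obtain: Callable[[], Any]                    # first obtaining module 51
    analyze: Callable[[Any], FeatureSequence]    # analysis module 53
    predict: Callable[[FeatureSequence], float]  # first determining module 55
    decide: Callable[[float], bool]              # second determining module 57

    def run(self) -> bool:
        """Return True when the target telecommunication data has a fraud risk."""
        data = self.obtain()
        sequence = self.analyze(data)
        result = self.predict(sequence)
        return self.decide(result)
```

Keeping each module as an injected callable mirrors the statement that the division into modules is logical and may be realized differently in practice.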

Optionally, the analysis module includes: a first feature summarizing unit, configured to perform feature summarization on the target telecommunication data according to a predetermined dimension to obtain feature data of the target telecommunication data; and a filtering unit, configured to filter the feature data of the target telecommunication data to obtain the target feature sequence.

Optionally, the telecommunication fraud event identification device further comprises: the second acquisition module is used for acquiring historical telecommunication data in a historical time period before the telecommunication data identification result corresponding to the target characteristic sequence is determined through the information prediction model; the characteristic summarizing module is used for summarizing the characteristics of the historical telecommunication data according to the preset dimensionality to obtain the historical characteristic data of the historical telecommunication data; the integration module is used for integrating the historical characteristic data according to a preset time unit to obtain a plurality of groups of historical characteristic data sequences; the filtering module is used for filtering each group of the multiple groups of historical characteristic data sequences to obtain multiple historical characteristic sequences; the third determining module is used for determining historical telecommunication data recognition results corresponding to the plurality of historical characteristic sequences to obtain a plurality of groups of training data; and the training module is used for training the preset network model by utilizing a plurality of groups of training data to obtain an information prediction model.

Optionally, the feature summarizing module includes: the clustering unit is used for clustering a plurality of historical data acquisition objects based on historical telecommunication data to obtain a group fraud mode; and the second characteristic summarizing unit is used for summarizing the characteristics of the historical telecommunication data based on the group fraud mode and the preset dimensionality to obtain the historical characteristic data of the historical telecommunication data.

Optionally, the clustering unit includes: a determination subunit configured to determine a target historical data acquisition object of the plurality of historical data acquisition objects; the searching subunit is used for searching all adjacent historical data acquisition objects of the target historical data acquisition object; the traversing subunit is used for traversing the adjacent historical data acquisition objects to obtain a historical data acquisition object group; the acquisition subunit is used for acquiring preset characteristic information of the historical data acquisition object group; and the clustering subunit is used for clustering the historical data acquisition objects based on the preset characteristic information to obtain a group fraud mode.

Optionally, the second determining module includes: a generating unit, configured to generate prompt information when it is determined, based on the telecommunication data identification result, that the target telecommunication data has a fraud risk.

Example 3

According to another aspect of the embodiments of the present invention, there is further provided a computer-readable storage medium including a stored computer program, wherein the computer program, when executed by a processor, controls the device in which the computer-readable storage medium is located to perform any one of the above methods for identifying a telecommunication fraud event.

Example 4

According to another aspect of the embodiment of the present invention, there is further provided a processor for executing the computer program, wherein the computer program executes the method for identifying a telecommunication fraud event of any one of the above.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A method for identifying a telecommunication fraud event, comprising:

obtaining target telecommunication data, wherein the target telecommunication data includes at least one of: call behavior data, short message behavior data, and internet traffic data;

analyzing the target telecommunication data to obtain a target characteristic sequence in the target telecommunication data;

determining a telecommunication data identification result corresponding to the target feature sequence through an information prediction model, wherein the information prediction model is obtained by machine learning training using multiple sets of training data, and each set of training data in the multiple sets of training data comprises: a historical feature sequence and a historical telecommunication data identification result corresponding to the historical feature sequence, wherein the historical feature sequence is obtained by analyzing historical telecommunication data;

determining whether there is a fraud risk for the target telecommunications data based on the telecommunications data identification result.

2. The method of claim 1, wherein analyzing the target telecommunication data to obtain a target feature sequence in the target telecommunication data comprises:

carrying out feature summarization on the target telecommunication data according to a preset dimension to obtain feature data of the target telecommunication data;

and filtering the characteristic data of the target telecommunication data to obtain the target characteristic sequence.

3. The method of claim 1, wherein the call behavior data and the short message behavior data in the target telecommunication data each comprise at least one of: time, number of times, call type, roaming type, peer number, and home location; and the internet traffic data in the target telecommunication data comprises at least one of: internet access time, internet access duration, number of internet accesses, and internet traffic bytes.

4. The method of claim 2, wherein prior to determining, by an information prediction model, a telecommunication data recognition result corresponding to the target feature sequence, the method further comprises:

acquiring historical telecommunication data in a historical time period;

performing feature summarization on the historical telecommunication data according to the preset dimension to obtain historical feature data of the historical telecommunication data;

integrating the historical characteristic data according to a preset time unit to obtain a plurality of groups of historical characteristic data sequences;

filtering each group of the multiple groups of historical characteristic data sequences to obtain multiple historical characteristic sequences;

determining historical telecommunication data identification results corresponding to the plurality of historical characteristic sequences to obtain the plurality of groups of training data;

and training a preset network model by using the multiple groups of training data to obtain the information prediction model.

5. The method of claim 4, wherein performing a feature summarization on the historical telecommunication data according to the predetermined dimension to obtain historical feature data of the historical telecommunication data comprises:

clustering a plurality of historical data collection objects based on the historical telecommunication data to obtain a group fraud mode;

and performing feature summarization on the historical telecommunication data based on the group fraud mode and the predetermined dimensionality to obtain historical feature data of the historical telecommunication data.

6. The method as recited in claim 5, wherein clustering historical data collection objects based on said historical telecommunications data, resulting in group fraud patterns, comprises:

determining a target historical data acquisition object of the plurality of historical data acquisition objects;

searching all adjacent historical data acquisition objects of the target historical data acquisition object;

traversing the adjacent historical data acquisition objects to obtain a historical data acquisition object group;

acquiring preset characteristic information of the historical data acquisition object group;

clustering the historical data collection objects based on the predetermined characteristic information to obtain the group fraud mode.

7. The method as recited in any one of claims 1 to 6, wherein determining whether there is a fraud risk for said target telecommunications data based on said telecommunications data identification result comprises:

generating a prompt message upon determining that the target telecommunications data is at a fraud risk based on the telecommunications data identification result.

8. An apparatus for identifying a telecommunication fraud event, comprising:

a first obtaining module, configured to obtain target telecommunication data, where the target telecommunication data includes at least one of: call behavior data, short message behavior data, and internet traffic data;

the analysis module is used for analyzing the target telecommunication data to obtain a target characteristic sequence in the target telecommunication data;

a first determining module, configured to determine, through an information prediction model, a telecommunication data identification result corresponding to the target feature sequence, where the information prediction model is obtained by machine learning training using multiple sets of training data, and each set of training data in the multiple sets of training data includes: a historical feature sequence and a historical telecommunication data identification result corresponding to the historical feature sequence, wherein the historical feature sequence is obtained by analyzing historical telecommunication data;

a second determining module for determining whether there is a fraud risk for the target telecommunications data based on the telecommunications data identification result.

9. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored computer program, wherein when the computer program is executed by a processor, a device in which the computer-readable storage medium is located is controlled to execute the method for identifying a telecommunication fraud event recited in any one of claims 1 to 7.

10. A processor, characterized in that the processor is configured to run a computer program, wherein the computer program is configured to execute the method for identifying a telecommunication fraud event of any of the above claims 1 to 7 when running.