CN112667863B - Financial fraud group identification method based on hypergraph segmentation

Info

Publication number: CN112667863B
Application number: CN202110058766.5A
Authority: CN
Inventors: 张涛; 张宗旺
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2021-01-16
Filing date: 2021-01-16
Publication date: 2024-02-02
Anticipated expiration: 2041-01-16
Also published as: CN112667863A

Abstract

The invention discloses a financial fraud group identification method based on hypergraph segmentation, which decomposes the hypergraph segmentation process into eight subprocesses (steps S1 to S8): feature extraction; data normalization; data cleaning; data storage; index selection and call network construction; adjacency matrix construction; non-negative matrix factorization; and result acquisition. The invention is built on the hypergraph concept, in which one edge can connect multiple nodes. This property lets the model approximate the globally optimal solution, whereas most current group fraud identification schemes rely on traditional-graph algorithms and can only reach a locally optimal solution, even though group detection depends more heavily on the global optimum. To handle high-dimensional computation, a hypergraph regularization term is added to the loss function of the non-negative matrix factorization, so that high-dimensional information can be encoded and iteration efficiency is improved.

Description

Financial fraud group identification method based on hypergraph segmentation

Technical Field

The invention belongs to the fields of financial anti-fraud and machine learning, and relates to an effective method for identifying financial fraud groups.

Background

In the field of internet finance, fraud is the leading cause of losses for lending institutions. Research has found that credit fraud is often committed by groups, and that the members of such a group are necessarily directly linked to one another.

Analyzing customers' social behavior through telecom operator data is a relatively feasible and effective way to trace fraud groups. However, operator communication data is very large and general statistical methods cannot analyze it effectively, so machine learning techniques are used to partition the customer population and find the fraud groups. A hypergraph goes beyond the ordinary graph, which describes only binary relationships: one hyperedge can contain multiple vertices, which makes the hypergraph better suited to describing multi-way relationships. So far, no efficient hypergraph-based community segmentation method has been applied to the field of financial anti-fraud.

A hypergraph is a generalized graph in which one edge can connect any number of vertices, whereas an edge of an ordinary graph connects exactly two vertices. When an ordinary graph is extended to a hypergraph, the relationships among the nodes become higher-order. Non-negative matrix factorization offers two benefits here: first, it projects the network into a latent low-dimensional space in which the result has a discriminative representation; second, it makes the computation efficient and feasible.
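To make the difference concrete, a hypergraph can be stored as a vertex-by-hyperedge incidence matrix. The sketch below is only an illustration using NumPy; the five contacts and the two hyperedges are hypothetical toy data, not taken from the invention.

```python
import numpy as np

# Hypothetical toy data: 5 contacts (vertices 0..4) and 2 hyperedges,
# each hyperedge joining three contacts at once.
hyperedges = [{0, 1, 2}, {2, 3, 4}]
n_vertices = 5

# Incidence matrix: incidence[v, e] == 1 when vertex v belongs to hyperedge e.
incidence = np.zeros((n_vertices, len(hyperedges)), dtype=int)
for e, members in enumerate(hyperedges):
    for v in members:
        incidence[v, e] = 1

# In an ordinary graph every column would sum to exactly 2;
# in a hypergraph a column may sum to any positive number.
edge_sizes = incidence.sum(axis=0)

# Pairwise co-membership counts (the "clique expansion" of the hypergraph):
# entry (i, j) is the number of hyperedges shared by vertices i and j.
pairwise = incidence @ incidence.T
np.fill_diagonal(pairwise, 0)
```

The clique expansion discards the information that three contacts were joined by a single hyperedge, which is the kind of higher-order relationship a hypergraph keeps.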

Disclosure of Invention

The technical problem to be solved by the invention is to provide a financial fraud group identification method based on hypergraph segmentation that addresses two problems: traditional methods, because they use ordinary graphs, are limited to locally optimal solutions; and computation on a higher-order information network must be made efficient. An information network is constructed from the call records between individuals, and the network is clustered by a hypergraph segmentation method. Because group fraud is characterized by aggregation, the clusters yield the group identification result.

In order to solve the problems, the invention adopts the following technical scheme:

A financial fraud group identification method based on hypergraph segmentation comprises the following steps:

step S1, feature extraction: factors such as contact ID, call duration and call count are extracted from the raw call-record data and assembled into a JSON string, which simplifies subsequent data processing;

step S2, data normalization processing: two rules are applied. Under the first rule, a record whose caller ID is the same as its callee ID does not meet the specification and is deleted. Under the second rule, a record is deleted if the end time minus the start time does not equal the call duration, or if the end time is earlier than the start time;

step S3, data cleaning: call records involving harassment, express delivery, meal delivery, promotion, invalid numbers, service numbers and the like are deleted. Such interfering records can cause unrelated contact IDs to be clustered together directly, which distorts the group identification result;

step S4, data storage: the cleaned data is stored in JanusGraph, which facilitates system-level development;

step S5, index selection and call network construction: contact IDs are taken as the nodes of the network and call associations as its edges, and indexes such as call count, duration, information entropy and time interval are selected to complete the construction of the call information network;

step S6, adjacency matrix construction: weights are calculated from the indexes in S5, and the adjacency matrix of the undirected graph is constructed from these weights, where each weight lies in $[0, 1]$;

step S7, non-negative matrix factorization: the adjacency matrix $A$ from S6 is decomposed as $A \approx WH^{T}$, where [formula missing in source]. Unlike a traditional graph, in which one edge has only two nodes, all nodes of the hypergraph lie in a high-dimensional space; a hypergraph regularization term is added to encode the higher-order relations in the non-negative matrix factorization, and the loss function is: [formula missing in source], where $k$ is the number of communities; $a_{ij}$ is the probability that node $v_i$ and node $v_j$ are connected; $w_{il}$ and $h_{il}$ are the probabilities that the in-degree and out-degree of $v_i$, respectively, belong to community $l$; $W = [w_{il}] \in \mathbb{R}^{n \times k}$ and $H = [h_{il}] \in \mathbb{R}^{n \times k}$ are non-negative matrices; $z_{ij}$ describes the relationship between node $v_i$ and node $v_j$, and $Z = [z_{ij}] \in \mathbb{R}^{n \times n}$ is reliable prior information (if $v_i$ and $v_j$ are unrelated, $z_{ij} = 0$; if $v_i$ and $v_j$ belong to the same community, $z_{ij} = 1$ and $h_i$ and $h_j$ are approximately equal); and $\lambda$ is the adjustment parameter between the regularization term and the loss function.

step S8, result acquisition: the loss function in S7 is iterated using the update rules for $W$ and $H$ until it converges, and the result of the community detection algorithm is obtained according to [formula missing in source]. Because fraud groups cluster in the call network, the clustered nodes are identified as members of a fraud group.
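Steps S7 and S8 can be sketched as follows. This is a minimal illustration, not the patented procedure: it assumes the loss has the common graph-regularized form ‖A − WHᵀ‖²_F + λ·tr(Hᵀ(D − Z)H), where D is the diagonal matrix of the row sums of Z; it assumes standard multiplicative update rules for W and H; and it assumes each node is assigned to the community with the largest entry in its row of H. The function name regularized_nmf, the toy matrices and all parameter values are hypothetical.

```python
import numpy as np

def regularized_nmf(A, Z, k, lam=0.1, n_iter=500, seed=0, eps=1e-10):
    """Factor A ~ W @ H.T with W, H >= 0 under an ASSUMED regularized loss:
    ||A - W H^T||_F^2 + lam * tr(H^T (D - Z) H), with D = diag(row sums of Z)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    W = rng.random((n, k)) + 0.1      # strictly positive starting point
    H = rng.random((n, k)) + 0.1
    D = np.diag(Z.sum(axis=1))        # degree matrix of the prior Z
    laplacian = D - Z

    def loss():
        return np.sum((A - W @ H.T) ** 2) + lam * np.trace(H.T @ laplacian @ H)

    history = [loss()]
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        W *= (A @ H) / (W @ (H.T @ H) + eps)
        H *= (A.T @ W + lam * (Z @ H)) / (H @ (W.T @ W) + lam * (D @ H) + eps)
        history.append(loss())

    labels = H.argmax(axis=1)         # assumed assignment rule for S8
    return W, H, labels, history

# Hypothetical call network: two tightly connected triples joined by one weak edge.
A = np.array([
    [0.0, 0.9, 0.8, 0.0, 0.0, 0.0],
    [0.9, 0.0, 0.7, 0.0, 0.0, 0.0],
    [0.8, 0.7, 0.0, 0.05, 0.0, 0.0],
    [0.0, 0.0, 0.05, 0.0, 0.9, 0.8],
    [0.0, 0.0, 0.0, 0.9, 0.0, 0.7],
    [0.0, 0.0, 0.0, 0.8, 0.7, 0.0],
])
# Hypothetical prior: nodes 0 and 1 are known to share a community, as are 3 and 4.
Z = np.zeros((6, 6))
Z[0, 1] = Z[1, 0] = 1.0
Z[3, 4] = Z[4, 3] = 1.0

W, H, labels, history = regularized_nmf(A, Z, k=2)
```

Here history records the loss before the first update and after each iteration, which gives a simple way to check convergence before the labels are read off.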

Preferably, the information such as indexes used by the model is obtained by carrying out feature development on the original data;

preferably, the research result is applied to the development of an actual system through the construction of a call information network, and plays a key role in the construction of an anti-fraud engine;

preferably, in contrast to traditional graphs, whose edges join pairs of nodes, a hypergraph in which one edge joins multiple nodes is adopted, so that the algorithm's results are more accurate;

preferably, the calculation is made feasible by means of non-negative matrix factorization;

preferably, a hypergraph regularization term is added to the loss function to encode the high-dimensional information, which greatly improves the efficiency of the hypergraph-based segmentation algorithm.

The above group fraud identification method based on hypergraph segmentation is decoupled into several sub-processes. Through task encapsulation, allocation and flow control, it supports model iteration and engine development in big-data scenarios, and can intercept more than 99% of group fraud.

The technical scheme of the invention is implemented as follows: the hypergraph segmentation process is broken down into eight sub-processes (steps S1 to S8): feature extraction; data normalization; data cleaning; data storage; index selection and call network construction; adjacency matrix construction; non-negative matrix factorization; and result acquisition.
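The two record-level sub-processes, data normalization (S2) and data cleaning (S3), can be sketched as simple filters. The record layout, the field names, the timestamp format and the blocklist below are assumptions made for illustration; the source does not specify them.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

# Hypothetical call records; field names are illustrative only.
records = [
    {"caller": "A", "callee": "B", "start": "2021-01-01 10:00:00",
     "end": "2021-01-01 10:02:00", "duration": 120},   # valid
    {"caller": "A", "callee": "A", "start": "2021-01-01 11:00:00",
     "end": "2021-01-01 11:01:00", "duration": 60},    # caller equals callee
    {"caller": "B", "callee": "C", "start": "2021-01-01 12:00:00",
     "end": "2021-01-01 12:01:00", "duration": 90},    # duration mismatch
    {"caller": "C", "callee": "D", "start": "2021-01-01 13:05:00",
     "end": "2021-01-01 13:00:00", "duration": 300},   # end before start
    {"caller": "D", "callee": "SVC1", "start": "2021-01-01 14:00:00",
     "end": "2021-01-01 14:00:30", "duration": 30},    # service number
]

# Hypothetical blocklist of harassment, courier, promotion and service numbers.
BLOCKLIST = {"SVC1"}

def is_well_formed(rec):
    """S2: apply the two normalization rules."""
    if rec["caller"] == rec["callee"]:
        return False
    start = datetime.strptime(rec["start"], FMT)
    end = datetime.strptime(rec["end"], FMT)
    if end < start:
        return False
    return (end - start).total_seconds() == rec["duration"]

def is_relevant(rec, blocklist):
    """S3: drop records that involve an interfering number."""
    return rec["caller"] not in blocklist and rec["callee"] not in blocklist

cleaned = [r for r in records if is_well_formed(r) and is_relevant(r, BLOCKLIST)]
```

Keeping the two checks as separate predicates mirrors the decoupling into sub-processes: each filter can be tested and replaced on its own.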

Compared with the prior art, the invention has the following obvious advantages and beneficial effects:

(1) The invention is built on the hypergraph concept, in which one edge can connect multiple nodes. This property lets the model approximate the globally optimal solution, whereas most current group fraud identification schemes rely on traditional-graph algorithms and can only reach a locally optimal solution, even though group detection depends more heavily on the global optimum.

(2) To handle high-dimensional computation, a hypergraph regularization term is added to the loss function of the non-negative matrix factorization, so that high-dimensional information can be encoded and iteration efficiency is improved.

Drawings

FIG. 1 is a detailed flow chart of the method according to the invention.

FIG. 2 is a schematic diagram of encoding node information from the high-dimensional space of the hypergraph into a new space according to the present invention.

FIG. 3 is a flow chart of an embodiment implementation of model construction and specific practice.

Detailed Description

The present invention will be described in detail below with reference to the drawings and examples.

The technical scheme adopted by the invention is a financial fraud group identification method based on hypergraph segmentation, which comprises the following steps. Step S1, feature extraction: factors such as contact ID, call duration and call count are extracted from the raw call-record data and assembled into a JSON string;

step S3, data cleaning: call records involving harassment, express delivery, meal delivery, promotion, invalid numbers, service numbers and the like are deleted;

step S6, adjacency matrix construction: weights are calculated from the indexes in S5 using the analytic hierarchy process (AHP), and the adjacency matrix of the undirected graph is constructed from these weights;

step S7, non-negative matrix factorization: the adjacency matrix $A$ from S6 is decomposed as $A \approx WH^{T}$, where [formula missing in source]. Unlike a traditional graph, in which one edge has only two nodes, all nodes of the hypergraph lie in a high-dimensional space; a hypergraph regularization term is added to encode the higher-order relations in the non-negative matrix factorization, and the loss function is: [formula missing in source]

step S8, result acquisition: the loss function in S7 is iterated using the update rules for $W$ and $H$ until it converges, and the result of the community detection algorithm is obtained according to [formula missing in source].
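Step S6 above derives the edge weights through AHP. A minimal sketch of one way this could work is given below: the index weights are taken as the normalized principal eigenvector of a pairwise judgement matrix, each index is min-max scaled across edges, and the weighted sum becomes the edge weight in [0, 1]. The judgement matrix, the three chosen indexes and the per-edge values are hypothetical, and the scaling scheme is an assumption rather than something stated in the source.

```python
import numpy as np

def ahp_weights(judgement):
    """Return AHP priorities: the principal eigenvector, normalized to sum to 1."""
    values, vectors = np.linalg.eig(judgement)
    principal = np.real(vectors[:, np.argmax(np.real(values))])
    return principal / principal.sum()

# Hypothetical, perfectly consistent judgement matrix for three indexes
# (call count, total duration, information entropy): entry (i, j) states
# how much more important index i is than index j.
target = np.array([0.5, 0.3, 0.2])
judgement = target[:, None] / target[None, :]
weights = ahp_weights(judgement)

# Hypothetical per-edge index values: (node i, node j) -> [count, duration, entropy].
edges = {
    (0, 1): [12, 3600, 0.9],
    (1, 2): [3, 200, 0.4],
    (0, 2): [1, 30, 0.1],
}

# Min-max scale each index across all edges so that every value lies in [0, 1].
raw = np.array(list(edges.values()), dtype=float)
span = raw.max(axis=0) - raw.min(axis=0)
span[span == 0] = 1.0                 # avoid division by zero for constant columns
scaled = (raw - raw.min(axis=0)) / span

# Symmetric adjacency matrix of the undirected call graph.
n_nodes = 3
A = np.zeros((n_nodes, n_nodes))
for (i, j), row in zip(edges, scaled):
    A[i, j] = A[j, i] = row @ weights
```

Because the weights sum to 1 and every scaled index lies in [0, 1], each entry of A also lies in [0, 1], matching the range required in step S6.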

Finally, it should be noted that: the above examples are only for illustrating the invention and are not intended to limit the technical solutions described by the invention; thus, while the invention has been described in detail with reference to the examples set forth above, it will be appreciated by those skilled in the art that modifications and equivalents may be made thereto; all technical solutions and modifications thereof that do not depart from the spirit and scope of the invention are intended to be covered by the scope of the appended claims.

Claims

1. A financial fraud group identification method based on hypergraph segmentation, characterized by comprising the following steps:

step S1, feature extraction: the contact ID, call duration and call count factors are extracted from the raw call-record data and assembled into a JSON string, which simplifies subsequent data processing;

step S3, data cleaning: harassment, express delivery, meal delivery, promotion, invalid-number and service-number call records are deleted; such interfering records can cause unrelated contact IDs to be clustered together directly, which distorts the group identification result;

step S5, index selection and call network construction: contact IDs are taken as the nodes of the network and call associations as its edges, and the call count, duration, information entropy and time interval indexes are selected to complete the construction of the call information network;

step S6, construction of an adjacency matrix $A$: weights are calculated from the indexes in S5, and the adjacency matrix of the undirected graph is constructed from these weights, where each weight lies in $[0, 1]$;

step S7, non-negative matrix factorization: the adjacency matrix $A$ from S6 is decomposed as $A \approx WH^{T}$, where [formula missing in source]; all nodes of the hypergraph lie in a high-dimensional space, a hypergraph regularization term is added to encode the higher-order relations in the non-negative matrix factorization, and the loss function is: [formula missing in source], where $k$ is the number of communities; $a_{ij}$ is the probability that node $v_i$ and node $v_j$ are connected; $w_{il}$ and $h_{il}$ are the probabilities that the in-degree and out-degree of $v_i$, respectively, belong to community $l$; $W = [w_{il}] \in \mathbb{R}^{n \times k}$ and $H = [h_{il}] \in \mathbb{R}^{n \times k}$ are non-negative matrices; $z_{ij}$ describes the relationship between node $v_i$ and node $v_j$, and $Z = [z_{ij}] \in \mathbb{R}^{n \times n}$ is reliable prior information: if $v_i$ and $v_j$ are unrelated, $z_{ij} = 0$; if $v_i$ and $v_j$ belong to the same community, $z_{ij} = 1$ and $h_i$ and $h_j$ are approximately equal; $\lambda$ is the adjustment parameter between the regularization term and the loss function;

step S8, result acquisition: the loss function in S7 is iterated using the update rules for $W$ and $H$ until it converges, and the result of the community detection algorithm is obtained according to [formula missing in source]; because fraud groups cluster in the call network, the clustered nodes are identified as members of a fraud group.

2. The method for identifying financial fraud partners based on hypergraph segmentation as defined in claim 1, wherein: and obtaining index information used by the financial fraud partner identification method based on hypergraph segmentation by carrying out feature development on the original data.

3. The method for identifying financial fraud groups based on hypergraph segmentation as defined in claim 1, wherein: the construction of the call information network plays a key role in the construction of an anti-fraud engine.

4. The method for identifying financial fraud groups based on hypergraph segmentation as defined in claim 1, wherein: in contrast to traditional graphs, whose edges join pairs of nodes, a hypergraph in which one edge joins multiple nodes is adopted.