CN111833174A

CN111833174A - Internet financial application anti-fraud identification method based on LOF algorithm

Info

Publication number: CN111833174A
Application number: CN202010493203.4A
Authority: CN
Inventors: 江远强
Original assignee: Baiweijinke Shanghai Information Technology Co ltd
Current assignee: Baiweijinke Shanghai Information Technology Co ltd
Priority date: 2020-06-03
Filing date: 2020-06-03
Publication date: 2020-10-27

Abstract

The invention provides an Internet financial application anti-fraud identification method based on the LOF algorithm, which comprises: collecting data and preprocessing the data; selecting data features to obtain a data set for the LOF algorithm and randomly dividing the data set into different data subsets; calculating the local reachable distance, the local reachable density and the local outlier factor (LOF) value of each data point; and using the LOF value to determine whether the data point is an outlier and hence whether the application behavior is fraudulent. By implementing the technical scheme of the invention, the running time of outlier detection is effectively shortened, the efficiency of outlier detection on high-dimensional large data sets is improved, Internet application behaviors can be monitored in real time, abnormal fraudulent applications can be detected promptly and accurately, credit loss is reduced, and the method is better suited to current big-data risk control requirements.

Description

Internet financial application anti-fraud identification method based on LOF algorithm

Technical Field

The invention relates to the technical field of risk control in the Internet financial industry, and in particular to a risk control system.

Background

With the development of Internet finance, fraud by gray-market and black-market operators takes more and more forms. According to incomplete statistics, the losses caused by fraud can reach 500 to 1000 billion every year, and fraud risk has become a key concern of Internet financial risk prevention. Statistically, fraudulent behavior is an outlier relative to normal behavior: in a scatter plot of the data, its attribute values lie far from the other data points and deviate significantly from expected or common values. Outlier detection is therefore a common method for financial anti-fraud, and detecting fraud effectively and with high probability has become the main anti-fraud task of large financial institutions.

In the prior art, there are three main kinds of outlier detection methods: statistics-based methods (such as the histogram-based outlier score, HBOS), distance-based methods (such as K nearest neighbors, KNN), and clustering-based methods (such as K-means and DBSCAN). However, these algorithms are complex, computationally expensive, high in time complexity and low in precision, and their detection efficiency on high-dimensional, large data sets is low. How to reduce the amount of computation and the running time of outlier detection has become a technical problem to be solved urgently.

The LOF (Local Outlier Factor) algorithm is a density-based anomaly detection method. It introduces the concepts of the reachable distance and the reachable density of each data object to judge whether the object is an outlier, and calculates a local outlier factor (LOF) for each object in a data set to reflect its degree of abnormality. Because the LOF algorithm calculates density from the k-th neighborhood of a point, it mines outliers only in the boundary units where outliers are likely to appear rather than performing a global calculation, and it can find outliers accurately even when the sample space data is not uniformly distributed. This effectively reduces the volume of data to be examined, the amount of computation and the running time, gives higher detection efficiency on high-dimensional, large data sets, and better suits current big-data risk control requirements.

Disclosure of Invention

In order to solve the technical problem, the invention discloses an internet financial application anti-fraud identification method based on an LOF algorithm, and the technical scheme of the invention is implemented as follows:

An Internet financial application anti-fraud identification method based on an LOF algorithm comprises the following steps: step one: collecting the operation tracking (buried-point) data submitted by a customer on the client, personal basic information, and third-party data authorized by the customer; step two: data preprocessing, including abnormal value processing and normalization processing; step three: selecting data features according to the behavior feature types of credit fraud to obtain a data set for the LOF algorithm, and randomly dividing the data set into different data subsets; step four: based on the data subset, calculating the k-th distance neighborhood of an object p in the data subset through the LOF algorithm, and then calculating the local reachable distance of the object p; step five: calculating the local reachable density of the object p from the local reachable distance; step six: calculating the local outlier factor (LOF) value of the object p from the local reachable density; step seven: repeating steps one to six recursively, wherein in each loop the obtained LOF value is compared with a set threshold ψ, an object whose LOF value is smaller than ψ is judged to be a normal point and is removed, and an object whose LOF value is larger than ψ is judged to be an abnormal point and is output.

Further, the abnormal value processing includes removing data of irrelevant dimensions and deleting abnormal values from the data.

Further, the normalization process adopts a dispersion normalization method.

Further, the k-th distance neighborhood, the local reachable distance and the local reachable density are calculated only within the data subset where the object p is located.

Further, the threshold ψ is dynamically set and adjusted according to empirical values or actual business changes.

According to the technical scheme, in the LOF-based anti-fraud identification of Internet financial applications, the outlier threshold ψ is set according to experience and actual business; during the recursive computation, high-density non-outliers are continuously removed and the points most likely to be outliers are output. This effectively shortens the running time of outlier detection, improves the efficiency of outlier detection on high-dimensional large data sets, allows Internet application behaviors to be monitored in real time, allows abnormal fraudulent applications to be detected promptly and accurately, and reduces credit loss.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only one embodiment of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

An Internet financial application anti-fraud identification method based on an LOF algorithm comprises the following steps: step one: collecting the operation tracking (buried-point) data submitted by a customer on the client, personal basic information, and third-party data authorized by the customer; step two: data preprocessing, including abnormal value processing and normalization processing; step three: selecting data features according to the behavior feature types of credit fraud to obtain a data set for the LOF algorithm, and randomly dividing the data set into different data subsets; step four: based on the data subset, calculating the k-th distance neighborhood of an object p in the data subset through the LOF algorithm, and then calculating the local reachable distance of the object p; step five: calculating the local reachable density of the object p from the local reachable distance; step six: calculating the local outlier factor (LOF) value of the object p from the local reachable density; step seven: repeating steps one to six recursively, wherein in each loop the obtained LOF value is compared with a set threshold ψ, an object whose LOF value is smaller than ψ is judged to be a normal point and is removed, and an object whose LOF value is larger than ψ is judged to be an abnormal point and is output.

In this embodiment, data can be collected by traffic collection devices deployed on network nodes, and the collected data features can comprehensively reflect the repayment capacity and repayment willingness of the applicant; the personal basic information includes traditional data such as personal and family status, occupation and income level.

In this embodiment, the data set for the LOF algorithm is divided into different data sets, including a training set and a verification set. In the high-dimensional data set, selected data dimensions are each divided into n segments, and the data set is partitioned along the lines connecting the dividing points marked on each dimension. The irregular sections produced by the partition are bounded by grid boundaries, whose specific boundary values are determined from the dimensionality and size of the data set and the given number of dividing intervals n.
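The grid division described above could be sketched in Python as follows. This is only an illustration under stated assumptions: every dimension is cut into n equal-width segments between its minimum and maximum, and the function names grid_cell and partition are hypothetical rather than taken from the invention.

```python
def grid_cell(point, lows, highs, n):
    """Return the grid cell (one segment index per dimension) containing a point.

    Assumption: each dimension is cut into n equal-width segments.
    """
    cell = []
    for x, lo, hi in zip(point, lows, highs):
        width = (hi - lo) / n
        if width == 0:
            idx = 0  # constant dimension: everything falls into one segment
        else:
            # Clamp so that the maximum value lands in the last segment.
            idx = min(int((x - lo) / width), n - 1)
        cell.append(idx)
    return tuple(cell)


def partition(points, n):
    """Group points into data subsets keyed by their grid cell."""
    dims = list(zip(*points))
    lows = [min(d) for d in dims]
    highs = [max(d) for d in dims]
    subsets = {}
    for p in points:
        subsets.setdefault(grid_cell(p, lows, highs, n), []).append(p)
    return subsets


# Hypothetical normalized two-feature records.
points = [(0.0, 0.0), (0.1, 0.2), (0.9, 0.8), (1.0, 1.0)]
for cell, members in partition(points, n=2).items():
    print(cell, members)
```

Each resulting subset can then be processed on its own, which is what keeps the neighborhood search local.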

In this embodiment, the data subset in which the object p is located is denoted p_i. Let d_k(p) = d(o_k, p) be the distance between the object p and its k-th nearest neighbor o_k; then at least k objects o_i satisfy d(o_i, p) ≤ d(o_k, p), and at most k-1 objects o_j satisfy d(o_j, p) < d(o_k, p). The k-neighbors of the object p, written N_k(p), are all objects whose distance from p does not exceed d_k(p). Averaging the distances from the object p to its k-neighbors gives the m-distance of p, calculated as:

$$m\text{-}dist(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} d(p, o)$$

The m-neighbors of the object p are the set of all objects whose distance from p does not exceed the m-distance of p. The reachable distance reach_dist_m(o, p) of the object p with respect to an object o is the maximum of the m-distance of p and the distance between p and o. The local reachable density lrd_m(p) of the object p is the inverse of the average reachable distance from the points within the k-th distance neighborhood of p to p, so that lrd_m(p) is:

$$lrd_m(p) = \left( \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} reach\_dist_m(o, p) \right)^{-1}, \qquad reach\_dist_m(o, p) = \max\{\, m\text{-}dist(p),\ d(p, o) \,\}$$

where N_k(p) denotes the set of points within the k-th distance neighborhood of p.

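The local anomaly factor of the object p is then the average, over the points o within the k-th distance neighborhood N_k(p) of p, of the ratio of the local reachable density of o to that of p:

$$LOF_m(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{lrd_m(o)}{lrd_m(p)}$$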
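As a minimal Python sketch of steps four to six, the m-distance, local reachable density and LOF value could be computed as follows, taking the LOF value as the average ratio of the neighbors' local reachable density to that of p. The sketch assumes Euclidean distance, distinct data points, and a data subset small enough for brute-force neighbor search; all function names are hypothetical. The reachable distance follows the wording of this description (it uses the m-distance of p itself), whereas the classical LOF definition uses the k-distance of the neighbor o.

```python
import math


def neighbors(i, data, k):
    """Indices of the objects within the k-th distance of data[i] (ties included)."""
    ranked = sorted((math.dist(data[i], data[j]), j)
                    for j in range(len(data)) if j != i)
    k_dist = ranked[k - 1][0]
    return [j for d, j in ranked if d <= k_dist]


def m_distance(i, data, k):
    """Average distance from data[i] to its k-neighbors."""
    nb = neighbors(i, data, k)
    return sum(math.dist(data[i], data[j]) for j in nb) / len(nb)


def lrd(i, data, k):
    """Local reachable density: inverse of the average reachable distance."""
    nb = neighbors(i, data, k)
    m = m_distance(i, data, k)
    # Reachable distance as worded in the description: max(m-distance of p, d(p, o)).
    reach = [max(m, math.dist(data[i], data[j])) for j in nb]
    return len(nb) / sum(reach)


def lof(i, data, k):
    """Average ratio of the neighbors' density to the density of data[i]."""
    nb = neighbors(i, data, k)
    return sum(lrd(j, data, k) for j in nb) / (len(nb) * lrd(i, data, k))


# Hypothetical data subset: four clustered points and one isolated point.
subset = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
for idx, point in enumerate(subset):
    print(point, round(lof(idx, subset, k=2), 3))
```

In this toy subset the four clustered points obtain LOF values close to 1, while the isolated point obtains a value well above 1. The brute-force search recomputes neighborhoods repeatedly; a practical system would cache them or use a spatial index.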

In a preferred embodiment, the abnormal value processing includes removing data of irrelevant dimensions and deleting abnormal values from the data.

In a preferred embodiment, the normalization process uses the dispersion (min-max) normalization method, which maps the data into the interval [0, 1]. The dispersion normalization formula is:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where x' is the normalized value, x is the value before normalization, x_min is the minimum value of the feature, and x_max is the maximum value of the feature.
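A minimal Python sketch of dispersion normalization, assuming the usual min-max form x' = (x - x_min) / (x_max - x_min). The function name and the sample feature are hypothetical, and the handling of a constant feature is an added assumption that the description does not address.

```python
def min_max_normalize(values):
    """Map one feature column onto [0, 1] by dispersion (min-max) normalization."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0:
        # Assumption: a constant feature is mapped to 0.0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(x - x_min) / span for x in values]


incomes = [3000, 4500, 6000, 9000]  # hypothetical monthly income feature
print(min_max_normalize(incomes))   # [0.0, 0.25, 0.5, 1.0]
```

Normalizing each feature this way keeps a feature with a large numeric range from dominating the distance calculations that follow.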

In a preferred embodiment, the k-th distance neighborhood, the local reachable distance and the local reachable density are calculated only within the data subset where the object p is located.

In a preferred embodiment, the threshold ψ is dynamically adjusted according to empirical values or actual business changes. In this embodiment, the threshold ψ is 1 by default.
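The comparison against the threshold ψ could be sketched in Python as follows. The function name, the application identifiers and the LOF values are hypothetical, and treating a value exactly equal to ψ as normal is an assumption, since the description only covers the "smaller than" and "larger than" cases.

```python
def classify(lof_values, psi=1.0):
    """Split objects into normal points and abnormal points by comparing LOF with psi.

    Assumption: an LOF value exactly equal to psi is treated as normal.
    """
    normal = [obj for obj, value in lof_values.items() if value <= psi]
    abnormal = [obj for obj, value in lof_values.items() if value > psi]
    return normal, abnormal


# Hypothetical LOF values keyed by application identifier.
scores = {"app_001": 0.95, "app_002": 1.02, "app_003": 3.40}

print(classify(scores))           # default threshold of 1
print(classify(scores, psi=1.5))  # threshold raised, e.g. after reviewing false alarms
```

Passing ψ as a parameter reflects the dynamic adjustment described above: the same LOF values can be re-evaluated when business conditions change, without recomputing the densities.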

It should be understood that the above-described embodiments are merely exemplary of the present invention, and are not intended to limit the present invention, and that any modification, equivalent replacement, or improvement made without departing from the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. An Internet financial application anti-fraud identification method based on an LOF algorithm is characterized by comprising the following steps:

step one: collecting the operation tracking (buried-point) data submitted by a customer on the client, personal basic information, and third-party data authorized by the customer;

step two: data preprocessing, including abnormal value processing and normalization processing;

step three: selecting data characteristics according to behavior characteristic types of credit fraud to obtain a data set of an LOF algorithm, and randomly dividing the data set into different data subsets;

step four: based on the data subset, calculating the k-th distance neighborhood of an object p in the data subset through the LOF algorithm, and then calculating the local reachable distance of the object p;

step five: calculating the local reachable density of the object p according to the local reachable distance;

step six: calculating the LOF value of the local abnormal factor of the object p according to the local reachable density;

step seven: repeating steps one to six recursively, wherein in each loop the obtained LOF value is compared with a set threshold ψ, an object whose LOF value is smaller than ψ is judged to be a normal point and is removed, and an object whose LOF value is larger than ψ is judged to be an abnormal point and is output.

2. The Internet financial application anti-fraud identification method based on an LOF algorithm of claim 1, wherein the abnormal value processing includes removing data of irrelevant dimensions and deleting abnormal values from the data.

3. The Internet financial application anti-fraud identification method based on an LOF algorithm of claim 1, wherein the normalization process adopts a dispersion normalization method.

4. The Internet financial application anti-fraud identification method based on an LOF algorithm of claim 1, wherein the k-th distance neighborhood, the local reachable distance and the local reachable density are calculated only within the data subset where the object p is located.

5. The Internet financial application anti-fraud identification method based on an LOF algorithm of claim 1, wherein the threshold ψ is dynamically set and adjusted according to empirical values or actual business changes.