CN110188198B - Anti-fraud method and device based on knowledge graph - Google Patents

Publication number
CN110188198B
Authority
CN
China
Prior art keywords
enterprise
data
attribute
probability
knowledge graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910415531.XA
Other languages
Chinese (zh)
Other versions
CN110188198A (en
Inventor
窦志成
姜涛
韩维思
黄真
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yilanqunzhi Data Technology Co ltd
Original Assignee
Beijing Yilanqunzhi Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yilanqunzhi Data Technology Co ltd
Priority to CN201910415531.XA
Publication of CN110188198A
Application granted
Publication of CN110188198B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Technology Law (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Development Economics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses an anti-fraud method and an anti-fraud device based on a knowledge graph, wherein the method comprises the following steps: extracting entities, entity attribute data and relationship data from a data source; screening and processing the entity attribute data, and constructing a knowledge graph by using the processed entity attribute data and the relationship data, wherein the knowledge graph comprises first class nodes and second class nodes, the first class nodes are nodes of known labels, and the second class nodes are nodes of labels to be predicted; predicting labels for the second class of nodes based on the knowledge-graph.

Description

Anti-fraud method and device based on knowledge graph
Technical Field
The present application relates to an anti-fraud technology, and in particular, to an anti-fraud method and apparatus based on a knowledge graph.
Background
Traditional lending institutions, such as those in the banking industry, spend a great deal of labor and time evaluating, investigating, and verifying the declarations of enterprises applying for loans, so the financing cycle for an enterprise is long and the financing cost is high. Many small enterprises self-declare their information when applying for loans, and the declared data contains a great deal of false information, so that institutions cannot correctly evaluate the loan risk.
Disclosure of Invention
In order to solve the technical problem, the embodiment of the application provides an anti-fraud method and an anti-fraud device based on a knowledge graph.
The anti-fraud method based on the knowledge graph provided by the embodiment of the application comprises the following steps:
extracting entities, entity attribute data and relationship data from a data source;
screening and processing the entity attribute data, and constructing a knowledge graph by using the processed entity attribute data and the relationship data, wherein the knowledge graph comprises first class nodes and second class nodes, the first class nodes are nodes of known labels, and the second class nodes are nodes of labels to be predicted;
predicting labels for the second class of nodes based on the knowledge-graph.
In one embodiment, the entity is an enterprise; accordingly,
the entity attribute data comprises enterprise information and personal customer information;
the relationship data includes at least one of: the correspondence between an enterprise and an individual, the correspondence between an individual and an individual, the correspondence between an enterprise and a related attribute, the correspondence between an individual and a related attribute, and the correspondence between an enterprise and an enterprise.
In one embodiment, prior to constructing the knowledge-graph, the method further comprises:
reducing the relationship data so that each relationship corresponds to an enterprise.
In one embodiment, the personal customer information comprises information on the enterprise's actual controller and information on a plurality of enterprise affiliates;
the screening and processing of the entity attribute data includes:
aggregating the information on the plurality of enterprise affiliates to obtain affiliate aggregation features;
associating the actual controller features and the affiliate aggregation features with the enterprise to obtain enterprise sample data;
performing at least one of the following processes on the enterprise sample data: outlier handling, missing-value handling, analysis of correlation between variables, and encoding of categorical variables.
In one embodiment, the constructing a knowledge graph using the processed entity attribute data and the relationship data includes:
taking enterprises as nodes of the knowledge graph;
using the processed enterprise information, the actual controller information, and the aggregated affiliate information together as the attributes of the nodes;
taking the reduced relationships among the enterprises as the relationships of the knowledge graph;
and deleting isolated nodes existing in the knowledge graph.
In one embodiment, the predicting labels for the second class of nodes based on the knowledge-graph includes:
S1: training on the attribute features of enterprises whose true labels are known to obtain a fraud prediction model local_classifier;
S2: extracting from the knowledge graph the attribute feature matrix of enterprises whose true labels are known, calculating, according to the node associations in the knowledge graph, the proportion of positive samples in the one-degree neighborhood of each enterprise node in the training data set, appending this value to the enterprise attribute features, and then training again to obtain a fraud prediction model relationship_classifier that incorporates neighbor label information;
S3: inputting the attribute features of enterprises with unknown labels into the model trained in S1 to obtain an initial fraud risk probability pos_probability, and storing this probability value in the knowledge graph as an attribute of the enterprise node to be predicted;
S4: setting a predefined maximum number of iteration rounds N, initializing the iteration counter i to 1, and denoting the number of enterprise nodes to be predicted as M, where N, i and M are positive integers;
S5: calculating the non-fraud probability of each enterprise node to be predicted as neg_probability = 1 - pos_probability, taking the difference between pos_probability and neg_probability as the confidence, and sorting the nodes by the absolute value of the confidence;
S6: selecting the first i × M/N enterprises for pre-classification, setting the estimated label of a prediction sample to positive if its confidence is greater than 0 and to negative if its confidence is less than or equal to 0, and writing the estimated label back to the pred attribute in the knowledge graph;
S7: calculating the proportion of positive samples among the one-degree neighbors of each enterprise node to be predicted, where a neighbor node is included in the calculation if it is a training sample or a test node whose label was determined in a previous round, and appending the result to the attribute data of the enterprise;
S8: classifying with the relationship_classifier from S2 to obtain the fraud risk probability pos_probability of each node for the current round, and writing it back to the knowledge graph to update the attribute value;
S9: incrementing the iteration counter, i = i + 1, and repeating S5, S6, S7 and S8; the iteration ends when i > N or when the prediction result of the current round is the same as that of the previous round.
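The loop in S3 to S9 can be sketched in plain Python as follows. This is only an illustrative sketch, not the patented implementation: the two classifiers are passed in as hypothetical callables that map a feature list to a fraud probability, the graph is a plain adjacency dictionary, the nodes are assumed to be ranked from highest to lowest absolute confidence, and i × M/N is rounded up. None of these choices is stated in the text.

```python
import math
from typing import Callable, Dict, List, Optional, Set

Classifier = Callable[[List[float]], float]  # hypothetical: features -> P(fraud)


def positive_neighbor_ratio(node: str, neighbors: Dict[str, Set[str]],
                            labels: Dict[str, int]) -> float:
    """Share of positive labels among the one-degree neighbors that have a label."""
    known = [labels[n] for n in neighbors.get(node, ()) if n in labels]
    return sum(known) / len(known) if known else 0.0


def iterative_predict(features: Dict[str, List[float]],
                      neighbors: Dict[str, Set[str]],
                      train_labels: Dict[str, int],
                      unlabeled: List[str],
                      local_classifier: Classifier,
                      relationship_classifier: Classifier,
                      max_rounds: int):
    # S3: initial probability from attributes only.
    pos_probability = {v: local_classifier(features[v]) for v in unlabeled}
    pred: Optional[Dict[str, int]] = None
    m = len(unlabeled)
    for i in range(1, max_rounds + 1):            # S4 / S9: stop once i > N
        # S5: confidence = pos_probability - neg_probability.
        confidence = {v: p - (1.0 - p) for v, p in pos_probability.items()}
        ranked = sorted(unlabeled, key=lambda v: abs(confidence[v]), reverse=True)
        # S6: pre-classify the first i*M/N nodes (rounded up: an assumption).
        top = ranked[:math.ceil(i * m / max_rounds)]
        new_pred = {v: 1 if confidence[v] > 0 else 0 for v in top}
        if new_pred == pred:                      # S9: same result as last round
            break
        pred = new_pred
        # S7: neighbors count if they are training samples or already labelled.
        known = {**train_labels, **pred}
        for v in unlabeled:
            ratio = positive_neighbor_ratio(v, neighbors, known)
            # S8: re-score with the neighbor-aware model.
            pos_probability[v] = relationship_classifier(features[v] + [ratio])
    return pos_probability, pred or {}
```

In each round a larger share of the unlabeled nodes receives a provisional label, so the neighbor ratio in S7 is computed from progressively more information.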
The knowledge-graph-based anti-fraud device provided by the embodiment of the application includes:
an extraction unit, used for extracting entities, entity attribute data and relationship data from a data source;
a processing unit, used for screening and processing the entity attribute data;
a graph construction unit, used for constructing a knowledge graph by using the processed entity attribute data and the relationship data, wherein the knowledge graph comprises first class nodes and second class nodes, the first class nodes are nodes of known labels, and the second class nodes are nodes of labels to be predicted;
and a prediction unit, used for predicting the labels of the second class nodes based on the knowledge graph.
In one embodiment, the entity is an enterprise; accordingly,
the entity attribute data comprises enterprise information and personal customer information;
the relationship data includes at least one of: the correspondence between an enterprise and an individual, the correspondence between an individual and an individual, the correspondence between an enterprise and a related attribute, the correspondence between an individual and a related attribute, and the correspondence between an enterprise and an enterprise.
In one embodiment, the apparatus further comprises:
a reduction unit, used for reducing the relationship data so that each relationship corresponds to an enterprise.
In one embodiment, the personal customer information comprises information on the enterprise's actual controller and information on a plurality of enterprise affiliates;
the processing unit is used for aggregating the information on the plurality of enterprise affiliates to obtain affiliate aggregation features; associating the actual controller features and the affiliate aggregation features with the enterprise to obtain enterprise sample data; and performing at least one of the following processes on the enterprise sample data: outlier handling, missing-value handling, analysis of correlation between variables, and encoding of categorical variables.
In one embodiment, the graph construction unit is configured to:
take enterprises as nodes of the knowledge graph;
use the processed enterprise information, the actual controller information, and the aggregated affiliate information together as the attributes of the nodes;
take the reduced relationships among the enterprises as the relationships of the knowledge graph;
and delete isolated nodes existing in the knowledge graph.
In one embodiment, the prediction unit is configured to perform the following steps:
S1: training on the attribute features of enterprises whose true labels are known to obtain a fraud prediction model local_classifier;
S2: extracting from the knowledge graph the attribute feature matrix of enterprises whose true labels are known, calculating, according to the node associations in the knowledge graph, the proportion of positive samples in the one-degree neighborhood of each enterprise node in the training data set, appending this value to the enterprise attribute features, and then training again to obtain a fraud prediction model relationship_classifier that incorporates neighbor label information;
S3: inputting the attribute features of enterprises with unknown labels into the model trained in S1 to obtain an initial fraud risk probability pos_probability, and storing this probability value in the knowledge graph as an attribute of the enterprise node to be predicted;
S4: setting a predefined maximum number of iteration rounds N, initializing the iteration counter i to 1, and denoting the number of enterprise nodes to be predicted as M, where N, i and M are positive integers;
S5: calculating the non-fraud probability of each enterprise node to be predicted as neg_probability = 1 - pos_probability, taking the difference between pos_probability and neg_probability as the confidence, and sorting the nodes by the absolute value of the confidence;
S6: selecting the first i × M/N enterprises for pre-classification, setting the estimated label of a prediction sample to positive if its confidence is greater than 0 and to negative if its confidence is less than or equal to 0, and writing the estimated label back to the pred attribute in the knowledge graph;
S7: calculating the proportion of positive samples among the one-degree neighbors of each enterprise node to be predicted, where a neighbor node is included in the calculation if it is a training sample or a test node whose label was determined in a previous round, and appending the result to the attribute data of the enterprise;
S8: classifying with the relationship_classifier from S2 to obtain the fraud risk probability pos_probability of each node for the current round, and writing it back to the knowledge graph to update the attribute value;
S9: incrementing the iteration counter, i = i + 1, and repeating S5, S6, S7 and S8; the iteration ends when i > N or when the prediction result of the current round is the same as that of the previous round.
With the technical scheme of the embodiment of the application, high-risk enterprises can be identified and filtered out by means of the enterprise relationship graph. By integrating enterprise industrial and commercial data with enterprise affiliate data and taking into account the enterprises associated with a target enterprise, the association relationships between enterprises are ultimately characterized. The technical scheme helps to identify fraud cases such as gang fraud, gang involvement in organized crime, and loan fraud; it can comprehensively evaluate the risk profile of enterprises applying for loans, guard against hidden fraud in advance, and block fraudulent lending paths. Besides enterprise fraud risk identification, the constructed multi-relationship graphs can also be used for visual analysis and mining of shareholding structures, litigation involvement, senior-management relationships, kinship relationships, and the like.
Drawings
FIG. 1 is a schematic flow chart of a knowledge-graph-based anti-fraud method provided by an embodiment of the present application;
FIG. 2 is a logic diagram of a risk probability prediction algorithm provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a knowledge-graph-based anti-fraud device provided by an embodiment of the present application.
Detailed Description
To facilitate understanding of the technical solutions of the embodiments of the present application, the related art is described below.
Current credit anti-fraud approaches can be summarized into four technical means: black and white lists, rule engines, supervised learning, and unsupervised learning.
Black and white lists are the simplest and most primitive anti-fraud means: new applicants are queried and matched against historical blacklist data in order to filter out fraudulent users.
The rule engine originates from rule-based expert systems and simulates human behavior to realize automated computer decision making. Built on a thorough understanding of the characteristics and patterns of fraudulent behavior, it is an activation and triggering mechanism designed for single or combined fraudulent behaviors.
Supervised learning is the most widely used machine learning method in current anti-fraud detection. It requires known fraud data and normal data to be collected as a training set; the trained model analyzes the latent relationships among features through an abstract understanding of user characteristics, supplementing and strengthening coverage of complex fraudulent behaviors that a rule engine cannot cover.
Unsupervised learning is an anti-fraud strategy that has emerged in recent years. The detection algorithm does not rely on any labels for model training; it finds common anomalies among fraudulent user behaviors through correlation analysis and similarity analysis, forms clusters, and discovers unknown fraudulent behaviors in one or more clusters.
The four technical means have the following problems, respectively:
1. Black and white lists are simple and easy to use, but they take a long time to accumulate and are costly to purchase. Their effectiveness in identifying fraudsters is limited by the size and source of the blacklist, and they inherently lag in time, making it difficult to stop fraud cases in advance.
2. A rule engine based on expert experience has the advantage of simple configuration, but rules are made and updated based on business experience, which carries a certain risk of misjudgment. Although a rule engine can identify new fraudsters, it cannot detect new patterns of fraud. Because rules remain effective only for a limited time, a rule engine requires substantial operating resources, time, and expense to maintain.
3. Supervised learning, which is widely used at present, avoids the interference of human experience, but the need to collect sufficient training data and accurate labels limits it. Most machine learning models, especially the logistic regression models commonly used in the financial industry, require long training times and therefore struggle to cope with ever-changing fraudulent behavior. In addition, supervised machine learning in the traditional sense is mostly suited to independently distributed data, that is, data in which the samples are neither correlated with nor dependent on one another. In an anti-fraud scenario, hidden relationships between enterprises often contain unknown latent information, and using the information of related enterprises to predict a given enterprise is a challenge for traditional supervised learning.
4. Unsupervised learning does not require extensive manual labeling, but the clustering results still need to be judged by business experts using domain knowledge, and there is no clear standard for evaluating clustering quality.
The technical scheme of the embodiment of the application aims to combine publicly disclosed enterprise fraud records, integrate a plurality of authoritative data sources, proactively construct a relationship graph between enterprises, comprehensively evaluate the operating condition of unknown target enterprises, quantify enterprise fraud risk, and help lending institutions quickly formulate risk control strategies.
To allow the features and technical content of the embodiments to be understood in more detail, the embodiments are described below with reference to the accompanying drawings. The related concepts involved in the embodiments of the present application are explained first:
Entity: a thing that is distinguishable and exists independently. The anti-fraud graph constructed in the present application contains only one type of entity, namely the enterprise.
Relationship: an association between entities, for example an actual-control relationship or a contact-telephone relationship.
Attribute: a description of an entity or a relationship. Entities typically have attributes, such as an enterprise's industrial and commercial data. Relationships may also have attributes, such as a weight on the relationship.
One-degree association (one-degree neighbor): a node directly connected to the target node.
Enterprise affiliates: the actual controller, legal representative, senior managers, and shareholders of an enterprise.
AUM: a metric by which a bank assesses a customer, measuring the customer's contribution to the bank.
LightGBM (Light Gradient Boosting Machine): an open-source framework implementing the GBDT (Gradient Boosting Decision Tree) algorithm that supports efficient parallel training.
FIG. 1 is a schematic flow chart of a knowledge-graph-based anti-fraud method according to an embodiment of the present application. As shown in FIG. 1, the method includes the following steps:
Step 101: Extract entities, entity attribute data, and relationship data from a data source.
All data sources used in this application come from enterprise data, personal customer data, and external third-party data provided by the institution. Data extraction can be divided into the extraction of entities and attributes and the extraction of relationships. In an alternative embodiment, data is extracted for enterprises whose loan application time falls between time t1 and time t2 and which have a repayment record; for example, enterprises with a repayment record whose loan application time falls between January 2018 and December 2018.
A) Extraction of entities and attributes
The application models at enterprise granularity, and each entity is an enterprise. The entity attributes consist of the attribute information of the enterprise itself and the attribute information of the personal customers related to the enterprise. The attribute information of the enterprise (referred to as enterprise data or enterprise information for short) includes, but is not limited to, basic information such as the enterprise's technical number, industrial and commercial data, contact telephone, registered address, industry category, and establishment date, as well as enterprise deposit data, transfer data, and loan data. The attribute information of the enterprise-related personal customers (referred to as personal customer data or personal customer information for short) includes, but is not limited to, basic information such as personal technical number, gender, age, educational background, residence status, occupation, job position, professional title, marital status, and children status, as well as personal deposit data and loan data.
B) Extraction of relationships
The data tables involved in extracting relationships can be summarized into the following five categories:
1. Correspondence between enterprises and individuals (actual-control relationship table, senior-management relationship table, legal-representative relationship table, investment and shareholding relationship table).
2. Correspondence between individuals (immediate-family relationship table, spouse relationship table).
3. Correspondence between enterprises and related attributes (enterprise address relationship table, enterprise telephone number relationship table).
4. Correspondence between individuals and related attributes (individual address relationship table, individual telephone number relationship table, individual device usage relationship table).
5. Correspondence between enterprises (enterprise guarantee relationship table).
In the embodiment of the present application, the relationship data needs to be reduced so that each relationship corresponds to an enterprise. Specifically, the original enterprise relationship information comes from multiple sources, and the data is heterogeneous and fragmented. To ensure that the whole enterprise relationship network is homogeneous, that is, that the knowledge graph entities are of a uniform type, the method reduces the relationships as shown in Table 1 below, ensuring that each relationship corresponds to an enterprise.
Table 1 (relationship reduction rules; provided as images in the original publication)
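Because the reduction rules of Table 1 are not reproduced in this text, the following Python sketch shows only one plausible form of reduction, stated here as an assumption: two enterprises are linked whenever they share the same intermediate person or attribute (for example the same actual controller or the same telephone number). The function name `reduce_relations` and the row layout are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterable, Set, Tuple

Row = Tuple[str, str, str]   # (enterprise_id, intermediate_key, relation_type)
Edge = Tuple[str, str, str]  # (enterprise_a, enterprise_b, relation_type)


def reduce_relations(rows: Iterable[Row]) -> Set[Edge]:
    """Turn enterprise-to-person/attribute rows into enterprise-to-enterprise edges.

    Assumption: enterprises sharing the same intermediate key under the same
    relation type are connected; the intermediate itself is dropped.
    """
    enterprises_by_key = defaultdict(set)
    for enterprise, key, relation in rows:
        enterprises_by_key[(relation, key)].add(enterprise)

    edges: Set[Edge] = set()
    for (relation, _key), enterprises in enterprises_by_key.items():
        # Sort so that each undirected edge is stored once, in a stable order.
        for a, b in combinations(sorted(enterprises), 2):
            edges.add((a, b, relation))
    return edges
```

After this step only enterprises remain as endpoints, which matches the stated goal of a graph with a single entity type.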
Step 102: Screen and process the entity attribute data.
A) Enterprise data
The attribute data of an enterprise comes from the enterprise information extracted from the data source. It specifically comprises basic information such as the enterprise's technical number, industrial and commercial data, registered address, industry category, establishment date, and registration date, as well as enterprise deposit data, transfer data, and loan data. The industrial and commercial data comprises five items: registered capital, total assets at annual inspection, total customer profit, sales or business income, and total net assets. The enterprise deposit data comprises the enterprise's deposit balance at the data cutoff time and its monthly, quarterly, and yearly deposit figures. The enterprise transfer data comprises the total number of transfers (inbound and outbound) and the total transfer amount within the year. The enterprise loan data comprises the number of loan applications, the number of rejected loan applications, and overdue status (overdue principal, interest, and days) within one year.
B) Enterprise affiliate data
Enterprise data alone is not sufficient for predicting fraud cases. Therefore, in addition to the enterprise attribute data, the information on the actual controller and the other affiliates is matched to build enterprise-based multi-dimensional feature data for each enterprise, strengthening the representational power of the overall data. Each enterprise has a unique actual controller and a plurality of other affiliates, and the actual controller is more closely associated with the enterprise than the other affiliates are. Therefore, the information on the actual controller is concatenated with the enterprise information on its own, while the information on the enterprise's other affiliates is aggregated to further expand the enterprise features. The information on the actual controller and on the other affiliates all comes from the extracted personal customer data and specifically comprises personal technical number, gender, age, educational background, residence status, occupation, job position, professional title, marital status, children status, personal deposit data, and loan data. The personal customer deposit data comprises the personal deposit balance at the data cutoff time, the monthly, quarterly, and yearly deposit figures, and the point-in-time, monthly average, and yearly average AUM values. The loan data comprises the number of loan applications and overdue status (number of consecutive overdue periods, overdue principal, interest, and maximum number of overdue days) within the year.
When the other affiliates of an enterprise are aggregated, the aggregation functions selected for the different variable types are:
- Numerical variables: maximum, sum, median, and mean
- Categorical variables: mode
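The aggregation rules above can be illustrated with a short Python sketch. The record layout, the field names, and the `aggregate_affiliates` helper are hypothetical; missing values are simply skipped, and ties for the mode are resolved by first occurrence, neither of which is specified in the text.

```python
from collections import Counter
from statistics import mean, median
from typing import Any, Dict, List, Sequence


def aggregate_affiliates(affiliates: List[Dict[str, Any]],
                         numeric_fields: Sequence[str],
                         categorical_fields: Sequence[str]) -> Dict[str, Any]:
    """Collapse the records of one enterprise's affiliates into one feature row."""
    features: Dict[str, Any] = {}
    for field in numeric_fields:
        values = [a[field] for a in affiliates if a.get(field) is not None]
        if not values:
            continue  # nothing to aggregate for this field
        features[f"{field}_max"] = max(values)
        features[f"{field}_sum"] = sum(values)
        features[f"{field}_median"] = median(values)
        features[f"{field}_mean"] = mean(values)
    for field in categorical_fields:
        values = [a[field] for a in affiliates if a.get(field) is not None]
        if values:
            # Mode: the most frequent category among the affiliates.
            features[f"{field}_mode"] = Counter(values).most_common(1)[0][0]
    return features
```

Each enterprise thus receives a fixed-length block of affiliate features regardless of how many affiliates it has.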
Finally, the processed actual controller features and affiliate aggregation features are associated with the enterprise, so that each sample (enterprise) corresponds to one record. The enterprise sample data is then processed by at least one of the following: outlier handling, missing-value handling, analysis of correlation between variables, and encoding of categorical variables. For example, features with a missing rate greater than 80% or a Pearson coefficient greater than 0.98 are deleted, and the remaining features are used as the enterprise attribute features for model training.
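The example thresholds can be sketched in Python as below. Several details are assumptions rather than part of the text: columns are held as lists with `None` for missing values, the correlation is computed on rows where both features are present, its absolute value is compared with the threshold, and of two highly correlated features the one encountered later is dropped.

```python
import math
from typing import Dict, List, Optional, Sequence

Column = List[Optional[float]]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)


def select_features(columns: Dict[str, Column],
                    max_missing: float = 0.8,
                    max_corr: float = 0.98) -> List[str]:
    """Return the names of the features kept after both filters."""
    kept: List[str] = []
    for name, col in columns.items():
        missing_rate = sum(v is None for v in col) / len(col)
        if missing_rate > max_missing:
            continue  # too sparse to be useful
        redundant = False
        for other in kept:
            pairs = [(a, b) for a, b in zip(columns[other], col)
                     if a is not None and b is not None]
            if len(pairs) >= 2 and abs(pearson(*zip(*pairs))) > max_corr:
                redundant = True  # nearly duplicates an already kept feature
                break
        if not redundant:
            kept.append(name)
    return kept
```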
Step 103: Construct a knowledge graph by using the processed entity attribute data and the relationship data, wherein the knowledge graph comprises first class nodes and second class nodes, the first class nodes are nodes of known labels, and the second class nodes are nodes of labels to be predicted.
The knowledge graph in the embodiment of the application is also called an anti-fraud enterprise knowledge graph, and it is constructed as follows:
A) Take enterprises as the nodes of the knowledge graph.
Specifically, the enterprises submitting credit applications are taken as the node entities of the graph; the persons or things through which the relationships are established are reduced away during relationship construction.
B) Use the processed enterprise information, the actual controller information, and the aggregated affiliate information together as the attributes of the nodes.
Specifically, the basic information of the enterprise, the basic information of the actual controller, and the aggregated basic information of the affiliates, as processed in step 102, are taken together as the attributes of the entity.
C) Take the reduced relationships among the enterprises as the relationships of the knowledge graph.
Specifically, the various reduced relationships between the enterprises from step 101 are taken as the relationships of the graph.
D) Delete isolated nodes existing in the knowledge graph.
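Steps A) to D) can be sketched with plain Python dictionaries. The adjacency-dictionary representation and the `build_graph` helper are illustrative assumptions; the text does not name a graph store or library.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (enterprise_a, enterprise_b, relation_type)


def build_graph(node_attributes: Dict[str, List[float]],
                edges: Iterable[Edge]):
    """Build an undirected enterprise graph and drop isolated nodes."""
    # A) + B): every enterprise is a node carrying its attribute vector.
    neighbors: Dict[str, Set[str]] = {node: set() for node in node_attributes}
    # C): reduced enterprise-to-enterprise relationships become edges.
    for a, b, _relation in edges:
        if a in neighbors and b in neighbors and a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    # D): delete nodes that have no neighbor at all.
    connected = {node: nbrs for node, nbrs in neighbors.items() if nbrs}
    attributes = {node: node_attributes[node] for node in connected}
    return attributes, connected
```

Isolated nodes carry no neighbor information, so removing them leaves only enterprises for which a neighbor-based feature can be computed.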
In the embodiment of the application, because historical data lacks a precise definition of enterprise fraud, an enterprise fraud label is established from the serious violation records of enterprises and their affiliates disclosed within the institution or by relevant authorities over a period of time, and this label is used as the target variable. Relevant serious-violation data on enterprises and individuals include, but are not limited to: 1. fraud lists in the institution's internal fraud system; 2. administrative violation records of enterprises and individuals and blacklists of criminal suspects; 3. a number of cross-platform loans greater than N, for example N = 4.
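A minimal sketch of such a label rule is given below. The record field names are hypothetical, and combining the three criteria with a logical OR is an assumption; the description only lists the data sources and does not state how they are combined.

```python
def fraud_label(record, max_platform_loans=4):
    """Return 1 if any serious-violation criterion holds, otherwise 0."""
    return int(
        bool(record.get("on_internal_fraud_list", False))
        or bool(record.get("has_violation_or_blacklist_record", False))
        or record.get("cross_platform_loans", 0) > max_platform_loans
    )

# Hypothetical enterprise records.
clean = {"cross_platform_loans": 4}
listed = {"on_internal_fraud_list": True}
over_borrowed = {"cross_platform_loans": 5}
labels = [fraud_label(r) for r in (clean, listed, over_borrowed)]
print(labels)
```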
Step 104: Predict the labels of the second class nodes based on the knowledge graph.
In the embodiment of the present application, predicting the labels of the second class nodes based on the knowledge graph can be formulated as the following problem: the enterprise relationship graph constructed in step 103 is denoted G, the set of all nodes on G is denoted V, X is the enterprise's own attribute vector from step 1, Vk is the set of nodes with known labels in G, and Vu is the set of nodes to be predicted in G. Given the graph structure G, the attributes X carried by all nodes (enterprises), and the known-label nodes Vk, the label type of each node in Vu is predicted from this information. The corresponding pseudo-code is as follows:
[Figures BDA0002064217420000111 and BDA0002064217420000121: pseudo-code listing, reproduced as images in the original publication]
Referring to fig. 2, the pseudo-code implements the following steps:
S1: Train on the enterprise attribute features with known true labels, using Microsoft's open-source LightGBM framework, to obtain a fraud prediction model local_classifier that depends only on the enterprise's own attributes;
S2: Extract from the knowledge graph the enterprise attribute feature matrix with known true labels, calculate the proportion of positive samples in the one-degree neighborhood of each enterprise node in the training data set according to the node associations in the knowledge graph, concatenate this value to the enterprise attribute features, and train with the LightGBM framework again to obtain a fraud prediction model relation_classifier that incorporates neighbor label information;
S3: Input the attribute features of the enterprises with unknown labels into the model trained in S1 to obtain an initial fraud risk probability pos_probability, and store this probability as an attribute of the enterprise node to be predicted in the knowledge graph;
S4: Set a predefined maximum number of iteration rounds N, initialize the iteration counter i to 1, and let M be the number of enterprise nodes to be predicted; N, i and M are positive integers;
S5: For each enterprise node to be predicted, calculate the non-fraud probability neg_probability = 1 - pos_probability, take the difference between pos_probability and neg_probability as the confidence, and sort the nodes by the absolute value of the confidence;
S6: Select the first i × M/N enterprises for pre-classification: if the confidence is greater than 0, set the estimated label of the sample to positive; if the confidence is less than or equal to 0, set it to negative; write the estimated label back to the pred attribute in the knowledge graph;
S7: Calculate the proportion of positive samples among the one-degree neighbors of each enterprise node to be predicted, where a neighbor is counted if it is a training sample or a test node whose label was determined in the previous round, and concatenate the result to the attribute data of the enterprise;
S8: Classify with the relation_classifier from S2 to obtain the fraud risk probability pos_probability of each node for this round, and write it back to the knowledge graph to update the attribute value;
S9: Set i = i + 1 and repeat S5, S6, S7 and S8; the iteration ends when i > N or when the prediction results of the current round are the same as those of the previous round.
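The loop S3 to S9 can be sketched in plain Python as below. This is an illustration under stated assumptions, not the pseudo-code of the figures: the two classifiers are hypothetical stand-in functions rather than trained LightGBM models, the ranking is assumed to be in descending order of absolute confidence, S7 is assumed to use the labels pre-classified in S6 of the same round together with the training labels, and "same prediction result" is assumed to mean identical labels at a 0.5 threshold.

```python
def positive_ratio(node, adjacency, labels):
    """Share of positives among one-degree neighbors whose label is known or estimated."""
    known = [labels[n] for n in adjacency.get(node, ()) if n in labels]
    return sum(known) / len(known) if known else 0.0

def iterative_predict(features, adjacency, train_labels,
                      local_clf, relation_clf, max_rounds):
    unknown = [n for n in features if n not in train_labels]
    m = len(unknown)
    # S3: initial probability from the enterprise's own attributes only.
    pos = {n: local_clf(features[n]) for n in unknown}
    previous = None
    for i in range(1, max_rounds + 1):  # S4 and S9: i runs from 1 to N
        # S5: confidence = pos_probability - neg_probability.
        confidence = {n: pos[n] - (1.0 - pos[n]) for n in unknown}
        ranked = sorted(unknown, key=lambda n: abs(confidence[n]), reverse=True)
        # S6: pre-classify the first i * M / N most confident nodes.
        top = ranked[: (i * m) // max_rounds]
        pred = {n: int(confidence[n] > 0) for n in top}
        # S7: neighbor ratio from training labels plus estimated labels.
        labels = {**train_labels, **pred}
        # S8: re-score every node to be predicted with the relational model.
        pos = {n: relation_clf(features[n], positive_ratio(n, adjacency, labels))
               for n in unknown}
        current = {n: int(pos[n] > 0.5) for n in unknown}
        if current == previous:  # S9: stop when predictions are stable
            break
        previous = current
    return pos

# Hypothetical stand-ins for local_classifier and relation_classifier.
def local_clf(x):
    return x[0]

def relation_clf(x, ratio):
    return 0.5 * x[0] + 0.5 * ratio

features = {"A": [0.9], "B": [0.2], "C": [0.6], "D": [0.45]}
adjacency = {"A": {"C"}, "B": {"D"}, "C": {"A", "D"}, "D": {"B", "C"}}
train_labels = {"A": 1, "B": 0}
result = iterative_predict(features, adjacency, train_labels,
                           local_clf, relation_clf, max_rounds=2)
final_labels = {n: int(p > 0.5) for n, p in result.items()}
print(final_labels)
```

Because only the i × M/N most confident nodes are pre-classified in round i, uncertain nodes are labeled late and can benefit from the labels already assigned to their neighbors.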
It should be noted that the LightGBM algorithm in the embodiment of the present application may be replaced by any machine learning algorithm that can output a probability, including but not limited to Logistic Regression, Random Forest, XGBoost, and GBDT.
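The only property the method relies on is that the model returns a probability for the positive class. A minimal sketch of that contract follows; the interface name `ProbabilityModel`, the method name `fraud_probability`, and both toy models are hypothetical and are not the API of LightGBM or of any other library.

```python
from math import exp
from typing import Protocol, Sequence

class ProbabilityModel(Protocol):
    """Any model that can output the probability of the positive (fraud) class."""
    def fraud_probability(self, features: Sequence[float]) -> float: ...

class LogisticModel:
    """Toy logistic regression with fixed, untrained weights."""
    def __init__(self, weights, bias):
        self.weights = list(weights)
        self.bias = bias

    def fraud_probability(self, features):
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + exp(-z))

class StumpModel:
    """Toy one-split tree: the probability depends on one thresholded feature."""
    def __init__(self, index, threshold, low, high):
        self.index, self.threshold = index, threshold
        self.low, self.high = low, high

    def fraud_probability(self, features):
        return self.high if features[self.index] > self.threshold else self.low

def score_all(model: ProbabilityModel, rows):
    """Works with any model that satisfies the ProbabilityModel contract."""
    return [model.fraud_probability(row) for row in rows]

rows = [[0.0, 1.0], [2.0, 0.0]]
logistic_scores = score_all(LogisticModel([0.0, 0.0], 0.0), rows)
stump_scores = score_all(StumpModel(0, 1.0, 0.1, 0.9), rows)
print(logistic_scores, stump_scores)
```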
Fig. 3 is a schematic structural composition diagram of an anti-fraud apparatus based on a knowledge-graph according to an embodiment of the present application, and as shown in fig. 3, the anti-fraud apparatus based on a knowledge-graph includes:
an extracting unit 301, configured to extract an entity, entity attribute data, and relationship data from a data source;
a processing unit 302, configured to filter and process the entity attribute data;
a graph constructing unit 303, configured to construct a knowledge graph using the processed entity attribute data and the relationship data, where the knowledge graph includes first class nodes and second class nodes, the first class nodes are nodes of known labels, and the second class nodes are nodes of labels to be predicted;
a prediction unit 304, configured to predict labels of the second class nodes based on the knowledge-graph.
In one embodiment, the entity is an enterprise; accordingly,
the entity attribute data comprises enterprise information and personal customer information;
the relationship data includes at least one of: the correspondence between an enterprise and an individual, the correspondence between an individual and an individual, the correspondence between an enterprise and a related attribute, the correspondence between an individual and a related attribute, and the correspondence between an enterprise and an enterprise.
In one embodiment, the apparatus further comprises:
a reduction unit (not shown in the figure), configured to reduce the relationship data so that each relationship corresponds to an enterprise.
In one embodiment, the personal customer information comprises enterprise actual controller information and information on a plurality of enterprise affiliates;
the processing unit 302 is configured to aggregate the information on the plurality of enterprise affiliates to obtain affiliate aggregation features; associate the actual controller features and the affiliate aggregation features with the enterprise to obtain enterprise sample data; and perform at least one of the following processes on the enterprise sample data: outlier handling, missing-value handling, analysis of correlation between variables, and encoding of categorical variables.
In an embodiment, the graph constructing unit 303 is configured to:
taking enterprises as nodes of the knowledge graph;
taking the processed enterprise information, the enterprise actual controller information and the aggregated enterprise affiliate information together as the attributes of the nodes;
taking the reduced relationships among the enterprises as the relationships of the knowledge graph;
and deleting isolated nodes existing in the knowledge graph.
In an embodiment, the prediction unit 304 is configured to perform the following steps:
S1: training on the enterprise attribute features with known true labels to obtain a fraud prediction model local_classifier;
S2: extracting from the knowledge graph the enterprise attribute feature matrix with known true labels, calculating the proportion of positive samples in the one-degree neighborhood of each enterprise node in the training data set according to the node associations in the knowledge graph, concatenating this value to the enterprise attribute features, and training again to obtain a fraud prediction model relation_classifier that incorporates neighbor label information;
S3: inputting the attribute features of the enterprises with unknown labels into the model trained in S1 to obtain an initial fraud risk probability pos_probability, and storing this probability as an attribute of the enterprise node to be predicted in the knowledge graph;
S4: setting a predefined maximum number of iteration rounds N, initializing the iteration counter i to 1, and letting M be the number of enterprise nodes to be predicted; N, i and M are positive integers;
S5: for each enterprise node to be predicted, calculating the non-fraud probability neg_probability = 1 - pos_probability, taking the difference between pos_probability and neg_probability as the confidence, and sorting the nodes by the absolute value of the confidence;
S6: selecting the first i × M/N enterprises for pre-classification, setting the estimated label of the sample to positive if the confidence is greater than 0 and to negative if the confidence is less than or equal to 0, and writing the estimated label back to the pred attribute in the knowledge graph;
S7: calculating the proportion of positive samples among the one-degree neighbors of each enterprise node to be predicted, where a neighbor is counted if it is a training sample or a test node whose label was determined in the previous round, and concatenating the result to the attribute data of the enterprise;
S8: classifying with the relation_classifier from S2 to obtain the fraud risk probability pos_probability of each node for this round, and writing it back to the knowledge graph to update the attribute value;
S9: setting i = i + 1 and repeating S5, S6, S7 and S8; the iteration ends when i > N or when the prediction results of the current round are the same as those of the previous round.
It will be appreciated by those skilled in the art that the functions performed by the elements of the knowledge-graph based anti-fraud apparatus shown in fig. 3 may be understood by reference to the foregoing description of the knowledge-graph based anti-fraud method. The functions of the units in the knowledge-graph based anti-fraud apparatus shown in fig. 3 may be implemented by a program running on a processor, or may be implemented by specific logic circuits.
The technical solutions described in the embodiments of the present application can be arbitrarily combined without conflict.
In the several embodiments provided in the present application, it should be understood that the disclosed method and intelligent device may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may separately serve as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in the form of hardware, or in the form of hardware plus a software functional unit.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application.

Claims (10)

1. A knowledge-graph based anti-fraud method, characterized in that the method comprises:
extracting entities, entity attribute data and relationship data from a data source;
screening and processing the entity attribute data, and constructing a knowledge graph by using the processed entity attribute data and the relationship data, wherein the knowledge graph comprises first class nodes and second class nodes, the first class nodes are nodes of known labels, and the second class nodes are nodes of labels to be predicted;
predicting labels of the second class nodes based on the knowledge graph, which specifically comprises:
S1: training on the enterprise attribute features with known true labels to obtain a fraud prediction model local_classifier;
S2: extracting from the knowledge graph the enterprise attribute feature matrix with known true labels, calculating the proportion of positive samples in the one-degree neighborhood of each enterprise node in the training data set according to the node associations in the knowledge graph, concatenating the proportion of positive samples to the enterprise attribute features, and training again to obtain a fraud prediction model relation_classifier that incorporates neighbor label information;
S3: inputting the attribute features of the enterprises with unknown labels into the model trained in S1 to obtain an initial fraud risk probability pos_probability, and storing this probability as an attribute of the enterprise node to be predicted in the knowledge graph;
S4: setting a predefined maximum number of iteration rounds N, initializing the iteration counter i to 1, and letting M be the number of enterprise nodes to be predicted; N, i and M are positive integers;
S5: for each enterprise node to be predicted, calculating the non-fraud probability neg_probability = 1 - pos_probability, taking the difference between pos_probability and neg_probability as the confidence, and sorting the nodes by the absolute value of the confidence;
S6: selecting the first i × M/N enterprises for pre-classification, setting the estimated label of the sample to positive if the confidence is greater than 0 and to negative if the confidence is less than or equal to 0, and writing the estimated label back to the pred attribute in the knowledge graph;
S7: calculating the proportion of positive samples among the one-degree neighbors of each enterprise node to be predicted, where a neighbor is counted if it is a training sample or a node whose label was determined in the previous round, and concatenating the result to the attribute data of the enterprise;
S8: classifying with the relation_classifier from S2 to obtain the fraud risk probability pos_probability of each node for this round, and writing it back to the knowledge graph to update the attribute data;
S9: setting i = i + 1 and repeating S5, S6, S7 and S8; the iteration ends when i > N or when the prediction results of the current round are the same as those of the previous round.
2. The method of claim 1, wherein the entity is an enterprise; accordingly,
the entity attribute data comprises enterprise information and personal customer information;
the relationship data includes at least one of: the correspondence between an enterprise and an individual, the correspondence between an individual and an individual, the correspondence between an enterprise and a related attribute, the correspondence between an individual and a related attribute, and the correspondence between an enterprise and an enterprise.
3. The method of claim 1 or 2, wherein prior to constructing the knowledge-graph, the method further comprises:
reducing the relationship data so that each relationship corresponds to an enterprise.
4. The method of claim 2, wherein the personal customer information includes enterprise actual controller information and information on a plurality of enterprise affiliates;
the screening and processing of the entity attribute data includes:
aggregating the information on the plurality of enterprise affiliates to obtain affiliate aggregation features;
associating the actual controller features and the affiliate aggregation features with the enterprise to obtain enterprise sample data;
performing at least one of the following processes on the enterprise sample data: abnormal value processing, missing value processing, analysis of correlation between variables, and encoding of category variables.
5. The method of claim 1 or 4, wherein constructing a knowledge graph using the processed entity attribute data and the relationship data comprises:
taking enterprises as nodes of the knowledge graph;
taking the processed enterprise information, the enterprise actual controller information and the aggregated enterprise affiliate information together as the attributes of the nodes;
taking the reduced relationships among the enterprises as the relationships of the knowledge graph;
and deleting isolated nodes existing in the knowledge graph.
6. A knowledge-graph based anti-fraud apparatus, characterized in that the apparatus comprises:
an extracting unit, configured to extract entities, entity attribute data and relationship data from a data source;
a processing unit, configured to screen and process the entity attribute data;
a graph construction unit, configured to construct a knowledge graph using the processed entity attribute data and the relationship data, wherein the knowledge graph comprises first class nodes and second class nodes, the first class nodes are nodes of known labels, and the second class nodes are nodes of labels to be predicted;
a prediction unit for performing the steps of:
S1: training on the enterprise attribute features with known true labels to obtain a fraud prediction model local_classifier;
S2: extracting from the knowledge graph the enterprise attribute feature matrix with known true labels, calculating the proportion of positive samples in the one-degree neighborhood of each enterprise node in the training data set according to the node associations in the knowledge graph, concatenating the proportion of positive samples to the enterprise attribute features, and training again to obtain a fraud prediction model relation_classifier that incorporates neighbor label information;
S3: inputting the attribute features of the enterprises with unknown labels into the model trained in S1 to obtain an initial fraud risk probability pos_probability, and storing this probability as an attribute of the enterprise node to be predicted in the knowledge graph;
S4: setting a predefined maximum number of iteration rounds N, initializing the iteration counter i to 1, and letting M be the number of enterprise nodes to be predicted; N, i and M are positive integers;
S5: for each enterprise node to be predicted, calculating the non-fraud probability neg_probability = 1 - pos_probability, taking the difference between pos_probability and neg_probability as the confidence, and sorting the nodes by the absolute value of the confidence;
S6: selecting the first i × M/N enterprises for pre-classification, setting the estimated label of the sample to positive if the confidence is greater than 0 and to negative if the confidence is less than or equal to 0, and writing the estimated label back to the pred attribute in the knowledge graph;
S7: calculating the proportion of positive samples among the one-degree neighbors of each enterprise node to be predicted, where a neighbor is counted if it is a training sample or a node whose label was determined in the previous round, and concatenating the result to the attribute data of the enterprise;
S8: classifying with the relation_classifier from S2 to obtain the fraud risk probability pos_probability of each node for this round, and writing it back to the knowledge graph to update the attribute data;
S9: setting i = i + 1 and repeating S5, S6, S7 and S8; the iteration ends when i > N or when the prediction results of the current round are the same as those of the previous round.
7. The apparatus of claim 6, wherein the entity is an enterprise; accordingly,
the entity attribute data comprises enterprise information and personal customer information;
the relationship data includes at least one of: the correspondence between an enterprise and an individual, the correspondence between an individual and an individual, the correspondence between an enterprise and a related attribute, the correspondence between an individual and a related attribute, and the correspondence between an enterprise and an enterprise.
8. The apparatus of claim 6 or 7, wherein the apparatus further comprises:
a reduction unit, configured to reduce the relationship data so that each relationship corresponds to an enterprise.
9. The apparatus of claim 7, wherein the personal customer information comprises enterprise actual controller information and information on a plurality of enterprise affiliates;
the processing unit is configured to aggregate the information on the plurality of enterprise affiliates to obtain affiliate aggregation features; associate the actual controller features and the affiliate aggregation features with the enterprise to obtain enterprise sample data; and perform at least one of the following processes on the enterprise sample data: outlier handling, missing-value handling, analysis of correlation between variables, and encoding of categorical variables.
10. The apparatus according to claim 6 or 9, wherein the graph construction unit is configured to:
taking enterprises as nodes of the knowledge graph;
taking the processed enterprise information, the enterprise actual controller information and the aggregated enterprise affiliate information together as the attributes of the nodes;
taking the reduced relationships among the enterprises as the relationships of the knowledge graph;
and deleting isolated nodes existing in the knowledge graph.
CN201910415531.XA 2019-05-13 2019-05-13 Anti-fraud method and device based on knowledge graph Active CN110188198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910415531.XA CN110188198B (en) 2019-05-13 2019-05-13 Anti-fraud method and device based on knowledge graph

Publications (2)

Publication Number Publication Date
CN110188198A CN110188198A (en) 2019-08-30
CN110188198B true CN110188198B (en) 2021-06-22

Family

ID=67716779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910415531.XA Active CN110188198B (en) 2019-05-13 2019-05-13 Anti-fraud method and device based on knowledge graph

Country Status (1)

Country Link
CN (1) CN110188198B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant