WO2020192184A1 - Detecting gang fraud based on a graph model - Google Patents

Detecting gang fraud based on a graph model

Info

Publication number
WO2020192184A1
WO2020192184A1 PCT/CN2019/124807 CN2019124807W WO2020192184A1 WO 2020192184 A1 WO2020192184 A1 WO 2020192184A1 CN 2019124807 W CN2019124807 W CN 2019124807W WO 2020192184 A1 WO2020192184 A1 WO 2020192184A1
Authority
WO
WIPO (PCT)
Prior art keywords
group
determined
user
gang
data
Prior art date
Application number
PCT/CN2019/124807
Other languages
English (en)
French (fr)
Inventor
黄剑飞
陈振
Original Assignee
北京三快在线科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京三快在线科技有限公司
Publication of WO2020192184A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q20/00 Payment architectures, schemes or protocols
    • G06Q20/38 Payment protocols; Details thereof
    • G06Q20/40 Authorisation, e.g. identification of payer or payee, verification of customer or shop credentials; Review and approval of payers, e.g. check credit lines or negative lists
    • G06Q20/401 Transaction verification
    • G06Q20/4016 Transaction verification involving fraud or risk level assessment in transaction processing

Definitions

  • the present disclosure relates to the field of network technology, and in particular, to a method, device, and storage medium for detecting gang fraud based on a graph model.
  • the financial sector has high requirements for transaction risk control to ensure the safety of capital transactions.
  • fraudsters trick many ordinary consumers into transferring money to them and profit by never giving these consumers the corresponding returns.
  • high-risk fraudsters can be identified, and measures can be taken to avoid consumer loss of funds as much as possible.
  • a transaction model can be used to identify fraudsters, for example, a certain payment account is characterized as a fraudster's account, and a fund transaction conducted by the fraudster's account is characterized as a risky transaction.
  • the present disclosure provides a method, a device, and a storage medium for detecting group fraud based on a graph model to solve the technical problem that it is difficult to identify group fraud in related technologies.
  • the first aspect of the embodiments of the present disclosure provides a method for detecting gang fraud based on a graph model, the method including:
  • acquiring data of multiple users and historical suspected user data; generating a user association graph according to the acquired data, wherein the nodes of the user association graph are user association subgraphs generated according to data features, and the edge weights of the user association graph include the similarity of the nodes; based on the user association graph, using a community division algorithm to generate multiple gang sets to be determined; for each gang set to be determined, calculating the suspicion degree of the gang set to be determined; and, for each gang set to be determined, outputting the determination result of the gang set to be determined according to the calculation result of the suspicion degree.
  • generating the user association graph includes: selecting feature combinations and a number of groups from the data of the multiple users and the historical suspected user data; based on the feature combinations and the number of groups, generating user association subgraphs by feature-consistency equality or fuzzy equality; splicing the user association subgraphs as nodes to generate an unweighted user association graph; and using the similarity of the nodes in the unweighted user association graph as edge weights to generate a user similarity weight association graph, which serves as the user association graph.
  • generating the multiple gang sets to be determined by using the community division algorithm includes: generating n gang sets by using the community division algorithm based on the user association graph, where n is a positive integer; for each gang set, adjusting the gang set according to the number of users in it to obtain multiple new gang sets; and determining the multiple new gang sets as the multiple gang sets to be determined.
  • adjusting according to the number of users in the gang set includes: calling the community division algorithm to divide any gang set whose number of users is greater than a maximum threshold, so that the number of users in each new gang set is less than or equal to the maximum threshold; and, if the number of gang sets whose number of users is less than a minimum threshold is greater than a preset threshold, calling a hierarchical clustering algorithm to agglomerate the gang sets whose number of users is less than the minimum threshold.
  • the community division algorithm includes a graph label propagation algorithm or a GN algorithm;
  • the hierarchical clustering algorithm includes an aggregation algorithm or a split algorithm.
  • calculating the suspicion degree of the gang set to be determined includes: selecting target data features from the data features; and calculating the suspicion degree of the gang set to be determined according to the proportion of the target data features in the gang set to be determined.
  • calculating the suspicion degree of the gang set to be determined includes: extracting gang features of each gang set to be determined; and inputting the gang features into a trained regression model so that the regression model outputs the suspicion degree of the gang set to be determined.
  • calculating the suspicion degree of the gang set to be determined includes: selecting target data features from the data features; calculating a first suspicion score of the gang set to be determined according to the proportion of the target data features in the gang set to be determined; extracting gang features of each gang set to be determined; inputting the gang features into a trained regression model so that the regression model outputs a second suspicion score of the gang set to be determined; and calculating a comprehensive suspicion score of the gang set to be determined according to the first suspicion score and the second suspicion score.
  • a device for detecting group fraud based on a graph model includes:
  • an obtaining module, used to obtain data of multiple users and historical suspected user data; a first generation module, used to generate a user association graph according to the obtained data, wherein the nodes of the user association graph are user association subgraphs generated according to data features, and the edge weights of the user association graph include the similarity of the nodes; a second generation module, used to generate multiple gang sets to be determined based on the user association graph by using a community division algorithm; a calculation module, used to calculate, for each gang set to be determined, the suspicion degree of the gang set to be determined; and an output module, used to output, for each gang set to be determined, the determination result of the gang set to be determined according to the calculation result of the suspicion degree.
  • the first generation module includes: a first selection sub-module, used to select feature combinations and a number of groups from the data of the multiple users and the historical suspected user data; a first generation sub-module, used to generate user association subgraphs by feature-consistency equality or fuzzy equality based on the feature combinations and the number of groups, and to splice the user association subgraphs as nodes to generate an unweighted user association graph; and a second generation sub-module, used to take the similarity of the nodes in the unweighted user association graph as edge weights to generate a user similarity weight association graph, which serves as the user association graph.
  • the second generation module includes: a third generation sub-module, used to generate n gang sets based on the user association graph by using the community division algorithm, where n is a positive integer; an adjustment sub-module, used to adjust each gang set according to the number of users in it to obtain multiple new gang sets; and a confirmation sub-module, used to determine the multiple new gang sets as the multiple gang sets to be determined.
  • the adjustment sub-module further includes: a dividing unit, used to call the community division algorithm to divide any gang set whose number of users is greater than a maximum threshold, so that the number of users in each new gang set is less than or equal to the maximum threshold; and an agglomeration unit, used to call a hierarchical clustering algorithm to agglomerate the gang sets whose number of users is less than a minimum threshold if the number of such gang sets is greater than a preset threshold.
  • the community division algorithm includes a graph label propagation algorithm or a GN algorithm;
  • the hierarchical clustering algorithm includes an aggregation algorithm or a split algorithm.
  • the calculation module includes: a second selection sub-module, used to select target data features from the data features; and a first calculation sub-module, used to calculate the suspicion degree of the gang set to be determined according to the proportion of the target data features in the gang set to be determined.
  • the calculation module includes: a first extraction sub-module, used to extract gang features of each gang set to be determined; and a first input sub-module, used to input the gang features into a trained regression model so that the regression model outputs the suspicion degree of the gang set to be determined.
  • the calculation module includes: a third selection sub-module, used to select target data features from the data features; a second calculation sub-module, used to calculate a first suspicion score of the gang set to be determined according to the proportion of the target data features in the gang set to be determined; a second extraction sub-module, used to extract gang features of each gang set to be determined; a second input sub-module, used to input the gang features into a trained regression model so that the regression model outputs a second suspicion score of the gang set to be determined; and a third calculation sub-module, used to calculate a comprehensive suspicion score of the gang set to be determined according to the first suspicion score and the second suspicion score.
  • a third aspect of the embodiments of the present disclosure provides a non-volatile computer-readable storage medium on which a computer program is stored.
  • when the program is executed by a processor, the processor is caused to implement the steps of the method described in any one of the items of the above-mentioned first aspect.
  • a device for detecting gang fraud based on a graph model, including: a memory on which a computer program is stored; and a processor for executing the computer program in the memory to implement the steps of the method described in any one of the items of the foregoing first aspect.
  • the present disclosure generates a user association graph based on the acquired user data, and uses a community division algorithm to generate gang sets to be determined. By calculating the suspicion degree of each gang set to be determined, it can be determined whether that gang set is a fraud gang, which improves the accuracy of identifying gang fraud.
  • the present disclosure also uses a community division algorithm and a hierarchical clustering algorithm to solve the problem of too large a group and a large number of smaller groups in the group division result.
  • the present disclosure uses similarity indexing to improve the data processing capability of the graph model. At the same time, it uses subgraph assembly and configurable similarity edge weights to generate the user similarity weight association graph. This approach is more flexible and parallelizable, and can further improve large-scale data processing capability in fraud scenarios.
  • Fig. 1 is a flowchart of a method for detecting gang fraud based on a graph model according to an exemplary embodiment of the present disclosure.
  • Fig. 2 is a detailed flowchart of step S12 in Fig. 1 according to an exemplary embodiment.
  • Fig. 3 is a detailed flowchart of step S13 in Fig. 1 according to an exemplary embodiment.
  • Fig. 4 is a detailed flowchart of step S14 in Fig. 1 according to an exemplary embodiment.
  • Fig. 5 is a detailed flowchart of step S14 in Fig. 1 according to another exemplary embodiment.
  • Fig. 6 is a detailed flowchart of step S14 in Fig. 1 according to still another exemplary embodiment.
  • Fig. 7 is a block diagram of an apparatus for detecting gang fraud based on a graph model according to an exemplary embodiment of the present disclosure.
  • Fig. 8 is a block diagram of a first generating module of an apparatus for detecting gang fraud based on a graph model according to an exemplary embodiment of the present disclosure.
  • Fig. 9 is a block diagram of the second generation module of an apparatus for detecting gang fraud based on a graph model according to an exemplary embodiment of the present disclosure.
  • Fig. 10 is a block diagram of an adjustment sub-module of an apparatus for detecting gang fraud based on a graph model according to an exemplary embodiment of the present disclosure.
  • Fig. 11 is a block diagram of a calculation module of an apparatus for detecting gang fraud based on a graph model according to an exemplary embodiment of the present disclosure.
  • Fig. 12 is a block diagram of a calculation module of an apparatus for detecting gang fraud based on a graph model according to another exemplary embodiment of the present disclosure.
  • Fig. 13 is a block diagram of a calculation module of an apparatus for detecting gang fraud based on a graph model according to still another exemplary embodiment of the present disclosure.
  • Fig. 14 is a block diagram of a hardware device for detecting gang fraud based on a graph model according to an exemplary embodiment of the present disclosure.
  • a detection method based on black and white lists and reputation database search can be used.
  • This method requires irregular maintenance and addition of new black-and-white lists or reputation library content.
  • This maintenance method has a relatively high cost, such as a third-party paid data purchase, and the method has limited response and coverage.
  • a detection method based on a rule engine can be used. Online financial fraud methods are changeable. When fraudsters change the fraud methods, the detection methods based on the rule engine often fail, and a lot of operation and financial resources are required to update the rule engine.
  • a detection method based on supervised machine learning can be used.
  • Supervised machine learning is the most widely used learning method in fraud detection.
  • the machine learning model uses algorithms such as decision trees, random forests, support vector machines (SVM) and naive Bayes to perform complex calculations over hundreds of variables (that is, in a high-dimensional space) in order to pinpoint fraud accurately.
  • supervised machine learning methods rely on labeled data.
  • in financial fraud scenarios, labeled data is difficult to obtain and the positive and negative samples are imbalanced. Positive samples can be labeled only after fraud has occurred, and because fraud methods are changeable and samples are few, labeling becomes even more difficult. If there is insufficient labeled fraud data, the capability of supervised machine learning is limited.
  • Unsupervised learning is a branch of fraud detection that is still being explored. It is based mainly on clustering and graph methods. Current unsupervised technology is relatively mature yet difficult to apply: there is no ready-made solution that effectively applies unsupervised machine learning to fraud detection. The main difficulties are large-scale data processing and quantifying the determination of suspicion.
  • FIG. 1 is a flowchart of a method for detecting gang fraud based on a graph model according to an exemplary embodiment of the present disclosure, which aims to improve the accuracy of identifying gang fraud.
  • the detection of group fraud based on the graph model includes the following steps.
  • S12 Generate a user association graph according to the acquired data, wherein the nodes of the user association graph are user association subgraphs generated according to data characteristics, and the edge weights of the user association graph include similarity of nodes.
  • the user data may be data of various client accounts applied for by the user, such as user data submitted when applying for a Meituan account, an Alipay account, or a WeChat account.
  • the account can be associated with a bank card number applied for by the user, such as a savings card or a credit card.
  • the user data may also be data corresponding to users who use a payment platform to make payments, such as user data for payment using Meituan, user data for payment using Alipay, user data for payment using WeChat, and so on.
  • the user data also includes basic user data.
  • the user's basic data includes the information filled in by the applicant on the application form, PBC report query information, mobile terminal behavior data authorized by the applicant, e-commerce data, and social data.
  • the historical suspected user data may include black and white list information, and the black and white list may be any entity type in the network, such as account, address, phone number, etc.
  • the blacklist includes fraudulent, severely overdue, or exchange blacklists accumulated in the industry.
  • the whitelist includes VIP customers or manually marked risk-free phone numbers and addresses.
  • step S12 is executed to generate a user association graph based on the acquired data, wherein the nodes of the user association graph are user association subgraphs generated according to data features, and the edge weights of the user association graph include the similarity of the nodes.
  • generating a user association graph based on the acquired data may include the following steps.
  • S121 Select feature combinations and group numbers in the data of the multiple users and the historical suspected user data.
  • S122 Based on the feature combinations and the number of groups, generate user association subgraphs by feature-consistency equality or fuzzy equality, and splice the user association subgraphs as nodes to generate an unweighted user association graph.
  • S123 Use similarity of nodes in the user weightless association graph as edge weights, and generate a user similarity weight association graph as the user association graph.
  • the features in the data may be features such as device ID, IP address, IMSI (International Mobile Subscriber Identity), IMEI (International Mobile Equipment Identity), geographic information, login time, etc.
  • the feature combination is to select at least one feature from the features in the data as a group, and the number of groups is also at least one group.
  • After the feature combinations and the number of groups are selected, feature-consistency equality or fuzzy equality is used to associate different feature combinations and form user association subgraphs. For example, if the device IDs of different accounts are the same, the two accounts can be associated by feature-consistency equality; if the IP addresses of different accounts are the same, that is, the different accounts have logged in under the same LAN, the two accounts can be associated by fuzzy equality.
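  • As an illustration only, the association step above might be sketched as follows. The function name `build_subgraph`, the account records, and the idea of modelling fuzzy equality as equality after a normalization step (here, truncating an IPv4 address to its first three octets) are all assumptions made for this sketch and are not taken from the disclosure.

```python
from collections import defaultdict
from itertools import combinations


def build_subgraph(accounts, feature, normalize=None):
    """Link every pair of accounts that share the same value of `feature`.

    `normalize` is a hypothetical hook: when given, values are compared after
    normalization, which is one possible way to model "fuzzy equality".
    """
    by_value = defaultdict(list)
    for account_id, features in accounts.items():
        value = features.get(feature)
        if value is None:
            continue
        if normalize is not None:
            value = normalize(value)
        by_value[value].append(account_id)

    edges = set()
    for members in by_value.values():
        # Every pair of accounts sharing the value becomes one undirected edge.
        for a, b in combinations(sorted(members), 2):
            edges.add((a, b))
    return edges


# Hypothetical sample data.
accounts = {
    "acct_1": {"device_id": "D1", "ip": "10.0.0.5"},
    "acct_2": {"device_id": "D1", "ip": "10.0.0.9"},
    "acct_3": {"device_id": "D2", "ip": "10.0.0.5"},
}

device_edges = build_subgraph(accounts, "device_id")          # exact equality
ip_edges = build_subgraph(accounts, "ip")                     # exact equality
lan_edges = build_subgraph(accounts, "ip",
                           normalize=lambda ip: ip.rsplit(".", 1)[0])
```

  • Each call yields one subgraph per selected feature; the subgraphs would then be spliced into the unweighted association graph described in the text.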
  • After the user association subgraphs are generated, they are spliced as nodes to generate an unweighted user association graph. Then the similarity of each pair of nodes in the unweighted user association graph is used as the edge weight to generate the user similarity weight association graph. The similarity can be calculated using a similarity measurement function.
  • for example, the Jaccard distance or the Jaccard similarity coefficient can be used.
  • the similarity between two nodes can be used as the edge weight.
  • pruning optimization can optionally be applied when generating the user similarity weight association graph based on the edge weights.
  • the core of the pruning optimization is to set an edge weight threshold. If the weight (similarity) of an associated edge between two nodes is less than the edge weight threshold, the associated edge can be pruned.
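  • A minimal sketch of similarity weighting with pruning, assuming that each node is described by a set of feature values and that the Jaccard similarity coefficient |A ∩ B| / |A ∪ B| is the chosen similarity function. The names `jaccard`, `weighted_edges`, the sample nodes, and the threshold value 0.3 are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient of two collections, in [0, 1]."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def weighted_edges(node_features, edges, threshold):
    """Weight each edge by node similarity and prune edges below `threshold`."""
    kept = {}
    for u, v in edges:
        weight = jaccard(node_features[u], node_features[v])
        if weight >= threshold:  # edges with weight < threshold are pruned
            kept[(u, v)] = weight
    return kept


# Hypothetical nodes, each described by a set of feature values.
node_features = {
    "n1": {"D1", "ip_a", "t1"},
    "n2": {"D1", "ip_a", "t2"},
    "n3": {"D9", "ip_a", "t3"},
}
edges = [("n1", "n2"), ("n1", "n3"), ("n2", "n3")]

graph = weighted_edges(node_features, edges, threshold=0.3)
```

  • In this example only the edge between `n1` and `n2` (similarity 0.5) survives; the other two edges have similarity 0.2 and are pruned.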
  • step S13 is executed: based on the user association graph, a community division algorithm is used to generate multiple gang sets to be determined. Referring to FIG. 3, using the community division algorithm to generate the multiple gang sets to be determined includes the following steps.
  • S131 Based on the user association graph, use a community division algorithm to generate n group sets, where n is a positive integer.
  • the community division algorithm includes a graph label propagation algorithm or a GN (Girvan-Newman) algorithm.
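  • For orientation, a simplified label propagation pass over a weighted graph might look like the sketch below. It is not the disclosure's implementation: the usual algorithm visits nodes in random order and breaks ties randomly, whereas this sketch uses a fixed order and deterministic tie-breaking so that the result is reproducible. The function names and the sample graph are assumptions.

```python
from collections import defaultdict


def to_adjacency(weighted_edges):
    """Turn {(u, v): weight} into a symmetric adjacency mapping."""
    adjacency = defaultdict(dict)
    for (u, v), weight in weighted_edges.items():
        adjacency[u][v] = weight
        adjacency[v][u] = weight
    return dict(adjacency)


def label_propagation(adjacency, max_rounds=20):
    """Each node repeatedly adopts the label with the largest total edge weight
    among its neighbours; nodes sharing a final label form one gang set."""
    labels = {node: node for node in adjacency}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adjacency):  # fixed visiting order (assumption)
            votes = defaultdict(float)
            for neighbour, weight in adjacency[node].items():
                votes[labels[neighbour]] += weight
            if not votes:
                continue
            # Highest vote wins; ties go to the smallest label (assumption).
            best = min(votes, key=lambda lab: (-votes[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    groups = defaultdict(set)
    for node, label in labels.items():
        groups[label].add(node)
    return sorted(groups.values(), key=sorted)


# Two tightly linked triangles joined by one weak edge (hypothetical data).
edges = {
    ("a", "b"): 1.0, ("a", "c"): 1.0, ("b", "c"): 1.0,
    ("d", "e"): 1.0, ("d", "f"): 1.0, ("e", "f"): 1.0,
    ("c", "d"): 0.1,
}
gang_sets = label_propagation(to_adjacency(edges))
```

  • On this sample graph the weak edge between `c` and `d` is outvoted, so the two triangles end up as separate gang sets.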
  • step S132 adjust the size of the number of users in each group set so that the number of users in each group set is less than or equal to the maximum threshold, and the number of group sets with the number of users less than the minimum threshold is less than or equal to the preset Threshold.
  • the maximum threshold is greater than the minimum threshold.
  • if the number of users in a gang set is greater than the maximum threshold, the community division algorithm is used to continue dividing it so that the number of users in each resulting gang set is less than or equal to the maximum threshold. For example, assuming that the maximum threshold is 20, when the number of different accounts in a gang set exceeds 20, the community division algorithm continues to divide it so that the number of users in each gang set is less than or equal to 20. If the number of gang sets whose number of users is less than the minimum threshold is greater than the preset threshold, the hierarchical clustering algorithm is called to agglomerate the gang sets whose number of users is less than the minimum threshold.
  • hierarchical clustering can use either hierarchical agglomeration or hierarchical splitting.
  • for example, assume that the minimum threshold is 3 and the preset threshold is 15. If more than 15 gang sets contain fewer than 3 different accounts, the hierarchical clustering algorithm is called to agglomerate the gang sets with fewer than 3 users, so that the number of gang sets with fewer than 3 different accounts does not exceed 15.
  • m new gang sets are generated.
  • the number of users in each gang set is less than or equal to 20.
  • Each new group set is regarded as a group to be determined, so m groups of groups to be determined are obtained.
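  • The threshold checks that decide which adjustment applies can be sketched as below. This sketch covers only the decision logic, not the re-division or agglomeration itself; the function name `plan_adjustment` and the sample gang sets are hypothetical, while the threshold values 20, 3 and 15 are the example values from the text.

```python
def plan_adjustment(gang_sets, max_size, min_size, preset):
    """Return (to_split, to_merge).

    to_split: gang sets with more than `max_size` users, which would be passed
              back to the community division algorithm.
    to_merge: gang sets with fewer than `min_size` users, returned only when
              their count exceeds `preset`; these would be passed to the
              hierarchical clustering algorithm.
    """
    to_split = [g for g in gang_sets if len(g) > max_size]
    small = [g for g in gang_sets if len(g) < min_size]
    to_merge = small if len(small) > preset else []
    return to_split, to_merge


# Hypothetical division result: one oversized set, one normal set, 16 tiny sets.
big = {f"u{i}" for i in range(25)}
normal = {f"v{i}" for i in range(10)}
tiny = [{f"w{i}_a", f"w{i}_b"} for i in range(16)]

to_split, to_merge = plan_adjustment([big, normal] + tiny,
                                     max_size=20, min_size=3, preset=15)
```

  • With only 15 tiny sets the count would not exceed the preset threshold, and `to_merge` would be empty.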
  • step S14 is executed to calculate the suspicion degree of the group set to be determined for each group set to be determined.
  • the calculation methods of suspicion include but are not limited to the following three methods.
  • the first calculation method calculates the suspicion degree of the group to be determined, including the following steps.
  • S142a Calculate the suspicion degree of the gang set to be determined according to the proportion of the target data feature in the gang set to be determined.
  • Target data characteristics refer to core element characteristics.
  • target data characteristics can be specified based on the business and scenario.
  • the target data feature can be one or more of UUID (Universally Unique Identifier), IP address, operation location information, operation time node information, device type information, and system version information.
  • For example, for the scenario of judging false registration, the core elements can be IP address, device type and registration time. Then, for a gang set to be determined, the target data feature can be one or more of IP address, device type and registration time.
  • all feature fields in the generated user association graph can be directly used as target data features.
  • the proportion of a certain data feature in the overall data and the proportion of the data feature in the group set to be determined can be calculated.
  • the overall data refers to the data of all users. If the difference between the two proportions exceeds the target threshold, the data feature is taken as the target data feature.
  • for example, the proportion of accounts registered using virtual mobile phone numbers among all accounts is 8%.
  • in a gang set to be determined, the number of accounts is 10, of which 7 were registered using virtual mobile phone numbers. The proportion is 70%, which differs greatly from 8%.
  • registration with a virtual mobile phone number is therefore taken as the target data feature. The proportion of the target data feature in the gang set to be determined is 0.7, and this proportion can be used as the suspicion degree of the gang set to be determined.
  • as another example, the proportion of accounts registered by historical suspected users among all accounts is 8%.
  • in a gang set to be determined, the number of accounts is 10, of which 8 were registered by historical suspected users. The proportion is 80%, which differs greatly from 8%.
  • registration by a historical suspected user is therefore taken as the target data feature. The proportion of the target data feature in the gang set to be determined is 0.8, and this proportion can be used as the suspicion degree of the gang set to be determined.
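  • The proportion comparison in the examples above might be sketched as follows. The function name `proportion_suspicion` and the target threshold of 0.3 are assumptions; the 8% overall ratio and the 7-of-10 gang set are the figures from the text.

```python
def proportion_suspicion(member_flags, overall_ratio, target_threshold):
    """Return the in-set proportion of a data feature as the suspicion degree,
    or None when it does not differ enough from the overall proportion for the
    feature to count as a target data feature.

    member_flags: one boolean per account in the gang set, True if the
                  account exhibits the feature.
    """
    group_ratio = sum(member_flags) / len(member_flags)
    if abs(group_ratio - overall_ratio) > target_threshold:
        return group_ratio
    return None


# 10 accounts, 7 registered with virtual phone numbers; 8% overall.
flags = [True] * 7 + [False] * 3
suspicion = proportion_suspicion(flags, overall_ratio=0.08, target_threshold=0.3)

# A gang set whose proportion is close to the overall 8% is not flagged.
ordinary = proportion_suspicion([True] + [False] * 9,
                                overall_ratio=0.08, target_threshold=0.3)
```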
  • In the second calculation method, referring to FIG. 5, calculating the suspicion degree of the gang set to be determined may include the following steps.
  • S141b Extract the group characteristics of each group to be determined.
  • the gang characteristics include at least the characteristics of the proportion of historical suspected users, and may also include characteristics such as the size of the gang and the proportion of the number of shared device accounts.
  • the regression model may be a GBDT (Gradient Boosting Decision Tree) model.
  • In the third calculation method, referring to FIG. 6, calculating the suspicion degree of the gang set to be determined may include the following steps.
  • S142c Calculate the first suspicion score of the group to be judged according to the proportion of the target data feature in the group to be judged.
  • S143c Extract the group characteristics of each group to be determined.
  • S144c Input the gang characteristics into the trained regression model so that the regression model outputs the second suspicion score of the set of gangs to be determined.
  • S145c Calculate the comprehensive suspicion score of the group to be determined based on the first suspicion score and the second suspicion score, and use it as the suspicion of the group to be determined.
  • the judgment result of the group to be judged is output. For example, when the comprehensive suspicion score exceeds a preset value, it can be determined that the group to be determined is a fraud group.
  • the comprehensive suspicion score of the gang set to be determined can be the average of the two scores, for example 0.75. If the preset value is 0.6, the comprehensive suspicion score exceeds the preset value, and the gang set to be determined is determined to be a fraud gang.
  • the present disclosure generates a user association graph based on the acquired user data and historical suspected user data, and uses a community division algorithm to generate multiple gang sets to be determined. By calculating the suspicion degree of each gang set to be determined, it can be determined whether that gang set is a fraud gang, which improves the accuracy of identifying gang fraud.
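  • A small sketch of the combination step, assuming the comprehensive score is the plain average of the two scores and that the decision rule is "strictly greater than the preset value". The input scores 0.7 and 0.8 are hypothetical values chosen so that their average is 0.75; the function names are likewise assumptions.

```python
def comprehensive_score(first_score, second_score):
    """Average of the proportion-based and the model-based suspicion scores."""
    return (first_score + second_score) / 2


def is_fraud_gang(first_score, second_score, preset=0.6):
    """A gang set is flagged when its comprehensive score exceeds `preset`."""
    return comprehensive_score(first_score, second_score) > preset


score = comprehensive_score(0.7, 0.8)   # hypothetical first and second scores
flagged = is_fraud_gang(0.7, 0.8, preset=0.6)
```

  • Other combinations, such as a weighted average, would fit the same interface; the disclosure only states that the average can be used.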
  • the present disclosure also uses a community division algorithm and a hierarchical clustering algorithm to solve the problem of too large a group and a large number of smaller groups in the group division result.
  • the present disclosure uses similarity indexing to improve the data processing capability of the graph model, and at the same time uses subgraph assembly and similarity edge weights to generate the user similarity weight association graph as the user association graph. This approach is more flexible and can further improve large-scale data processing capability in fraud scenarios.
  • Fig. 7 shows a device for detecting gang fraud based on a graph model according to an exemplary embodiment of the present disclosure.
  • the apparatus 300 for detecting group fraud based on a graph model includes the following modules.
  • the obtaining module 310 is used to obtain data of multiple users and historical suspected user data.
  • the first generating module 320 is configured to generate a user association graph according to the acquired data, wherein the nodes of the user association graph are user association subgraphs generated according to data features, and the edge weights of the user association graph include the similarity of the nodes.
  • the second generating module 330 is configured to generate multiple gang sets to be determined by using a community division algorithm based on the user association graph.
  • the calculation module 340 is configured to calculate the suspicion degree of each gang set to be determined.
  • the output module 350 is configured to output the judgment result of the group to be judged according to the calculation result of the suspicion degree for each group to be judged.
  • the first generation module 320 includes: a first selection sub-module 321, configured to select feature combinations and a number of groups from the data of the multiple users and the historical suspected user data;
  • the first generation sub-module 322 is configured to generate user association subgraphs by feature-consistency equality or fuzzy equality based on the feature combinations and the number of groups, and to splice the user association subgraphs as nodes to generate an unweighted user association graph;
  • the second generation module 330 includes: a third generation sub-module 331, configured to generate n group sets based on the user association graph and using a community division algorithm, where n is a positive integer;
  • the adjustment sub-module 332 is used to adjust each group set according to the number of users in the group set to obtain multiple new group sets;
  • the confirmation sub-module 333 is used to combine the multiple new groups
  • the group set of is determined to be multiple groups to be determined.
  • the adjustment sub-module 332 further includes: a dividing unit 3321, configured to call the community division algorithm to divide any gang set whose number of users is greater than a maximum threshold, so that the number of users in each new gang set is less than or equal to the maximum threshold; and an agglomeration unit 3322, configured to, if the number of gang sets whose number of users is less than a minimum threshold is greater than a preset threshold, call a hierarchical clustering algorithm to agglomerate the gang sets whose number of users is less than the minimum threshold.
  • the community division algorithm includes a graph label propagation algorithm or a GN algorithm;
  • the hierarchical clustering algorithm includes an agglomerative algorithm or a divisive algorithm.
  • the calculation module 340 includes: a second selection sub-module 341a, configured to select target data features from the data features; and a first calculation sub-module 342a, configured to calculate the suspicion degree of the gang set to be determined according to the proportion of the target data features in the gang set to be determined.
  • alternatively, the calculation module 340 includes: a first extraction sub-module 341b, configured to extract the gang features of each gang set to be determined; and a first input sub-module 342b, configured to input the gang features into a trained regression model so that the regression model outputs the suspicion degree of the gang set to be determined.
  • alternatively, the calculation module 340 includes: a third selection sub-module 341c, configured to select target data features from the data features; a second calculation sub-module 342c, configured to calculate a first suspicion score of the gang set to be determined according to the proportion of the target data features in the gang set to be determined; a second extraction sub-module 343c, configured to extract the gang features of each gang set to be determined; a second input sub-module 344c, configured to input the gang features into a trained regression model so that the regression model outputs a second suspicion score of the gang set to be determined; and a third calculation sub-module 345c, configured to calculate a comprehensive suspicion score of the gang set to be determined according to the first suspicion score and the second suspicion score.
  • the present disclosure also provides a non-volatile computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the processor is caused to implement the steps of the method for detecting gang fraud based on a graph model in any of the above optional embodiments.
  • the present disclosure also provides an apparatus for detecting gang fraud based on a graph model, including: a memory on which a computer program is stored; and a processor for executing the computer program in the memory to implement the steps of the method for detecting gang fraud based on a graph model in any of the above optional embodiments.
  • Fig. 14 is a block diagram showing an apparatus 400 for detecting gang fraud based on a graph model according to an exemplary embodiment.
  • the device 400 may include: a processor 401, a memory 402, a multimedia component 403, an input/output (Input/Output) interface 404, and a communication component 405.
  • the processor 401 is used to control the overall operation of the device 400 to complete all or part of the steps in the method for detecting gang fraud based on the graph model.
  • the memory 402 is used to store various types of data to support operations on the device 400, and these data may include, for example, instructions for any application or method operating on the device 400, and application-related data.
  • the memory 402 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
  • the multimedia component 403 may include a screen and an audio component.
  • the screen may be a touch screen, for example, and the audio component is used to output and/or input audio signals.
  • the audio component may include a microphone, which is used to receive external audio signals.
  • the received audio signal can be further stored in the memory 402 or sent through the communication component 405.
  • the audio component also includes at least one speaker for outputting audio signals.
  • the I/O interface 404 provides an interface between the processor 401 and other interface modules.
  • the above-mentioned other interface modules may be keyboards, mice, buttons, and so on. These buttons can be virtual buttons or physical buttons.
  • the communication component 405 is used for wired or wireless communication between the apparatus 400 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, or 4G, or a combination of one or more of them; accordingly, the communication component 405 may include a Wi-Fi module, a Bluetooth module, and an NFC module.
  • in an exemplary embodiment, the device 400 may be implemented by one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors or other electronic components, for performing the above-mentioned method of detecting gang fraud based on the graph model.
  • in another exemplary embodiment, a computer-readable storage medium including program instructions is also provided, such as the memory 402 including program instructions. The program instructions can be executed by the processor 401 of the device 400 to complete the above-mentioned method of detecting gang fraud based on the graph model.

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Computer Security & Cryptography (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

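A method and apparatus for detecting gang fraud based on a graph model, and a storage medium. The method for detecting gang fraud based on a graph model includes: obtaining data of multiple users and historical suspect user data (S11); generating a user association graph according to the obtained data (S12), wherein the nodes of the user association graph are user association subgraphs generated according to data features, and the edge weights of the user association graph include the similarity of the nodes; generating multiple gang sets to be determined by using a community division algorithm based on the user association graph (S13); for each gang set to be determined, calculating the suspicion degree of the gang set to be determined (S14); and for each gang set to be determined, outputting a determination result for the gang to be determined according to the calculated suspicion degree (S15).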

Description

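Detecting gang fraud based on a graph model
Technical Field
The present disclosure relates to the field of network technology, and in particular to a method and apparatus for detecting gang fraud based on a graph model, and a storage medium.
Background
The financial sector places high demands on transaction risk control in order to ensure the security of capital transactions. In practice, fraud may occur. For example, a fraudster tricks many ordinary consumers into transferring money but does not give those consumers the corresponding return, and profits in this way. To identify such fraud, high-risk fraudsters can be identified so that measures can be taken to avoid consumers' financial losses as far as possible. In one example, a transaction model can be used to identify fraudsters, for example by characterizing a payment account as a fraudster account and characterizing the capital transactions made by that account as risky transactions.
Summary
The present disclosure provides a method and apparatus for detecting gang fraud based on a graph model, and a storage medium, to solve the technical problem in the related art that gang fraud is difficult to identify.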
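To achieve the above objective, a first aspect of the embodiments of the present disclosure provides a method for detecting gang fraud based on a graph model, the method including:
obtaining data of multiple users and historical suspect user data; generating a user association graph according to the obtained data, wherein the nodes of the user association graph are user association subgraphs generated according to data features, and the edge weights of the user association graph include the similarity of the nodes; generating multiple gang sets to be determined by using a community division algorithm based on the user association graph; for each gang set to be determined, calculating the suspicion degree of the gang set to be determined; and for each gang set to be determined, outputting a determination result for the gang to be determined according to the calculated suspicion degree.
Optionally, generating the user association graph includes: selecting feature combinations and a number of groups from the data of the multiple users and the historical suspect user data; generating user association subgraphs based on the feature combinations and the number of groups, using exact feature matching (consistency equality) or fuzzy feature matching (fuzzy equality); splicing the user association subgraphs as nodes to generate an unweighted user association graph; and using the similarity of the nodes in the unweighted user association graph as edge weights to generate a similarity-weighted user association graph as the user association graph.
Optionally, generating the multiple gang sets to be determined by using the community division algorithm includes: generating n gang sets by using the community division algorithm based on the user association graph, where n is a positive integer; for each gang set, adjusting it according to the number of users in the gang set to obtain multiple new gang sets; and determining the multiple new gang sets as the multiple gang sets to be determined.
Optionally, adjusting according to the number of users in the gang set includes: for a gang set whose number of users is greater than a maximum threshold, calling the community division algorithm to divide it so that the number of users in each new gang set is less than or equal to the maximum threshold; and if the number of gang sets whose number of users is less than a minimum threshold is greater than a preset threshold, calling a hierarchical clustering algorithm to agglomerate the gang sets whose number of users is less than the minimum threshold.
Optionally, the community division algorithm includes a graph label propagation algorithm or the GN algorithm; the hierarchical clustering algorithm includes an agglomerative algorithm or a divisive algorithm.
Optionally, calculating the suspicion degree of the gang set to be determined includes: selecting target data features from the data features; and calculating the suspicion degree of the gang set to be determined according to the proportion of the target data features in the gang set to be determined.
Optionally, calculating the suspicion degree of the gang set to be determined includes: extracting the gang features of each gang set to be determined; and inputting the gang features into a trained regression model so that the regression model outputs the suspicion degree of the gang set to be determined.
Optionally, calculating the suspicion degree score of the gang set to be determined includes: selecting target data features from the data features; calculating a first suspicion score of the gang set to be determined according to the proportion of the target data features in the gang set to be determined; extracting the gang features of each gang set to be determined; inputting the gang features into a trained regression model so that the regression model outputs a second suspicion score of the gang set to be determined; and calculating a comprehensive suspicion score of the gang set to be determined according to the first suspicion score and the second suspicion score.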
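A second aspect of the embodiments of the present disclosure provides an apparatus for detecting gang fraud based on a graph model, the apparatus including:
an obtaining module, configured to obtain data of multiple users and historical suspect user data; a first generating module, configured to generate a user association graph according to the obtained data, wherein the nodes of the user association graph are user association subgraphs generated according to features of the data, and the edge weights of the user association graph include the similarity of the nodes; a second generating module, configured to generate multiple gang sets to be determined by using a community division algorithm based on the user association graph; a calculation module, configured to calculate, for each gang set to be determined, the suspicion degree of the gang set to be determined; and an output module, configured to output, for each gang set to be determined, a determination result for the gang to be determined according to the calculated suspicion degree.
Optionally, the first generating module includes: a first selection sub-module, configured to select feature combinations and a number of groups from the data of the multiple users and the historical suspect user data; a first generation sub-module, configured to generate user association subgraphs based on the feature combinations and the number of groups, using exact feature matching (consistency equality) or fuzzy feature matching (fuzzy equality), and to splice the user association subgraphs as nodes to generate an unweighted user association graph; and a second generation sub-module, configured to use the similarity of the nodes in the unweighted user association graph as edge weights to generate a similarity-weighted user association graph as the user association graph.
Optionally, the second generating module includes: a third generation sub-module, configured to generate n gang sets by using the community division algorithm based on the user association graph, where n is a positive integer; an adjustment sub-module, configured to adjust each gang set according to the number of users in the gang set to obtain multiple new gang sets; and a third confirmation sub-module, configured to determine the multiple new gang sets as the multiple gang sets to be determined.
Optionally, the adjustment sub-module further includes: a dividing unit, configured to call the community division algorithm to divide a gang set whose number of users is greater than a maximum threshold, so that the number of users in each new gang set is less than or equal to the maximum threshold; and an agglomeration module, configured to, if the number of gang sets whose number of users is less than a minimum threshold is greater than a preset threshold, call a hierarchical clustering algorithm to agglomerate the gang sets whose number of users is less than the minimum threshold.
Optionally, the community division algorithm includes a graph label propagation algorithm or the GN algorithm; the hierarchical clustering algorithm includes an agglomerative algorithm or a divisive algorithm.
Optionally, the calculation module includes: a second selection sub-module, configured to select target data features from the data features; and a first calculation sub-module, configured to calculate the suspicion degree of the gang set to be determined according to the proportion of the target data features in the gang set to be determined.
Optionally, the calculation module includes: a first extraction sub-module, configured to extract the gang features of each gang set to be determined; and a first input sub-module, configured to input the gang features into a trained regression model so that the regression model outputs the suspicion degree of the gang set to be determined.
Optionally, the calculation module includes: a third selection sub-module, configured to select target data features from the data features; a second calculation sub-module, configured to calculate a first suspicion score of the gang set to be determined according to the proportion of the target data features in the gang set to be determined; a second extraction sub-module, configured to extract the gang features of each gang set to be determined; a second input sub-module, configured to input the gang features into a trained regression model so that the regression model outputs a second suspicion score of the gang set to be determined; and a third calculation sub-module, configured to calculate a comprehensive suspicion score of the gang set to be determined according to the first suspicion score and the second suspicion score.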
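A third aspect of the embodiments of the present disclosure provides a non-volatile computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the processor is caused to implement the steps of the method according to any one of the above first aspect.
A fourth aspect of the embodiments of the present disclosure provides an apparatus for detecting gang fraud based on a graph model, including: a memory on which a computer program is stored; and a processor for executing the computer program in the memory to implement the steps of the method according to any one of the above first aspect.
With the above technical solutions, at least the following technical effects can be achieved: the present disclosure generates a user association graph from the obtained user data and uses a community division algorithm to generate gang sets to be determined; by calculating the suspicion degree of a gang set to be determined, it can be distinguished whether that gang set is a fraud gang, which improves the accuracy of identifying gang fraud. In addition, the present disclosure uses a community division algorithm and a hierarchical clustering algorithm to address gang division results in which some gangs are too large or there are many very small gangs. Furthermore, the present disclosure uses similarity indexing to improve the data processing capability of the graph model, and generates the similarity-weighted user association graph through subgraph assembly and configurable similarity edge weights; this approach is more flexible and parallelizable, and can further improve large-scale data processing capability in fraud scenarios.
Other features and advantages of the present disclosure will be described in detail in the Detailed Description section below.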
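Brief Description of the Drawings
The accompanying drawings are provided for further understanding of the present disclosure and constitute a part of the specification. Together with the following detailed description, they serve to explain the present disclosure, but do not limit it.
Fig. 1 is a flowchart of a method for detecting gang fraud based on a graph model according to an exemplary embodiment of the present disclosure.
Fig. 2 is a detailed flowchart of step S12 in Fig. 1 according to an exemplary embodiment.
Fig. 3 is a detailed flowchart of step S13 in Fig. 1 according to an exemplary embodiment.
Fig. 4 is a detailed flowchart of step S14 in Fig. 1 according to an exemplary embodiment.
Fig. 5 is a detailed flowchart of step S14 in Fig. 1 according to another exemplary embodiment.
Fig. 6 is a detailed flowchart of step S14 in Fig. 1 according to yet another exemplary embodiment.
Fig. 7 is a block diagram of an apparatus for detecting gang fraud based on a graph model according to an exemplary embodiment of the present disclosure.
Fig. 8 is a block diagram of the first generating module of an apparatus for detecting gang fraud based on a graph model according to an exemplary embodiment of the present disclosure.
Fig. 9 is a block diagram of the second generating module of an apparatus for detecting gang fraud based on a graph model according to an exemplary embodiment of the present disclosure.
Fig. 10 is a block diagram of the adjustment sub-module of an apparatus for detecting gang fraud based on a graph model according to an exemplary embodiment of the present disclosure.
Fig. 11 is a block diagram of the calculation module of an apparatus for detecting gang fraud based on a graph model according to an exemplary embodiment of the present disclosure.
Fig. 12 is a block diagram of the calculation module of an apparatus for detecting gang fraud based on a graph model according to another exemplary embodiment of the present disclosure.
Fig. 13 is a block diagram of the calculation module of an apparatus for detecting gang fraud based on a graph model according to yet another exemplary embodiment of the present disclosure.
Fig. 14 is a block diagram of a hardware apparatus for detecting gang fraud based on a graph model according to an exemplary embodiment of the present disclosure.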
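Detailed Description
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here are only used to illustrate and explain the present disclosure and are not intended to limit it.
To cope with ubiquitous attacks, fraud detection is now critically important. The following detection methods can be used for financial fraud detection.
In one example, a detection method based on blacklist, whitelist, and reputation database lookup can be used. This method requires adding new blacklist, whitelist, or reputation database content from time to time; such maintenance is costly (for example, paid purchase of third-party data), and the responsiveness and coverage of the method are limited.
In another example, a detection method based on a rule engine can be used. Online financial fraud methods change constantly; once fraudsters change their methods, rule-engine-based detection often fails, and substantial operational and financial resources must be invested in updating the rule engine.
In yet another example, a detection method based on supervised machine learning can be used. Supervised machine learning is the most widely used learning approach in fraud detection. Machine learning models typically use algorithms such as decision trees, random forests, support vector machines (SVM), and naive Bayes to perform complex computations over hundreds of variables (that is, a high-dimensional space) and accurately pinpoint fraud. However, supervised machine learning depends on labeled data, which is hard to obtain in financial fraud scenarios and suffers from an imbalance between positive and negative samples: positive samples exist only after fraud has occurred and been labeled, and because fraud methods change frequently and samples are few, labeling is difficult. Without enough labeled fraud data, the capability of supervised machine learning is limited.
In still another example, a detection method based on unsupervised learning can be used. Unsupervised learning is a branch currently being explored for fraud detection, researched mainly on the basis of clustering and graph methods. Unsupervised techniques are currently immature and difficult to apply, and there is no ready-made solution that effectively applies unsupervised machine learning to fraud detection. The main difficulties lie in large-scale data processing and in quantifying suspicion judgments.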
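Fig. 1 is a flowchart of a method for detecting gang fraud based on a graph model according to an exemplary embodiment of the present disclosure, intended to improve the identification of gang fraud. As shown in Fig. 1, the method for detecting gang fraud based on a graph model includes the following steps.
S11: obtain data of multiple users and historical suspect user data.
S12: generate a user association graph according to the obtained data, wherein the nodes of the user association graph are user association subgraphs generated according to data features, and the edge weights of the user association graph include the similarity of the nodes.
S13: generate multiple gang sets to be determined by using a community division algorithm based on the user association graph.
S14: for each gang set to be determined, calculate the suspicion degree of the gang set to be determined.
S15: for each gang set to be determined, output a determination result for the gang to be determined according to the calculated suspicion degree.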
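In step S11, the user data may be data of various client accounts applied for by users, such as user data from applying for a Meituan account, an Alipay account, or a WeChat account. An account may be associated with a bank card applied for by the user, such as a debit card or a credit card. The user data may also be data corresponding to users who pay through a payment platform, such as data of users paying through Meituan, Alipay, or WeChat. Furthermore, the user data also includes the users' basic data, which includes the application form information filled in by the applicant, People's Bank of China report query information, mobile behavior data authorized by the applicant, e-commerce data, and social data. The historical suspect user data may include blacklist and whitelist information. Blacklist and whitelist entries may be any type of entity in the network, such as accounts, addresses, and telephone numbers. The blacklist includes fraud cases and serious delinquencies accumulated within the institution, or exchanged blacklists; the whitelist includes VIP customers, or telephone numbers, addresses, and the like that have been manually marked as risk-free.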
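After the data of multiple users and the historical suspect user data are obtained, step S12 is performed: a user association graph is generated according to the obtained data, wherein the nodes of the user association graph are user association subgraphs generated according to data features, and the edge weights of the user association graph include the similarity of the nodes.
Referring to Fig. 2, generating the user association graph according to the obtained data may include the following steps.
S121: select feature combinations and a number of groups from the data of the multiple users and the historical suspect user data.
S122: based on the feature combinations and the number of groups, generate user association subgraphs using exact feature matching (consistency equality) or fuzzy feature matching (fuzzy equality), and splice the user association subgraphs as nodes to generate an unweighted user association graph.
S123: use the similarity of the nodes in the unweighted user association graph as edge weights to generate a similarity-weighted user association graph as the user association graph.
In step S121, the features in the data may be features such as device ID, IP address, IMSI (International Mobile Subscriber Identity), IMEI (International Mobile Equipment Identity), geographic information, and login time. A feature combination is a group of at least one feature selected from the features in the data, and the number of groups is at least one.
After the feature combinations and the number of groups are selected, exact or fuzzy feature matching is used to associate different feature combinations and form user association subgraphs. For example, if different accounts log in from the same device ID, exact feature matching can be used to associate the two accounts; if different accounts log in from IP addresses that are partly the same, that is, different accounts have logged in from the same local area network, fuzzy feature matching can be used to associate the two accounts. After the user association subgraphs are generated, they are spliced as nodes to generate the unweighted user association graph. Next, the similarity of the nodes in the unweighted user association graph is used as the edge weight to generate the similarity-weighted user association graph. The similarity can be calculated with a similarity measure function, for example the Jaccard similarity coefficient. The similarity between two nodes is used as the edge weight, and pruning based on the edge weights can optionally be applied when generating the similarity-weighted user association graph. The core of this pruning is to set an edge weight threshold: if the weight (similarity) of the association edge between two nodes is less than the edge weight threshold, that association edge can be cut.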
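The construction just described (link by exact or fuzzy feature equality, weight edges by Jaccard similarity, prune weak edges) can be sketched roughly as follows. This is a simplified illustration, not the patented implementation: it treats each account as a node rather than a subgraph, and the record schema (`device_ids`, `ips`), the `/24`-style IP prefix used as the fuzzy key, and the `min_weight` default are all assumptions made for the example.

```python
from itertools import combinations

def ip_prefix(ip: str) -> str:
    """Fuzzy key: the first three octets of an IPv4 address (assumed to mean 'same LAN')."""
    return ".".join(ip.split(".")[:3])

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity coefficient |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_weighted_graph(accounts: dict, min_weight: float = 0.3) -> dict:
    """Link accounts by exact or fuzzy feature equality, then weight and prune edges.

    accounts maps account id -> {"device_ids": set, "ips": set} (hypothetical schema).
    Returns {(u, v): weight} for the edges whose weight is >= min_weight.
    """
    edges = {}
    for u, v in combinations(sorted(accounts), 2):
        fu, fv = accounts[u], accounts[v]
        exact = bool(fu["device_ids"] & fv["device_ids"])          # same device ID
        fuzzy = bool({ip_prefix(i) for i in fu["ips"]}
                     & {ip_prefix(i) for i in fv["ips"]})          # same IP prefix
        if not (exact or fuzzy):
            continue  # no association edge in the unweighted graph
        # Edge weight: Jaccard similarity over all observed feature values.
        weight = jaccard(fu["device_ids"] | fu["ips"], fv["device_ids"] | fv["ips"])
        if weight >= min_weight:  # pruning: drop edges below the threshold
            edges[(u, v)] = weight
    return edges

accounts = {
    "a": {"device_ids": {"d1"}, "ips": {"10.0.0.5"}},
    "b": {"device_ids": {"d1"}, "ips": {"10.0.0.9"}},
    "c": {"device_ids": {"d2"}, "ips": {"10.0.0.7"}},
    "d": {"device_ids": {"d3"}, "ips": {"192.168.1.2"}},
}
graph = build_weighted_graph(accounts, min_weight=0.3)
```

In this toy data, accounts `a` and `b` share a device and keep their edge, while `c` is linked to them only through the fuzzy IP match and its zero-similarity edges are pruned.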
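After the user association graph is generated, step S13 is performed: multiple gang sets to be determined are generated by using a community division algorithm based on the user association graph. Referring to Fig. 3, generating the multiple gang sets to be determined by using the community division algorithm includes the following steps.
S131: generate n gang sets by using a community division algorithm based on the user association graph, where n is a positive integer. The community division algorithm includes a graph label propagation algorithm or the GN (Girvan-Newman) algorithm.
S132: for each gang set, adjust it according to its number of users to obtain multiple new gang sets.
S133: determine the multiple new gang sets as the multiple gang sets to be determined.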
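As a rough illustration of step S131, the following is a minimal weighted label propagation sketch. It is a deterministic teaching variant (fixed node order, ties broken by the smallest label), not the specific algorithm used in the disclosure; the adjacency format and the `max_iter` default are assumptions.

```python
from collections import Counter

def label_propagation(adj: dict, max_iter: int = 20) -> list:
    """Weighted label propagation over adj: node -> {neighbor: edge weight}.

    Each node repeatedly adopts the label with the largest total edge weight
    among its neighbors. Returns the resulting communities as a list of sets.
    """
    labels = {node: node for node in adj}  # every node starts in its own community
    for _ in range(max_iter):
        changed = False
        for node in sorted(adj):  # fixed visiting order keeps the sketch deterministic
            votes = Counter()
            for neighbor, weight in adj[node].items():
                votes[labels[neighbor]] += weight
            if not votes:
                continue  # isolated node keeps its own label
            best = max(votes.values())
            tied = sorted(label for label, score in votes.items() if score == best)
            if labels[node] in tied:
                continue  # keep the current label when it is already a best choice
            labels[node] = tied[0]
            changed = True
        if not changed:
            break
    groups = {}
    for node, label in labels.items():
        groups.setdefault(label, set()).add(node)
    return sorted(groups.values(), key=sorted)

# Two tightly linked triangles joined by one weak edge (c-d).
adj = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0, "d": 0.1},
    "d": {"c": 0.1, "e": 1.0, "f": 1.0},
    "e": {"d": 1.0, "f": 1.0},
    "f": {"d": 1.0, "e": 1.0},
}
communities = label_propagation(adj)
```

Production label propagation usually visits nodes in random order; the fixed order here only makes the example reproducible.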
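In step S132, the size of each gang set is adjusted so that the number of users in every gang set is less than or equal to the maximum threshold, and the number of gang sets whose number of users is less than the minimum threshold is less than or equal to the preset threshold. The maximum threshold is greater than the minimum threshold.
When the number of users (for example, the number of distinct accounts) in a gang set is greater than the maximum threshold, the community division algorithm continues to be used to divide it so that the number of users in each resulting gang set is less than or equal to the maximum threshold. For example, assuming the maximum threshold is 20, when the number of distinct accounts in a gang set exceeds 20, the community division algorithm continues to be applied until the number of users in each resulting gang set is less than or equal to 20. If the number of gang sets whose number of users is less than the minimum threshold is greater than the preset threshold, a hierarchical clustering algorithm is called to agglomerate the gang sets whose number of users is less than the minimum threshold; the hierarchical clustering here may be the agglomerative method or the divisive method. For example, assuming the minimum threshold is 3 and the preset threshold is 15, if more than 15 gang sets have fewer than 3 distinct accounts, the hierarchical clustering algorithm is called to agglomerate the gang sets with fewer than 3 users, so that the number of gang sets with fewer than 3 distinct accounts does not exceed 15.
The example continues with a maximum threshold of 20, a minimum threshold of 3, and a preset threshold of 15. After steps S131 and S132, m new gang sets are generated. Among these m new gang sets, each gang set has at most 20 users, and no more than 15 gang sets have fewer than 3 users. Each new gang set is taken as one gang set to be determined, so m gang sets to be determined are obtained.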
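The size adjustment of step S132 can be sketched as below. This is an illustrative outline under stated assumptions: `split_fn` stands in for a further run of the community division algorithm and must return at least two non-empty disjoint parts, `similarity` is a hypothetical group-to-group similarity used for a greedy agglomerative merge, and the defaults mirror the worked thresholds (20, 3, 15).

```python
def adjust_groups(groups, split_fn, similarity, max_size=20, min_size=3, max_small=15):
    """Split oversized gang sets, then merge undersized ones if there are too many.

    split_fn(group) -> iterable of at least two non-empty disjoint parts (assumed).
    similarity(g1, g2) -> number; larger means the two sets are closer (assumed).
    """
    # Step 1: keep dividing any gang set larger than max_size.
    pending = [set(g) for g in groups]
    sized = []
    while pending:
        group = pending.pop()
        if len(group) > max_size:
            parts = [set(p) for p in split_fn(group)]
            if len(parts) < 2 or any(not p for p in parts):
                raise ValueError("split_fn must return at least two non-empty parts")
            pending.extend(parts)
        else:
            sized.append(group)

    # Step 2: if too many gang sets are smaller than min_size, merge the closest pairs.
    small = [g for g in sized if len(g) < min_size]
    rest = [g for g in sized if len(g) >= min_size]
    while len(small) > max_small and len(small) >= 2:
        i, j = max(
            ((i, j) for i in range(len(small)) for j in range(i + 1, len(small))),
            key=lambda pair: similarity(small[pair[0]], small[pair[1]]),
        )
        merged = small[i] | small[j]
        small = [g for k, g in enumerate(small) if k not in (i, j)]
        (small if len(merged) < min_size else rest).append(merged)
    return rest + small

def halve(group):
    """Stand-in splitter for the example: cut the sorted members into two halves."""
    members = sorted(group)
    mid = len(members) // 2
    return [set(members[:mid]), set(members[mid:])]

adjusted = adjust_groups(
    [{1, 2, 3, 4, 5, 6}, {10}, {11}, {20}],
    split_fn=halve,
    similarity=lambda g1, g2: -abs(min(g1) - min(g2)),  # closer ids count as more similar
    max_size=4, min_size=2, max_small=1,
)
```

The sketch does not re-check merged sets against `max_size`; that is safe only when `2 * (min_size - 1)` does not exceed `max_size`, as with the thresholds 3 and 20 in the text.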
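After the gang sets to be determined are generated, step S14 is performed: for each gang set to be determined, the suspicion degree of the gang set is calculated. The ways of calculating the suspicion degree include, but are not limited to, the following three.
In the first calculation method, referring to Fig. 4, calculating the suspicion degree of the gang set to be determined includes the following steps.
S141a: select target data features from the data features.
S142a: calculate the suspicion degree of the gang set to be determined according to the proportion of the target data features in the gang set to be determined.
Target data features are core element features. In one embodiment, the target data features can be specified based on the business and scenario. The target data features may be one or more of UUID (Universally Unique Identifier), IP address, operation location information, operation time information, device type information, and system version information. For example, for the scenario of detecting fake registrations, the core elements may be IP address, device type, and registration time, so for a gang set to be determined the target data features may be one or more of IP address, device type, and registration time.
In another embodiment, for simplicity, all the feature fields used to generate the user association graph can be used directly as the target data features.
In yet another embodiment, the proportion of a data feature in the overall data and the proportion of that data feature in the gang set to be determined can both be calculated, where the overall data is the data of all users. If the difference between the two proportions exceeds a target threshold, that data feature is taken as a target data feature.
For example, suppose a client has 100 accounts, 8 of which were registered with virtual phone numbers, so accounts registered with virtual phone numbers make up 8% of all accounts. In one generated gang set to be determined there are 10 accounts, 7 of which were registered with virtual phone numbers, a proportion of 70%. Compared with 8%, 70% is a large difference. Registration with a virtual phone number is therefore taken as the target data feature; its proportion in the gang set to be determined is 0.7, and this proportion can be used as the suspicion degree of the gang set to be determined.
Alternatively, suppose a client has 100 accounts, 8 of which were registered by historical suspect users, so accounts registered by historical suspect users make up 8% of all accounts. In one generated gang set to be determined there are 10 accounts, 8 of which were registered by historical suspect users, a proportion of 80%. Compared with 8%, 80% is a large difference. Registration by a historical suspect user is therefore taken as the target data feature; its proportion in the gang set to be determined is 0.8, and this proportion can be used as the suspicion degree of the gang set to be determined.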
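The proportion-based calculation in the two examples above can be written out as a short sketch. The per-account boolean flags, the `min_gap` target threshold of 0.3, and the use of the maximum when several target features are selected are assumptions for illustration; the text only states that the proportion may be used as the suspicion degree.

```python
def feature_ratio(flags) -> float:
    """Share of accounts for which the feature flag is True."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0

def select_target_features(overall: dict, group: dict, min_gap: float = 0.3) -> list:
    """Pick features whose share in the gang set differs from the overall share by more than min_gap.

    overall and group map feature name -> list of booleans, one per account (hypothetical layout).
    """
    return [
        name for name in group
        if abs(feature_ratio(group[name]) - feature_ratio(overall[name])) > min_gap
    ]

def ratio_suspicion(group: dict, targets: list) -> float:
    """Suspicion degree as the largest target-feature share (the aggregation is an assumption)."""
    if not targets:
        return 0.0
    return max(feature_ratio(group[name]) for name in targets)

# 100 accounts overall, 8 registered with virtual phone numbers; 7 of 10 in the gang set.
overall = {
    "virtual_phone": [True] * 8 + [False] * 92,
    "android": [True] * 50 + [False] * 50,
}
group = {
    "virtual_phone": [True] * 7 + [False] * 3,
    "android": [True] * 5 + [False] * 5,
}
targets = select_target_features(overall, group)
suspicion = ratio_suspicion(group, targets)
```

Here only `virtual_phone` is selected, because its share jumps from 8% overall to 70% in the gang set, while the hypothetical `android` flag has the same share in both.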
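In the second calculation method, referring to Fig. 5, calculating the suspicion degree of the gang set to be determined may include the following steps.
S141b: extract the gang features of each gang set to be determined. The gang features include at least the proportion of historical suspect users, and may also include features such as gang size and the proportion of accounts that share a device.
S142b: input the gang features into a trained regression model so that the regression model outputs the suspicion degree of the gang set to be determined. The regression model may be a GBDT (Gradient Boosting Decision Tree) model.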
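A minimal sketch of steps S141b and S142b follows. The feature vector layout, the `device_of` mapping, and the clamping of the output to [0, 1] are assumptions; `model_predict` is a hypothetical callable that would wrap a trained regression model such as a GBDT, and the lambda used below is only a stand-in so the example runs without a trained model.

```python
from collections import Counter

def gang_features(members, suspects, device_of) -> list:
    """Return [historical-suspect ratio, gang size, shared-device account ratio].

    members: account ids in the gang set; suspects: set of historical suspect ids;
    device_of: account id -> device id (hypothetical inputs).
    """
    size = len(members)
    if size == 0:
        return [0.0, 0, 0.0]
    suspect_ratio = sum(1 for m in members if m in suspects) / size
    device_counts = Counter(device_of[m] for m in members)
    shared = sum(1 for m in members if device_counts[device_of[m]] > 1)
    return [suspect_ratio, size, shared / size]

def score_gang(model_predict, features) -> float:
    """Feed the gang features to a regression model and clamp the result to [0, 1]."""
    return min(1.0, max(0.0, float(model_predict(features))))

features = gang_features(
    members=["a", "b", "c", "d"],
    suspects={"a", "b"},
    device_of={"a": "d1", "b": "d1", "c": "d2", "d": "d3"},
)
# Stand-in for a trained model: a fixed weighted sum of two of the features.
stand_in_model = lambda f: 0.6 * f[0] + 0.4 * f[2]
second_score = score_gang(stand_in_model, features)
```

In practice `model_predict` would be fitted on labeled gang sets; the weights in the stand-in carry no meaning.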
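In the third calculation method, referring to Fig. 6, calculating the suspicion degree of the gang set to be determined may include the following steps.
S141c: select target data features from the data features.
S142c: calculate a first suspicion score of the gang set to be determined according to the proportion of the target data features in the gang set to be determined.
S143c: extract the gang features of each gang set to be determined.
S144c: input the gang features into a trained regression model so that the regression model outputs a second suspicion score of the gang set to be determined.
S145c: calculate a comprehensive suspicion score of the gang set to be determined according to the first suspicion score and the second suspicion score, and use it as the suspicion degree of the gang set to be determined.
The determination result for the gang to be determined is then output according to the calculation result. For example, when the comprehensive suspicion score exceeds a preset value, the gang to be determined can be judged to be a fraud gang.
For example, if the first suspicion score of a gang to be determined is 0.7 and its second suspicion score is 0.8, the comprehensive suspicion score of the gang set can be taken as the average of the two scores, 0.75, which exceeds the preset value of 0.6, so the gang to be determined is a fraud gang.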
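The worked numbers above can be checked with a few lines. The `weight` parameter is an assumption added for illustration (the text only mentions taking the average, which corresponds to `weight=0.5`), and "exceeds" is read as a strict comparison.

```python
def combined_suspicion(first: float, second: float, weight: float = 0.5) -> float:
    """Weighted combination of the two scores; weight=0.5 gives the plain average."""
    return weight * first + (1.0 - weight) * second

def is_fraud_gang(score: float, threshold: float = 0.6) -> bool:
    """A gang set is flagged when its score strictly exceeds the preset value."""
    return score > threshold

combined = combined_suspicion(0.7, 0.8)   # average of the two example scores
flagged = is_fraud_gang(combined)         # compared against the preset value 0.6
```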
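The present disclosure generates a user association graph from the obtained user data and historical suspect user data, and uses a community division algorithm to generate multiple gang sets to be determined. For each gang set to be determined, calculating its suspicion degree makes it possible to distinguish whether that gang set is a fraud gang, which improves the accuracy of identifying gang fraud. In addition, the present disclosure uses a community division algorithm and a hierarchical clustering algorithm to address gang division results in which some gangs are too large or there are many very small gangs. Furthermore, the present disclosure uses similarity indexing to improve the data processing capability of the graph model, and generates the similarity-weighted user association graph, used as the user association graph, through subgraph assembly and configurable similarity edge weights; this approach is more flexible and can further improve large-scale data processing capability in fraud scenarios.
It is worth noting that, for simplicity of description, the method embodiment shown in Fig. 1 is expressed as a series of combined actions, but those skilled in the art should understand that the present disclosure is not limited by the described order of actions. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the present disclosure.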
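Fig. 7 shows an apparatus for detecting gang fraud based on a graph model according to an exemplary embodiment of the present disclosure. As shown in Fig. 7, the apparatus 300 for detecting gang fraud based on a graph model includes the following modules.
An obtaining module 310, configured to obtain data of multiple users and historical suspect user data.
A first generating module 320, configured to generate a user association graph according to the obtained data, wherein the nodes of the user association graph are user association subgraphs generated according to data features, and the edge weights of the user association graph include the similarity of the nodes.
A second generating module 330, configured to generate multiple gang sets to be determined by using a community division algorithm based on the user association graph.
A calculation module 340, configured to calculate, for each gang set to be determined, the suspicion degree of the gang set to be determined.
An output module 350, configured to output, for each gang set to be determined, a determination result for the gang to be determined according to the calculated suspicion degree.
Optionally, as shown in Fig. 8, the first generating module 320 includes: a first selection sub-module 321, configured to select feature combinations and a number of groups from the data of the multiple users and the historical suspect user data; a first generation sub-module 322, configured to generate user association subgraphs based on the feature combinations and the number of groups, using exact feature matching (consistency equality) or fuzzy feature matching (fuzzy equality), and to splice the user association subgraphs as nodes to generate an unweighted user association graph; and a second generation sub-module 323, configured to use the similarity of the nodes in the unweighted user association graph as edge weights to generate a similarity-weighted user association graph as the user association graph.
Optionally, as shown in Fig. 9, the second generating module 330 includes: a third generation sub-module 331, configured to generate n gang sets by using a community division algorithm based on the user association graph, where n is a positive integer; an adjustment sub-module 332, configured to adjust each gang set according to the number of users in the gang set to obtain multiple new gang sets; and a confirmation sub-module 333, configured to determine the multiple new gang sets as the multiple gang sets to be determined.
Optionally, as shown in Fig. 10, the adjustment sub-module 332 further includes: a dividing unit 3321, configured to call the community division algorithm to divide a gang set whose number of users is greater than a maximum threshold, so that the number of users in each new gang set is less than or equal to the maximum threshold; and an agglomeration unit 3322, configured to, if the number of gang sets whose number of users is less than a minimum threshold is greater than a preset threshold, call a hierarchical clustering algorithm to agglomerate the gang sets whose number of users is less than the minimum threshold.
Optionally, the community division algorithm includes a graph label propagation algorithm or the GN algorithm; the hierarchical clustering algorithm includes an agglomerative algorithm or a divisive algorithm.
Optionally, as shown in Fig. 11, the calculation module 340 includes: a second selection sub-module 341a, configured to select target data features from the data features; and a first calculation sub-module 342a, configured to calculate the suspicion degree of the gang set to be determined according to the proportion of the target data features in the gang set to be determined.
Optionally, as shown in Fig. 12, the calculation module 340 includes: a first extraction sub-module 341b, configured to extract the gang features of each gang set to be determined; and a first input sub-module 342b, configured to input the gang features into a trained regression model so that the regression model outputs the suspicion degree of the gang set to be determined.
Optionally, as shown in Fig. 13, the calculation module 340 includes: a third selection sub-module 341c, configured to select target data features from the data features; a second calculation sub-module 342c, configured to calculate a first suspicion score of the gang set to be determined according to the proportion of the target data features in the gang set to be determined; a second extraction sub-module 343c, configured to extract the gang features of each gang set to be determined; a second input sub-module 344c, configured to input the gang features into a trained regression model so that the regression model outputs a second suspicion score of the gang set to be determined; and a third calculation sub-module 345c, configured to calculate a comprehensive suspicion score of the gang set to be determined according to the first suspicion score and the second suspicion score.
Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.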
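The present disclosure also provides a non-volatile computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the processor is caused to implement the steps of the method for detecting gang fraud based on a graph model in any of the above optional embodiments.
The present disclosure also provides an apparatus for detecting gang fraud based on a graph model, including: a memory on which a computer program is stored; and a processor for executing the computer program in the memory to implement the steps of the method for detecting gang fraud based on a graph model in any of the above optional embodiments.
Fig. 14 is a block diagram of an apparatus 400 for detecting gang fraud based on a graph model according to an exemplary embodiment. As shown in Fig. 14, the apparatus 400 may include: a processor 401, a memory 402, a multimedia component 403, an input/output (I/O) interface 404, and a communication component 405.
The processor 401 is used to control the overall operation of the apparatus 400 to complete all or part of the steps of the above method for detecting gang fraud based on a graph model. The memory 402 is used to store various types of data to support operation of the apparatus 400; these data may include, for example, instructions for any application or method operating on the apparatus 400, as well as application-related data. The memory 402 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. The multimedia component 403 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used to output and/or input audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signals can be further stored in the memory 402 or sent through the communication component 405. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 404 provides an interface between the processor 401 and other interface modules, which may be a keyboard, a mouse, buttons, and so on. The buttons may be virtual buttons or physical buttons. The communication component 405 is used for wired or wireless communication between the apparatus 400 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, or 4G, or a combination of one or more of them; accordingly, the communication component 405 may include a Wi-Fi module, a Bluetooth module, and an NFC module.
In an exemplary embodiment, the apparatus 400 may be implemented by one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method for detecting gang fraud based on a graph model.
In another exemplary embodiment, a computer-readable storage medium including program instructions is also provided, such as the memory 402 including program instructions; the program instructions can be executed by the processor 401 of the apparatus 400 to complete the above method for detecting gang fraud based on a graph model.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings. However, the present disclosure is not limited to the specific details of the above embodiments. Within the scope of the technical concept of the present disclosure, various simple modifications can be made to the technical solutions of the present disclosure, and these simple modifications all fall within the protection scope of the present disclosure.
It should also be noted that the specific technical features described in the above embodiments can be combined in any suitable manner provided there is no contradiction. To avoid unnecessary repetition, the present disclosure does not separately describe the various possible combinations.
In addition, the various embodiments of the present disclosure can also be combined arbitrarily; as long as a combination does not depart from the idea of the present disclosure, it should likewise be regarded as content disclosed by the present disclosure.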

Claims (10)

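  1. A method for detecting gang fraud based on a graph model, characterized in that the method comprises:
    obtaining data of multiple users and historical suspect user data;
    generating a user association graph according to the obtained data, wherein the nodes of the user association graph are user association subgraphs generated according to data features, and the edge weights of the user association graph include the similarity of the nodes;
    generating multiple gang sets to be determined by using a community division algorithm based on the user association graph;
    for each gang set to be determined, calculating the suspicion degree of the gang set to be determined;
    for each gang set to be determined, outputting a determination result for the gang to be determined according to the calculated suspicion degree.
  2. The method according to claim 1, characterized in that generating the user association graph comprises:
    selecting feature combinations and a number of groups from the data of the multiple users and the historical suspect user data;
    generating user association subgraphs based on the feature combinations and the number of groups, using exact feature matching (consistency equality) or fuzzy feature matching (fuzzy equality);
    splicing the user association subgraphs as nodes to generate an unweighted user association graph;
    using the similarity of the nodes in the unweighted user association graph as edge weights to generate a similarity-weighted user association graph as the user association graph.
  3. The method according to claim 1, characterized in that generating the multiple gang sets to be determined by using the community division algorithm comprises:
    generating n gang sets by using the community division algorithm based on the user association graph, where n is a positive integer;
    for each gang set, adjusting it according to the number of users in the gang set to obtain multiple new gang sets;
    determining the multiple new gang sets as the multiple gang sets to be determined.
  4. The method according to claim 3, characterized in that adjusting according to the number of users in the gang set comprises:
    for a gang set whose number of users is greater than a maximum threshold, calling the community division algorithm to divide it so that the number of users in each new gang set is less than or equal to the maximum threshold;
    if the number of gang sets whose number of users is less than a minimum threshold is greater than a preset threshold, calling a hierarchical clustering algorithm to agglomerate the gang sets whose number of users is less than the minimum threshold.
  5. The method according to claim 4, characterized in that
    the community division algorithm includes a graph label propagation algorithm or the GN algorithm;
    the hierarchical clustering algorithm includes an agglomerative algorithm or a divisive algorithm.
  6. The method according to claim 1, characterized in that calculating the suspicion degree of the gang set to be determined comprises:
    selecting target data features from the data features;
    calculating the suspicion degree of the gang set to be determined according to the proportion of the target data features in the gang set to be determined.
  7. The method according to claim 1, characterized in that calculating the suspicion degree of the gang set to be determined comprises:
    extracting the gang features of each gang set to be determined;
    inputting the gang features into a trained regression model so that the regression model outputs the suspicion degree of the gang set to be determined.
  8. An apparatus for detecting gang fraud based on a graph model, characterized in that the apparatus comprises:
    an obtaining module, configured to obtain data of multiple users and historical suspect user data;
    a first generating module, configured to generate a user association graph according to the obtained data, wherein the nodes of the user association graph are user association subgraphs generated according to features of the data, and the edge weights of the user association graph include the similarity of the nodes;
    a second generating module, configured to generate multiple gang sets to be determined by using a community division algorithm based on the user association graph;
    a calculation module, configured to calculate, for each gang set to be determined, the suspicion degree of the gang set to be determined;
    an output module, configured to output, for each gang set to be determined, a determination result for the gang to be determined according to the calculated suspicion degree.
  9. A non-volatile computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the processor is caused to implement the steps of the method according to any one of claims 1 to 7.
  10. An apparatus for detecting gang fraud based on a graph model, characterized by comprising:
    a memory on which a computer program is stored; and
    a processor for executing the computer program in the memory to implement the steps of the method according to any one of claims 1 to 7.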
PCT/CN2019/124807 2019-03-27 2019-12-12 基于图模型检测团伙欺诈 WO2020192184A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910239821.3A CN110070364A (zh) 2019-03-27 2019-03-27 基于图模型检测团伙欺诈的方法和装置、存储介质
CN201910239821.3 2019-03-27

Publications (1)

Publication Number Publication Date
WO2020192184A1 true WO2020192184A1 (zh) 2020-10-01

Family

ID=67366679

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/124807 WO2020192184A1 (zh) 2019-03-27 2019-12-12 基于图模型检测团伙欺诈

Country Status (2)

Country Link
CN (1) CN110070364A (zh)
WO (1) WO2020192184A1 (zh)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070364A (zh) * 2019-03-27 2019-07-30 北京三快在线科技有限公司 基于图模型检测团伙欺诈的方法和装置、存储介质
CN112651764B (zh) * 2019-10-12 2023-03-31 武汉斗鱼网络科技有限公司 一种目标用户识别方法、装置、设备和存储介质
CN110827159B (zh) * 2019-11-11 2023-11-03 上海交通大学 基于关系图的金融医疗保险诈骗预警方法、装置及终端
CN112907308B (zh) * 2019-11-19 2024-05-24 京东科技控股股份有限公司 数据检测方法和装置、计算机可读存储介质
CN111090729B (zh) * 2019-12-16 2024-04-09 深圳市卡牛科技有限公司 欺诈团伙的识别方法、装置、服务器和存储介质
CN111339436B (zh) * 2020-02-11 2021-05-28 腾讯科技(深圳)有限公司 一种数据识别方法、装置、设备以及可读存储介质
CN111325350B (zh) * 2020-02-19 2023-09-29 第四范式(北京)技术有限公司 可疑组织发现***和方法
CN111415241A (zh) * 2020-02-29 2020-07-14 深圳壹账通智能科技有限公司 欺诈人员识别方法、装置、设备和存储介质
CN111401959B (zh) * 2020-03-18 2023-09-29 多点(深圳)数字科技有限公司 风险群体的预测方法、装置、计算机设备及存储介质
CN111428217B (zh) * 2020-04-12 2023-07-28 中信银行股份有限公司 欺诈团伙识别方法、装置、电子设备及计算机可读存储介质
CN111476662A (zh) * 2020-04-13 2020-07-31 中国工商银行股份有限公司 反洗钱识别方法及装置
CN111709756A (zh) * 2020-06-16 2020-09-25 银联商务股份有限公司 一种可疑社团的识别方法、装置、存储介质和计算机设备
CN111931047B (zh) * 2020-07-31 2022-06-21 中国平安人寿保险股份有限公司 基于人工智能的黑产账号检测方法及相关装置
CN112184334A (zh) * 2020-10-27 2021-01-05 北京嘀嘀无限科技发展有限公司 用于确定问题用户的方法、装置、设备和介质
CN112308694A (zh) * 2020-11-24 2021-02-02 拉卡拉支付股份有限公司 一种欺诈团伙的发现方法及装置
CN112508456A (zh) * 2020-12-25 2021-03-16 平安国际智慧城市科技股份有限公司 食品安全风险评估方法、***、计算机设备及存储介质
CN113326178A (zh) * 2021-06-22 2021-08-31 北京奇艺世纪科技有限公司 一种异常账号传播方法、装置、电子设备和存储介质
CN113592517A (zh) * 2021-08-09 2021-11-02 深圳前海微众银行股份有限公司 欺诈客群识别方法、装置、终端设备及计算机存储介质
CN114820219B (zh) * 2022-05-23 2022-09-20 杭银消费金融股份有限公司 一种基于复杂网络的欺诈社团识别方法及***
CN115150052B (zh) * 2022-06-08 2023-04-07 北京天融信网络安全技术有限公司 攻击团伙的跟踪识别方法、装置、设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140067873A1 (en) * 2012-06-26 2014-03-06 International Business Machines Corporation Efficient egonet computation in a weighted directed graph
CN105812195A (zh) * 2014-12-30 2016-07-27 阿里巴巴集团控股有限公司 计算机识别批量账户的方法和装置
CN107194623A (zh) * 2017-07-20 2017-09-22 深圳市分期乐网络科技有限公司 一种团伙欺诈的发现方法及装置
CN108681936A (zh) * 2018-04-26 2018-10-19 浙江邦盛科技有限公司 一种基于模块度和平衡标签传播的欺诈团伙识别方法
CN110070364A (zh) * 2019-03-27 2019-07-30 北京三快在线科技有限公司 基于图模型检测团伙欺诈的方法和装置、存储介质

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102662964B (zh) * 2012-03-05 2015-05-27 北京千橡网景科技发展有限公司 对用户的好友进行分组的方法和装置
CN107527295B (zh) * 2017-08-24 2021-04-30 中南大学 基于时态合著网络的学术团队动态社区发现方法及其质量评估方法
CN107644098A (zh) * 2017-09-29 2018-01-30 马上消费金融股份有限公司 一种欺诈行为识别方法、装置、设备及存储介质
CN107784327A (zh) * 2017-10-27 2018-03-09 天津理工大学 一种基于gn的个性化社区发现方法
CN108764917A (zh) * 2018-05-04 2018-11-06 阿里巴巴集团控股有限公司 一种欺诈团伙的识别方法和装置
CN108898505B (zh) * 2018-05-28 2021-07-23 武汉斗鱼网络科技有限公司 作弊团伙的识别方法、相关存储介质和电子设备
CN109299811B (zh) * 2018-08-20 2021-02-02 众安在线财产保险股份有限公司 一种基于复杂网络的欺诈团伙识别和风险传播预测的方法

Also Published As

Publication number Publication date
CN110070364A (zh) 2019-07-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19921761

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19921761

Country of ref document: EP

Kind code of ref document: A1