WO2021003803A1 - Data processing method and apparatus, storage medium and electronic device

Info

Publication number: WO2021003803A1
Application number: PCT/CN2019/101420
Authority: WO
Inventors: 顾全; 张文会
Original assignee: 同盾控股有限公司
Priority date: 2019-07-11
Filing date: 2019-08-19
Publication date: 2021-01-14
Also published as: CN110348516B; CN110348516A

Abstract

Disclosed are a data processing method and apparatus, a storage medium and an electronic device. The method comprises: acquiring, on the basis of a boosting tree model, a fraud probability value of data to be detected (S110); acquiring a first group according to a graph model and the fraud probability value of the data to be detected (S120); acquiring, from the data to be detected and on the basis of an association rule model and the fraud probability value of the data to be detected, a second group corresponding to a rule (S130); and determining, on the basis of the fraud probability value of the data to be detected, the first group and the second group, a target fraud group in the data to be detected (S140). A graph model and an association rule model are respectively fused with a boosting tree model, and fusion scoring is then carried out on results from the two models, such that advantages of various models are fused and the defect of each model and the under-fitting defect of a single model are overcome, thereby improving the accuracy of the identification of a fraud group.

Description

Data processing method and apparatus, storage medium and electronic device

This disclosure claims priority to Chinese invention patent application No. CN 201910625054.X, filed on July 11, 2019 and entitled "Data processing method and apparatus, storage medium and electronic device".

Technical field

The embodiments of the present disclosure relate to the field of computer technology, and in particular to a data processing method and apparatus, a storage medium, and an electronic device.

Background

With the development of information technology, information-based fraud is becoming more common, and much of it is committed by organized groups.

At present, fraud groups are most commonly identified with unsupervised clustering algorithms such as K-Means and DBSCAN, or with semi-supervised graph clustering algorithms such as the label propagation algorithm.

Unsupervised clustering algorithms do not rely on labels. Instead, they look for intrinsic associations (distances) in the sample feature data and use them to divide the samples into multiple clusters. For example, K-Means divides n samples into k clusters so that each point belongs to the cluster whose mean (the cluster center) is nearest to it, and uses this as the clustering criterion.
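As an illustration of the K-Means criterion described above, the following is a minimal Python sketch, not part of the original disclosure. The function names, the sample points and the fixed iteration count are assumptions made for the example.

```python
import math

def assign_to_nearest(points, centers):
    """Return, for each point, the index of the nearest center (Euclidean distance)."""
    return [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points]

def update_centers(points, labels, centers):
    """Recompute each center as the mean of its members; empty clusters keep their center."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers

def k_means(points, centers, iterations=10):
    """Alternate assignment and center update for a fixed number of iterations (assumed)."""
    for _ in range(iterations):
        labels = assign_to_nearest(points, centers)
        centers = update_centers(points, labels, centers)
    return assign_to_nearest(points, centers), centers

# Hypothetical two-dimensional samples forming two obvious clusters.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centers = k_means(points, centers=[(0, 0), (10, 10)])
```

Note that no label enters the computation at any point, which is the limitation the background section goes on to discuss.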

Semi-supervised clustering algorithms consider not only the associations between sample feature data but also, to some extent, the label information of the samples. For example, the label propagation algorithm is a graph-based semi-supervised learning method whose basic idea is to use the label information of labeled nodes to predict the labels of unlabeled nodes. The time complexity and space complexity of this algorithm are O(n) and O(n²) respectively, where n is the number of nodes in the community.

In the process of implementing the present invention, the inventors found that the above methods for identifying fraud groups have at least the following technical problems:

Disadvantages of unsupervised clustering algorithms: because the sample labels are not taken into account, even the best unsupervised algorithm cannot make full use of the value of the data, since the labels are often the most important information for modeling. In addition, unsupervised clustering algorithms usually consider the distance between samples. When the sample features are weak and the feature dimensions are limited, samples that are close in space do not necessarily share a label, and samples that are far apart do not necessarily have different labels, so the clustering result may differ considerably from the true labels.

Disadvantages of semi-supervised graph clustering algorithms: although semi-supervised algorithms take the sample labels into account, labeling unknown samples on the graph directly from existing labels easily leads to low precision. This is because fraud samples always make up a very small share of the population (usually on the order of one in a thousand), so an unknown sample that is associated with a fraud sample (through associations such as mobile phone number, contacts, immediate relatives or cookies) is still very likely not to be fraudulent. In addition, these association dimensions are limited: the other feature information of the samples cannot be fully used, effective feature engineering cannot be applied to extend the dimensions, and the relative strength of each association dimension cannot be determined. As a result, semi-supervised graph clustering algorithms do not perform particularly well in practice.

Therefore, a new data processing method, apparatus, electronic device and computer-readable medium are needed.

The above information disclosed in the background section is only intended to aid understanding of the background of the present disclosure, and it may therefore include information that does not constitute prior art known to those of ordinary skill in the art.

Summary of the invention

In view of this, the present invention provides a data processing method and apparatus, a storage medium, and an electronic device that improve the accuracy of identifying fraud groups.

Other features and advantages of the present invention will become apparent from the following detailed description, or will be learned in part through practice of the present invention.

According to a first aspect of the embodiments of the present invention, a data processing method is provided, wherein the method includes:

obtaining a fraud probability value of data to be detected based on a boosting tree model;

obtaining a first group according to a graph model and the fraud probability value of the data to be detected;

obtaining, from the data to be detected, a second group corresponding to a rule based on an association rule model and the fraud probability value of the data to be detected;

determining a target fraud group in the data to be detected based on the fraud probability value of the data to be detected, the first group and the second group.

In some exemplary embodiments of the present invention, based on the foregoing solution, before the first group is obtained according to the graph model and the fraud probability value of the data to be detected, the method includes:

taking each item of data to be detected as an entry in a vertex table, extracting dimension features shared between items of data to be detected as entries in an edge table, and calculating the association value of each edge table entry according to the weights of the dimension features;

generating graph data of the data to be detected according to the vertex table, the edge table and the association values of the edge table.

In some exemplary embodiments of the present invention, based on the foregoing solution, obtaining the first group according to the graph model and the fraud probability value of the data to be detected includes:

obtaining multiple feature groups of the data to be detected based on the graph model;

obtaining, in each of the multiple feature groups, the data to be detected whose fraud probability value exceeds a fraud threshold;

selecting each feature group in which the data to be detected whose fraud probability value exceeds the fraud threshold account for a proportion of the group's data to be detected that exceeds a ratio threshold, the selected feature group being the first group.
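The selection of the first group described above can be sketched as follows. This is only an illustration: the function name, the data layout (lists of item identifiers and a dictionary of probabilities) and the threshold values are assumptions, not taken from the disclosure.

```python
def select_first_groups(feature_groups, fraud_prob, fraud_threshold, ratio_threshold):
    """Keep the feature groups whose share of high-probability items exceeds ratio_threshold."""
    selected = []
    for group in feature_groups:
        if not group:
            continue  # an empty group has no meaningful proportion
        risky = [item for item in group if fraud_prob[item] > fraud_threshold]
        if len(risky) / len(group) > ratio_threshold:
            selected.append(group)
    return selected

# Hypothetical feature groups from the graph model and hypothetical fraud probabilities.
feature_groups = [["a", "b", "c"], ["d", "e"]]
fraud_prob = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.2, "e": 0.9}
first_groups = select_first_groups(feature_groups, fraud_prob,
                                   fraud_threshold=0.5, ratio_threshold=0.5)
```

In this example the first feature group has two of three items above the fraud threshold and is kept, while the second has exactly half and is dropped because the comparison is strict.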

In some exemplary embodiments of the present invention, based on the foregoing solution, the method further includes obtaining the association rule model, which includes:

obtaining sample data;

obtaining multiple rule groups of the sample data based on an initial association rule model;

determining the lift of the rule corresponding to each rule group based on the true outcomes of the sample data in the multiple rule groups;

selecting the rule groups whose lift exceeds a lift threshold;

obtaining the association rule model based on the selected rule groups, wherein the association rule model can provide the rule corresponding to each rule group and the lift of that rule.
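The disclosure does not spell out how lift is computed. The sketch below assumes the usual definition, namely the fraud rate inside a rule group divided by the fraud rate of the whole sample. The function names, rule names and labels are hypothetical.

```python
def rule_lift(group_labels, all_labels):
    """Lift = fraud rate within the rule group / fraud rate over all samples (assumed definition)."""
    base_rate = sum(all_labels) / len(all_labels)
    if base_rate == 0:
        raise ValueError("lift is undefined when the sample contains no fraud labels")
    return (sum(group_labels) / len(group_labels)) / base_rate

def keep_high_lift_rules(rule_groups, labels, lift_threshold):
    """Return {rule: lift} for the rules whose lift exceeds lift_threshold."""
    all_labels = list(labels.values())
    kept = {}
    for rule, members in rule_groups.items():
        lift = rule_lift([labels[m] for m in members], all_labels)
        if lift > lift_threshold:
            kept[rule] = lift
    return kept

# Hypothetical sample: 1 marks a confirmed fraud case, 0 a normal case.
labels = {1: 1, 2: 1, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0}
rule_groups = {"same_device": [1, 2, 3], "same_region": [3, 4, 5, 6]}
kept = keep_high_lift_rules(rule_groups, labels, lift_threshold=2.0)
```

Here the base fraud rate is 2/8, the "same_device" group has a fraud rate of 2/3 and therefore a lift of 8/3, and the "same_region" group contains no fraud and is discarded.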

In some exemplary embodiments of the present invention, based on the foregoing solution, obtaining, from the data to be detected, the second group corresponding to the rule based on the association rule model and the fraud probability value of the data to be detected includes:

selecting the data to be detected whose fraud probability value exceeds the fraud threshold;

inputting the selected data to be detected into the association rule model to obtain the second group corresponding to the rule.

In some exemplary embodiments of the present invention, based on the foregoing solution, determining the target fraud group in the data to be detected based on the fraud probability value of the data to be detected, the first group and the second group includes:

obtaining the direct/indirect degree distance (直间度距离) of the first group based on the first group;

determining a scoring model based on the fraud probability value of the data to be detected;

inputting the fraud probability value, the first group, the direct/indirect degree distance of the first group, the second group and the lift of the rule into the scoring model to determine the target fraud group in the data to be detected.

In some exemplary embodiments of the present invention, based on the foregoing solution, obtaining the direct/indirect degree distance of the first group based on the first group includes:

obtaining the direct/indirect degree distance of the first group based on the distance, in the graph data, between each item of data to be detected in the first group and the data to be detected that exceed the fraud threshold.
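The disclosure does not define how this distance is measured. One possible reading, shown here purely as an assumption, is the number of hops in the graph from each item to the nearest item whose fraud probability exceeds the fraud threshold (0 for such an item itself, 1 for a direct neighbor, 2 or more for indirect ones). The function names and the adjacency-list layout are hypothetical.

```python
from collections import deque

def hops_to_nearest_risky(adjacency, start, risky):
    """Breadth-first search: hop count from start to the closest risky node, or None."""
    if start in risky:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor in seen:
                continue
            if neighbor in risky:
                return dist + 1
            seen.add(neighbor)
            queue.append((neighbor, dist + 1))
    return None  # no risky node is reachable from start

def group_distances(adjacency, group, risky):
    """Hop distance of every group member to its nearest risky node (assumed metric)."""
    return {node: hops_to_nearest_risky(adjacency, node, risky) for node in group}

# Hypothetical chain A - B - C - D in which only A exceeds the fraud threshold.
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
distances = group_distances(adjacency, ["A", "B", "C", "D"], risky={"A"})
```

How the per-item distances are aggregated into a single value for the group is likewise left open by the text, so the sketch stops at the per-item level.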

In some exemplary embodiments of the present invention, based on the foregoing solution, determining the scoring model based on the fraud probability value of the data to be detected includes:

mapping the score of a fraud group obtained from an initial scoring model to each item of data to be detected in the fraud group, to obtain a score for each item of data to be detected in the fraud group;

determining the weights in the initial scoring model based on the score and the fraud probability value of each item of data to be detected in the fraud group;

obtaining the scoring model based on the weights.

According to a second aspect of the embodiments of the present invention, a data processing apparatus is provided, wherein the apparatus includes:

a first obtaining module, configured to obtain a fraud probability value of data to be detected based on a boosting tree model;

a second obtaining module, configured to obtain a first group according to a graph model and the fraud probability value of the data to be detected;

a third obtaining module, configured to obtain, from the data to be detected, a second group corresponding to a rule based on an association rule model and the fraud probability value of the data to be detected;

a determining module, configured to determine a target fraud group in the data to be detected based on the fraud probability value of the data to be detected, the first group and the second group.

In some exemplary embodiments of the present invention, based on the foregoing solution, the apparatus further includes a preprocessing module, configured to take each item of data to be detected as an entry in a vertex table, extract dimension features shared between items of data to be detected as entries in an edge table, and calculate the association value of each edge table entry according to the weights of the dimension features; and to generate graph data of the data to be detected according to the vertex table, the edge table and the association values of the edge table.

In some exemplary embodiments of the present invention, based on the foregoing solution, the second obtaining module includes:

a first obtaining unit, configured to obtain multiple feature groups of the data to be detected based on the graph model;

a second obtaining unit, configured to obtain, in each of the multiple feature groups, the data to be detected whose fraud probability value exceeds a fraud threshold;

a screening unit, configured to select each feature group in which the data to be detected whose fraud probability value exceeds the fraud threshold account for a proportion of the group's data to be detected that exceeds a ratio threshold, the selected feature group being the first group.

In some exemplary embodiments of the present invention, based on the foregoing solution, the apparatus further includes a rule obtaining module, configured to obtain the association rule model; the rule obtaining module includes:

a first obtaining unit, configured to obtain sample data;

a second obtaining unit, configured to obtain multiple rule groups of the sample data based on an initial association rule model;

a determining unit, configured to determine the lift of the rule corresponding to each rule group based on the true outcomes of the sample data in the multiple rule groups;

a screening unit, configured to select the rule groups whose lift exceeds a lift threshold;

a third obtaining unit, configured to obtain the association rule model based on the selected rule groups, wherein the association rule model can provide the rule corresponding to each rule group and the lift of that rule.

In some exemplary embodiments of the present invention, based on the foregoing solution, the third obtaining module is configured to select the data to be detected whose fraud probability value exceeds the fraud threshold, and to input the selected data to be detected into the association rule model to obtain the second group corresponding to the rule.

In some exemplary embodiments of the present invention, based on the foregoing solution, the determining module is configured to obtain the direct/indirect degree distance (直间度距离) of the first group based on the first group; determine a scoring model based on the fraud probability value of the data to be detected; and input the fraud probability value, the first group, the direct/indirect degree distance of the first group, the second group and the lift of the rule into the scoring model to determine the target fraud group in the data to be detected.

In some exemplary embodiments of the present invention, based on the foregoing solution, the determining module is configured to obtain the direct/indirect degree distance of the first group based on the distance, in the graph data, between each item of data to be detected in the first group and the data to be detected that exceed the fraud threshold.

In some exemplary embodiments of the present invention, based on the foregoing solution, the determining module is configured to map the score of a fraud group obtained from an initial scoring model to each item of data to be detected in the fraud group, to obtain a score for each item of data to be detected in the fraud group; determine the weights in the initial scoring model based on the score and the fraud probability value of each item of data to be detected in the fraud group; and obtain the scoring model based on the weights.

According to a third aspect of the embodiments of the present invention, a computer-readable storage medium is provided on which a computer program is stored, wherein the program, when executed by a processor, implements the method steps described in the first aspect.

According to a fourth aspect of the embodiments of the present invention, an electronic device is provided, including: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method steps described in the first aspect.

In the embodiments of the present invention, a fraud probability value of the data to be detected is obtained based on a boosting tree model; a first group is obtained according to a graph model and the fraud probability value of the data to be detected; a second group corresponding to a rule is obtained from the data to be detected based on an association rule model and the fraud probability value of the data to be detected; and a target fraud group in the data to be detected is determined based on the fraud probability value of the data to be detected, the first group and the second group. The graph model and the association rule model are each fused with the boosting tree model, and the results of the two are then fused and scored. This combines the advantages of the different models, overcomes the shortcomings of each model as well as the under-fitting of a single model, and improves the accuracy of identifying fraud groups.

It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present disclosure.

Description of the drawings

The drawings herein are incorporated into and constitute a part of this specification, show embodiments consistent with the present invention, and together with the specification serve to explain the principles of the present invention. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:

Fig. 1 is a flowchart of a data processing method according to an exemplary embodiment;

Fig. 2 is a schematic diagram of graph data according to an embodiment of the present invention;

Fig. 3 is a flowchart of a method for obtaining a first group according to an exemplary embodiment;

Fig. 4 is a flowchart of a method for obtaining an association rule model according to an exemplary embodiment;

Fig. 5 is a flowchart of a method for obtaining a scoring model from sample data according to an exemplary embodiment;

Fig. 6 is a schematic diagram of data flow between models according to an exemplary embodiment;

Fig. 7 is a schematic structural diagram of a data processing apparatus according to an exemplary embodiment;

Fig. 8 is a schematic structural diagram of an electronic device according to an exemplary embodiment.

Detailed description

Example embodiments will now be described more fully with reference to the accompanying drawings. However, the example embodiments can be implemented in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present invention will be thorough and complete and will fully convey the concept of the example embodiments to those skilled in the art. In the figures, the same reference numerals denote the same or similar parts, and their repeated description is therefore omitted.

Furthermore, the described features, structures or characteristics may be combined in one or more embodiments in any suitable manner. In the following description, many specific details are provided to give a thorough understanding of the embodiments of the present invention. However, those skilled in the art will recognize that the technical solutions of the present invention can be practiced without one or more of the specific details, or with other methods, components, apparatuses, steps and so on. In other cases, well-known methods, apparatuses, implementations or operations are not shown or described in detail, to avoid obscuring aspects of the present invention.

The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically independent entities. That is, these functional entities can be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor apparatuses and/or microcontroller apparatuses.

The flowcharts shown in the drawings are only exemplary. They need not include all contents and operations/steps, nor must they be performed in the order described. For example, some operations/steps can be split, and others can be merged or partially merged, so the actual order of execution may change according to the actual situation.

It should be understood that although the terms first, second, third and so on may be used herein to describe various components, these components should not be limited by these terms, which are used only to distinguish one component from another. Therefore, a first component discussed below could be called a second component without departing from the teachings of the present disclosure. As used herein, the term "and/or" includes any one of the associated listed items and all combinations of one or more of them.

The data processing method proposed in the embodiments of the present invention is described in detail below with reference to specific embodiments. It should be noted that the embodiments of the present invention may be executed by an apparatus with computing and processing capability, for example a server and/or a terminal device, but the present invention is not limited thereto.

Fig. 1 is a flowchart of a data processing method according to an exemplary embodiment.

As shown in Fig. 1, the method may include, but is not limited to, the following steps:

In S110, the fraud probability value of the data to be detected is obtained based on the boosting tree model.

In the embodiment of the present invention, there may be one or more items of data to be detected. After the data to be detected is obtained, its multi-dimensional features can be extracted. From these features, further features can be constructed, such as cross features, aggregate features, window features and one-hot features; the total number of features may exceed 500, so that the feature information of the data to be detected is fully used. Features may include, but are not limited to: mobile phone number, contacts, immediate relatives, cookie, surname, region, age, gender and occupation.
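To make the kinds of constructed features more concrete, the following is a small Python sketch of a one-hot feature, a cross feature and an aggregate feature. The helper names, field names and records are hypothetical and only illustrate the general idea; the disclosure does not describe its feature engineering at this level of detail.

```python
def one_hot(value, categories):
    """One-hot feature: a 0/1 indicator for each known category."""
    return [1 if value == c else 0 for c in categories]

def cross_feature(record, field_a, field_b):
    """Cross feature: the combination of two fields treated as a single categorical value."""
    return f"{record[field_a]}|{record[field_b]}"

def aggregate_count(records, field):
    """Aggregate feature: how many records share each value of the given field."""
    counts = {}
    for record in records:
        counts[record[field]] = counts.get(record[field], 0) + 1
    return counts

# Hypothetical records with two of the feature types mentioned in the text.
records = [
    {"region": "north", "occupation": "driver"},
    {"region": "north", "occupation": "clerk"},
    {"region": "south", "occupation": "driver"},
]
occupation_vector = one_hot(records[0]["occupation"], ["clerk", "driver"])
region_occupation = cross_feature(records[0], "region", "occupation")
region_counts = aggregate_count(records, "region")
```

A window feature would follow the same pattern as the aggregate count, restricted to records inside a time window, which requires timestamps that this toy data does not carry.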

According to the embodiment of the present invention, after the data to be detected is obtained, it can also be oversampled, items with incomplete or incorrect information can be removed, and Bayesian parameter tuning can then be performed on the boosting tree model, so that the fraud probability values obtained from the boosting tree model are more accurate.

In the embodiment of the present invention, the boosting tree model may specifically be LightGBM. LightGBM is a second-order gradient boosting tree model developed and open-sourced by Microsoft, in which the trees are combined through a boosting framework. Compared with first-order gradient models (such as GBDT), it converges faster, fits better, and achieves higher precision and recall.
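The difference between first-order and second-order boosting can be illustrated with the second-order Taylor approximation that is commonly used to describe such models. The notation below ($l$, $f_t$, $g_i$, $h_i$, $\Omega$) is the conventional one and is an assumption of this sketch; it does not appear in the disclosure.

```latex
% Objective at boosting round t, expanded to second order around the
% previous prediction \hat{y}_i^{(t-1)}; f_t is the tree added in round t.
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ l\!\left(y_i, \hat{y}_i^{(t-1)}\right)
    + g_i \, f_t(x_i) + \tfrac{1}{2} \, h_i \, f_t(x_i)^2 \right] + \Omega(f_t),
\qquad
g_i = \frac{\partial\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}},
\quad
h_i = \frac{\partial^2 l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \big(\hat{y}_i^{(t-1)}\big)^2}
```

A first-order method fits each new tree using only the gradients $g_i$, whereas a second-order method also uses the curvature terms $h_i$ when choosing splits and leaf values.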

In the embodiment of the present invention, the fraud probability values (Probs) output by LightGBM are used in two ways. On the one hand, they are used to screen the grouping results of the Louvain model, so that the high-risk first group can be found. On the other hand, the data to be detected that remain after a threshold is applied to the fraud probability values (Probs) are used as the input of the association rule model, which can be used to discover the second group, whose members share common rules with high lift.

In S120, the first group is obtained according to the graph model and the fraud probability value of the data to be detected.

In the embodiment of the present invention, after the data to be detected is obtained, it may be preprocessed to obtain graph data of the data to be detected.

In the embodiment of the present invention, when the graph data of the data to be detected is obtained, each item of data to be detected is taken as an entry in a vertex table, dimension features shared between items of data to be detected are extracted as entries in an edge table, and the association value of each edge table entry is calculated according to the weights of the dimension features. The graph data of the data to be detected is then generated according to the vertex table, the edge table and the association values of the edge table.

For example, suppose the data to be detected include A, B, C and D, each of which is taken as a vertex. Assume that A and B have the same mobile phone number and surname, B and C have the same contact, and C and D have the same immediate relative, and that the preset weights are 4 for the mobile phone number dimension, 3 for the contact dimension, 2 for the immediate relative dimension and 1 for the surname dimension. Then there is an edge between A and B whose association value is the sum of the weights for mobile phone number and surname, 4 + 1 = 5; there is an edge between B and C whose association value is the contact weight, 3; and there is an edge between C and D whose association value is the immediate relative weight, 2. The corresponding graph data is shown in Fig. 2, which is a schematic diagram of graph data according to an embodiment of the present invention.
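The worked example above can be reproduced with a short Python sketch. The weights are those given in the text; the function name, the field names and the concrete field values (for instance "p1" for a phone number) are placeholders invented for illustration.

```python
from itertools import combinations

# Dimension weights from the example: phone 4, contact 3, immediate relative 2, surname 1.
FEATURE_WEIGHTS = {"phone": 4, "contact": 3, "relative": 2, "surname": 1}

def build_edges(records, weights):
    """Return {(u, v): association value} for every pair of records sharing a feature value."""
    edges = {}
    for (u, rec_u), (v, rec_v) in combinations(records.items(), 2):
        shared = [f for f in weights
                  if rec_u.get(f) is not None and rec_u.get(f) == rec_v.get(f)]
        if shared:
            # The association value is the sum of the weights of all shared dimensions.
            edges[(u, v)] = sum(weights[f] for f in shared)
    return edges

# Placeholder values chosen so that A/B share phone and surname,
# B/C share a contact, and C/D share an immediate relative.
records = {
    "A": {"phone": "p1", "surname": "s1", "contact": "c1", "relative": "r1"},
    "B": {"phone": "p1", "surname": "s1", "contact": "c2", "relative": "r2"},
    "C": {"phone": "p3", "surname": "s3", "contact": "c2", "relative": "r3"},
    "D": {"phone": "p4", "surname": "s4", "contact": "c4", "relative": "r3"},
}
edges = build_edges(records, FEATURE_WEIGHTS)
```

The pairwise comparison is quadratic in the number of records, which is acceptable for a toy example; a larger data set would more plausibly index records by feature value first.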

In the embodiment of the present invention, after the graph data of the data to be detected is obtained, the first group may be obtained according to the graph model and the fraud probability value of the data to be detected. There may be one or more first groups.

In the embodiment of the present invention, the graph model may be the Louvain modularity-based community detection model. Louvain is a graph community detection algorithm based on modularity that can be used to cluster network graphs; compared with other graph algorithms, its clustering results are more stable.

In S130, based on the association rule model and the fraud probability value of the data to be detected, the second group corresponding to a rule is obtained from the data to be detected.

In the embodiment of the present invention, the second group may be obtained, based on the association rule model, from the data to be detected for which additional feature dimensions have been constructed. There may be one or more second groups.

In the embodiment of the present invention, an association rule model for one or more rules can be obtained based on sample data. The fraud probability values of the data to be detected can then be filtered against a fraud threshold to select the data to be detected whose fraud probability value exceeds the threshold. The selected data is input into the association rule model, which outputs the second group corresponding to each rule in the selected data.

In the embodiment of the present invention, the association rule model (Association Rules) covers a whole family of algorithms and procedures rather than one specific algorithm. For example, the association rule model may cover the following algorithms: Apriori, Eclat, FP-Growth, Ripper, and C50.

In S140, the target fraud group in the data to be detected is determined based on the fraud probability value of the data to be detected, the first group, and the second group.

In the embodiment of the present invention, the straightness distance of the first group can be obtained based on the first group, and the lift of the corresponding rule can be obtained based on the second group. A scoring model can be determined based on the fraud probability value of the data to be detected. By inputting the fraud probability value, the first group, the straightness distance of the first group, the second group, and the lift of the rule into the scoring model, the fraud groups and the score of each fraud group can be output. The fraud groups are then sorted and filtered by score, and the target fraud group is determined from among them.

The method for obtaining the first group in the embodiment of the present invention is described in detail below with reference to specific embodiments.

Fig. 3 is a flowchart of a method for obtaining a first group according to an exemplary embodiment.

As shown in Fig. 3, the method may include, but is not limited to, the following steps:

In S310, multiple feature groups of the data to be detected are obtained based on the graph model.

In the embodiment of the present invention, after the graph data of the data to be detected is obtained, multiple feature groups of the data to be detected are obtained based on the graph model. Each feature group contains at least two data items to be detected that share the same value of that feature. For example, the mobile phone number group includes five data items to be detected, A, B, C, D, and E, where A and B have the same mobile phone number, and C, D, and E have the same mobile phone number.

In S320, the data to be detected whose fraud probability value exceeds the fraud threshold is obtained within each of the multiple feature groups.

In the embodiment of the present invention, based on the fraud probability value of each data item obtained in S110, the fraud probability values of the data to be detected in each feature group can be looked up. Comparing these values with the fraud threshold yields the data to be detected in each feature group that exceeds the fraud threshold. For example, in the above example, assuming that the fraud probability values of A, B, and C in the mobile phone number group exceed the fraud threshold, the data to be detected A, B, and C in the mobile phone number group are obtained.

It should be noted that this fraud threshold may be the same as the fraud threshold used in S130 to filter the fraud probability values of the data to be detected, or the two thresholds may be set separately for their respective scenarios.

In S330, each feature group is selected in which the proportion of data to be detected whose fraud probability value exceeds the fraud threshold, relative to all data to be detected in that feature group, exceeds a ratio threshold; each selected feature group is a first group.

In the embodiment of the present invention, after the data to be detected whose fraud probability value exceeds the fraud threshold is obtained for each feature group, the proportion of that data relative to all data to be detected in the corresponding feature group is determined. The feature groups whose proportion exceeds the ratio threshold are selected, and the selected feature groups are the first groups.

For example, in the above example, the data to be detected in the mobile phone number group whose fraud probability value exceeds the fraud threshold are A, B, and C, which account for 3/5=0.6 of the data to be detected in that group. Assuming that the ratio threshold is 0.5, the mobile phone number group is a first group.
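Steps S320 and S330 can be sketched as follows. The function name `select_first_groups`, the individual fraud probability values, and the fraud threshold of 0.5 are hypothetical values chosen for illustration; only the group membership, the 3/5 ratio, and the ratio threshold of 0.5 come from the example above.

```python
def select_first_groups(feature_groups, fraud_prob, fraud_threshold, ratio_threshold):
    """Keep the feature groups whose share of above-threshold items exceeds ratio_threshold."""
    selected = {}
    for name, members in feature_groups.items():
        if not members:
            continue  # an empty group has no defined ratio
        # S320: items in this group whose fraud probability exceeds the fraud threshold.
        flagged = [m for m in members if fraud_prob[m] > fraud_threshold]
        # S330: compare the flagged share with the ratio threshold.
        ratio = len(flagged) / len(members)
        if ratio > ratio_threshold:
            selected[name] = ratio
    return selected


feature_groups = {"phone": ["A", "B", "C", "D", "E"]}
# Hypothetical fraud probability values; A, B, and C exceed the threshold.
fraud_prob = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.2, "E": 0.1}

first_groups = select_first_groups(feature_groups, fraud_prob,
                                   fraud_threshold=0.5, ratio_threshold=0.5)
print(first_groups)
```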

It should be pointed out that the graph model may be applied again iteratively to the selected first groups, as appropriate.

In the embodiment of the present invention, the first group is determined jointly by the graph model and the fraud probability value of the data to be detected obtained from the boosting tree model. On the one hand this incorporates the label information of the boosting tree model, and on the other hand it improves the precision and recall of the first group obtained by the graph model.

According to the embodiment of the present invention, after the first group is obtained, the straightness distance of the first group can be obtained based on the distance, in the graph data, between each data item to be detected in the first group and the data to be detected that exceeds the fraud threshold.

In the embodiment of the present invention, the distance between two data items can be expressed as the number of edge tables between them. For example, in the graph database shown in Fig. 2, the distance between A and B is 1, the distance between A and C is 2, and the distance between A and D is 3.
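The edge-count distance described here can be computed with a breadth-first search. The following is a minimal sketch; the function name `hop_distances` is an assumption, and the edge list reproduces the A-B, B-C, C-D chain of Fig. 2.

```python
from collections import deque


def hop_distances(edges, source):
    """Return the number of edges on the shortest path from source to each reachable item."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


edges = [("A", "B"), ("B", "C"), ("C", "D")]
distances = hop_distances(edges, "A")
print(distances)
```

Items that cannot be reached from the source do not appear in the result, which matches the statement that unconnected data items belong to different graph databases.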

In the embodiment of the present invention, the straightness distance is the mean of the reciprocals of the distances between each data item in a group and the fraud data in its graph database. After the distance between each data item to be detected in the first group and the data to be detected that exceeds the fraud threshold is obtained, the mean of the reciprocals of the distances between each data item in the first group and the above-threshold data in its graph database can be obtained; this mean is the straightness distance of the first group. In the embodiment of the present invention, the straightness distance takes a value between 0 and 1 (after normalization). A larger value indicates that the data in the group is "closer" to the fraud data (black samples), that is, that the degree of fraud is higher. It should be noted that the graph database of a data item refers to the database in which edge tables exist between that data item and other data; if no edge table exists between two data items, the two are considered to belong to two different graph databases.

For example, in the above example, assuming that the fraud probability values of C and D exceed the fraud threshold, the straightness distance of the group including A, B, C, and D is the mean of the reciprocals of the distances from A, B, C, and D, respectively, to the other data.
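The text leaves some details of this calculation open, so the following sketch shows only one possible reading. It assumes that each group member contributes the reciprocal of its distance to the nearest other above-threshold item, and that a member with no reachable above-threshold item contributes 0. Both choices, and the function names, are assumptions made for illustration.

```python
from collections import deque


def hop_distances(adjacency, source):
    """Breadth-first search returning edge-count distances from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def straightness_distance(edges, group, flagged):
    """Mean reciprocal distance from each member to its nearest other flagged item (assumed reading)."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    reciprocals = []
    for member in group:
        dist = hop_distances(adjacency, member)
        reachable = [dist[f] for f in flagged if f != member and f in dist]
        # Assumption: unreachable fraud data contributes 0 to the mean.
        reciprocals.append(1 / min(reachable) if reachable else 0.0)
    return sum(reciprocals) / len(reciprocals)


edges = [("A", "B"), ("B", "C"), ("C", "D")]
value = straightness_distance(edges, group=["A", "B", "C", "D"], flagged=["C", "D"])
print(value)
```

Under this reading, A contributes 1/2 and B, C, and D each contribute 1, so the result stays within the range from 0 to 1 without further normalization. Other readings, such as averaging over all above-threshold items rather than the nearest one, would give different values.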

The method for obtaining the association rule model in the embodiment of the present invention is described in detail below with reference to specific embodiments.

Fig. 4 is a flowchart of a method for obtaining an association rule model according to an exemplary embodiment. As shown in Fig. 4, the method may include, but is not limited to, the following steps:

In S410, sample data is obtained.

In the embodiment of the present invention, the sample data may be historical data relating to fraud, together with the corresponding true results, namely white samples and black samples, where the black samples are fraud samples.

In S420, multiple rule groups of the sample data are obtained based on an initial association rule model.

In the embodiment of the present invention, the initial association rule model can be set up based on algorithms such as Apriori, Eclat, FP-Growth, Ripper, and C50. After additional feature dimensions are constructed from the multi-dimensional features of the sample data, multiple rule groups of the sample data are obtained based on the initial association rule model. For example, for the rule "unemployed, aged 20-30, male", the obtained rule group includes sample data A, B, C, and D.

In S430, the lift of the rule corresponding to each rule group is determined based on the true results of the sample data in the multiple rule groups.

In the embodiment of the present invention, lift is the ratio of "the proportion of transactions containing X that also contain Y" to "the proportion of transactions containing Y". As a formula: lift(X->Y)=conf(X->Y)/supp(Y)=P(X and Y)/(P(X)*P(Y))=conf(Y->X)/supp(X), where conf is the confidence and supp is the support. Lift reflects the correlation between the two parts (X and Y) of an association rule: a lift greater than 1 indicates positive correlation, which is stronger the higher the lift; a lift less than 1 indicates negative correlation, which is stronger the lower the lift; and a lift equal to 1 indicates no correlation. Lift can also be expressed as Lift=(P(A&B)/P(A))/P(B)=P(A&B)/P(A)/P(B).

In the embodiment of the present invention, after the lift is obtained, it is normalized. Lift can be used to measure the degree of fraud shared by a group: the larger the lift of a rule, the stronger the rule's ability to identify black samples, that is, the higher the degree of fraud of the samples that satisfy the rule. For example, suppose that in the above example the true results of samples A, B, and C are fraud samples (black samples) and D is a white sample, and that there are 10 samples in total, 5 of which are black. Then Lift = (proportion of black samples in the rule group) / (proportion of black samples among all samples) = 0.75/0.5 = 1.5.
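The worked example can be reproduced with a few lines of code. In this sketch the function name `rule_lift` and the identities of the samples outside the rule group (E through J) are hypothetical; the counts (10 samples, 5 black, 3 of the 4 rule-group members black) come from the example above. Normalization of the lift is not shown.

```python
def rule_lift(rule_group, labels):
    """Lift = (black share inside the rule group) / (black share over all samples)."""
    in_group = sum(labels[s] for s in rule_group) / len(rule_group)
    overall = sum(labels.values()) / len(labels)
    return in_group / overall


# 1 marks a black (fraud) sample, 0 a white sample; 5 of the 10 samples are black.
labels = {"A": 1, "B": 1, "C": 1, "D": 0, "E": 1,
          "F": 1, "G": 0, "H": 0, "I": 0, "J": 0}
rule_group = ["A", "B", "C", "D"]  # samples matching "unemployed, aged 20-30, male"

lift = rule_lift(rule_group, labels)
print(lift)
```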

In S440, the rule groups whose lift exceeds a lift threshold are selected.

According to the embodiment of the present invention, an adjustable lift threshold can be set.

In S450, the association rule model is obtained based on the rule groups, where the association rule model can obtain the rule corresponding to each rule group and the lift of the rule.

In the embodiment of the present invention, based on the selected rule groups whose lift exceeds the lift threshold, the association rule model corresponding to those rule groups can be obtained; the association rule model can obtain each rule and its lift.

For example, in the above example, assume that the lift threshold is 1. In the rule group corresponding to the rule "unemployed, aged 20-30, male", the true results of samples A, B, and C are fraud samples (black samples) and D is a white sample; there are 10 samples in total, 5 of which are black. The lift of the rule is 1.5, which is greater than the lift threshold, so the initial association rule model that produced this rule group is taken as the association rule model. The rule this association rule model can obtain is "unemployed, aged 20-30, male", and the lift of the rule is 1.5.
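The selection in S440 amounts to a threshold filter over the rules and their lifts. A minimal sketch follows; the function name `select_strong_rules` and the second rule with lift 0.8 are hypothetical and appear only to show a rule being rejected.

```python
def select_strong_rules(rule_lifts, lift_threshold):
    """Keep only the rules whose lift exceeds the (adjustable) lift threshold."""
    return {rule: lift for rule, lift in rule_lifts.items() if lift > lift_threshold}


rule_lifts = {
    "unemployed, aged 20-30, male": 1.5,  # from the example in the text
    "hypothetical weak rule": 0.8,        # invented for illustration
}

strong_rules = select_strong_rules(rule_lifts, lift_threshold=1)
print(strong_rules)
```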

In the embodiment of the present invention, lift is used to separate strong rules from weak ones, and all strong rules are merged. This brings in the advantages of the association rule model and improves the accuracy of identifying fraud groups; the presence of rules also strengthens the interpretability of the overall model.

According to the embodiment of the present invention, when the data to be detected is identified, the data to be detected whose fraud probability value exceeds the fraud threshold can be selected based on the obtained fraud probability values, and the selected data is input into the association rule model to obtain the second group corresponding to each rule.

For example, the data to be detected are A, B, and C. After the fraud probability values of A, B, and C are obtained based on the boosting tree model, if the fraud probability value of A is less than the fraud threshold, B and C are selected and input into the association rule model to obtain the second group.
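The filtering step in this example can be sketched as follows. The association rule model is represented here simply as a mapping from rule names to predicate functions, which is an assumption made for illustration, as are the record fields, the probability values, and the function name `second_groups`.

```python
def second_groups(records, fraud_prob, fraud_threshold, rules):
    """Filter by fraud probability, then collect the matching items for each rule."""
    # Keep only the items whose fraud probability exceeds the threshold.
    kept = [key for key in records if fraud_prob[key] > fraud_threshold]
    groups = {}
    for name, predicate in rules.items():
        members = [key for key in kept if predicate(records[key])]
        if members:
            groups[name] = members
    return groups


# Hypothetical records; all three match the rule, but A is below the fraud threshold.
records = {
    "A": {"occupation": None, "age": 25, "gender": "male"},
    "B": {"occupation": None, "age": 22, "gender": "male"},
    "C": {"occupation": None, "age": 29, "gender": "male"},
}
fraud_prob = {"A": 0.2, "B": 0.8, "C": 0.9}
rules = {
    "unemployed, aged 20-30, male": lambda r: (
        r["occupation"] is None and 20 <= r["age"] <= 30 and r["gender"] == "male"
    ),
}

groups = second_groups(records, fraud_prob, fraud_threshold=0.5, rules=rules)
print(groups)
```

Because A is removed before the rule is applied, the resulting second group contains only B and C.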

In the foregoing embodiment, the boosting tree model is fused with the association rule model, which increases the probability that the data in the second group is fraud data and strengthens the lift of the rules.

The method for obtaining the scoring model using sample data in the embodiment of the present invention is described in detail below with reference to specific embodiments. It should be pointed out that this embodiment is described using sample data as an example, but the present invention is not limited to this; for example, the sample data in this embodiment may be replaced with test data, data to be detected, or the like.

Fig. 5 is a flowchart of a method for obtaining a scoring model using sample data according to an exemplary embodiment. As shown in Fig. 5, the method may include, but is not limited to, the following steps:

In S510, the fraud probability value of the sample data is obtained based on the boosting tree model.

In S520, the first group is obtained according to the graph model and the fraud probability value of the sample data.

In S530, based on the association rule model and the fraud probability value of the sample data, the second group corresponding to a rule is obtained from the sample data.

In S540, a scoring model is determined based on the fraud probability value of the sample data.

In the embodiment of the present invention, the score of a fraud group obtained from an initial scoring model can be mapped to each data item to be detected in the fraud group to obtain the score of each data item in the fraud group. The weight in the initial scoring model is then determined based on the score and the fraud probability value of each data item in the fraud group, and the scoring model is obtained based on the weight. In the embodiment of the present invention, the scoring model can be expressed as follows:

Here, Score is the score of a fraud group and represents the probability that the group is a fraud group; Dist is the straightness distance; Lift is the lift; W is the weight; Probs is the fraud probability value; and Topn is a specific calculation that averages only the lifts of the n rules with the highest lift in the group rather than averaging over all rules.
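The Topn calculation described above can be sketched as follows. The function name `top_n_mean` and the sample lift values are hypothetical; only the rule "average the n highest lifts rather than all of them" comes from the text.

```python
def top_n_mean(lifts, n):
    """Average only the n highest lift values of a group's rules."""
    top = sorted(lifts, reverse=True)[:n]
    if not top:
        return 0.0  # assumption: a group without rules contributes 0
    return sum(top) / len(top)


group_lifts = [1.5, 1.2, 0.9, 2.0]  # hypothetical lifts of the rules matched by one group
top2 = top_n_mean(group_lifts, n=2)
print(top2)
```

Averaging only the strongest rules keeps a group's score from being diluted by the many weak rules it may also happen to match.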

In the embodiment of the present invention, to determine W in the above formula (1), an initial W can be set; the model corresponding to this initial W is the initial scoring model. The score of a fraud group can be obtained from the initial scoring model, and this group Score is mapped to each sample in the fraud group to obtain the Score of each sample. The initial scoring model is then automatically calculated or trained by maximizing the Pearson similarity coefficient between each sample's Score and its fraud probability value Probs, which determines W. Note that in this case, even without sample data, the initial scoring model can still be trained automatically based on the fraud probability values of the data to be detected to determine W and hence the scoring model.

For example, W can be determined by the following formula:

$$W = \underset{W}{\arg\max}\ \mathrm{Similarity}(\mathrm{Score}, \mathrm{Probs}) \tag{2}$$

It should be noted that, in the above formula, Score denotes the score of each sample in the fraud group.
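One way to carry out the maximization in formula (2) is a simple grid search over W. The sketch below rests on several assumptions that are not confirmed by the text: the per-sample score is taken to be W*Dist + (1-W)*Lift with a single weight in the range from 0 to 1, the lift values are assumed to be already normalized and reduced by Topn, and the sample values and function names are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def fit_weight(dist, lift, probs, steps=100):
    """Grid-search W in [0, 1] to maximize the correlation of Score with Probs."""
    best_w, best_r = 0.0, float("-inf")
    for i in range(steps + 1):
        w = i / steps
        # Assumed scoring form: a weighted sum of the two group-level signals.
        scores = [w * d + (1 - w) * l for d, l in zip(dist, lift)]
        r = pearson(scores, probs)
        if r > best_r:
            best_w, best_r = w, r
    return best_w, best_r


# Hypothetical per-sample values: each sample inherits Dist and Lift from its group.
dist = [0.9, 0.9, 0.4, 0.4, 0.2]
lift = [0.3, 0.3, 0.8, 0.8, 0.1]
probs = [0.95, 0.85, 0.6, 0.7, 0.1]

best_w, best_r = fit_weight(dist, lift, probs)
print(best_w, best_r)
```

A grid search is used here only because it is easy to verify; any one-dimensional optimizer could be substituted.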

In the above embodiment, after the fraud probability values of the sample data, the first group, and the second group are obtained, the Pearson similarity coefficient between the score of each sample in the fraud group and the fraud probability value of that sample is maximized, so that W is learned in a supervised manner. This determines the scoring model and the fraud groups and improves the accuracy of identifying the target fraud group.

It should be noted that the initial scoring model can be trained not only on the fraud probability values of the sample data but also on the true results of the sample data. For example, the true fraud probability of a fraud group is determined from the true result of each sample in the fraud group, and W is then determined by maximizing the Pearson similarity coefficient between the sample's true fraud probability and the sample's score.

In S550, the fraud probability value, the first group, the straightness distance of the first group, the second group, and the lift of the rule are input into the scoring model to determine the target fraud group.

In the embodiment of the present invention, the scoring model can be determined after the fraud probability value of the sample data is determined. By inputting the fraud probability value, the first group, the straightness distance of the first group, the second group, and the lift of the rule into the scoring model, the fraud groups and the score of each fraud group can be output. The fraud groups are then sorted and filtered by score, and the target fraud group is determined from among them.

In the above embodiment of the present invention, the scoring model is trained automatically, which makes the whole process more automated. The "straightness distance" output by the graph model and the "lift" output by the association rule model are combined in a supervised weighted sum. "Supervised" here means that the weight is calculated automatically, without manual intervention, by maximizing the Pearson similarity coefficient between each sample's Score and the Probs output for that sample by LightGBM.

According to the embodiment of the present invention, after the scoring model is obtained, the fraud probability value obtained from the data to be detected, the first group, the straightness distance of the first group, the second group, and the lift of the rule can be input into the scoring model to obtain the score of each fraud group. The fraud group with the highest score, or the fraud groups whose scores exceed a score threshold, are selected as the target fraud groups, so that the target fraud group is determined from the data to be detected.
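The final selection step can be sketched as a sort followed by an optional cut-off. The function name `select_target_groups`, its parameters, and the group scores below are hypothetical; the two selection modes (highest score, or scores above a score threshold) follow the text.

```python
def select_target_groups(group_scores, top_n=None, score_threshold=None):
    """Rank fraud groups by score, then keep the top N and/or those above a threshold."""
    ranked = sorted(group_scores.items(), key=lambda item: item[1], reverse=True)
    if score_threshold is not None:
        ranked = [(group, score) for group, score in ranked if score > score_threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [group for group, _ in ranked]


group_scores = {"g1": 0.91, "g2": 0.40, "g3": 0.75}  # hypothetical scores

highest = select_target_groups(group_scores, top_n=1)
above = select_target_groups(group_scores, score_threshold=0.7)
print(highest, above)
```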

The data processing method in the embodiment of the present invention is described in detail below with reference to specific embodiments.

Fig. 6 is a schematic diagram of data flow between models according to an exemplary embodiment. In the embodiment of the present invention, the models may include the boosting tree model LightGBM, the graph model Louvain, the association rule model Association Rules, and the scoring model Score.

As shown in Fig. 6, the method may include, but is not limited to, the following process:

In S601, the feature engineering data of the sample data is obtained and sent to the LightGBM model and the Association Rules model.

In the embodiment of the present invention, feature engineering on the sample data may include constructing additional feature dimensions from the multi-dimensional features of the sample data. The feature engineering data of the sample data refers to the resulting higher-dimensional feature data.

In S602, the LightGBM model obtains the fraud probability value of the sample data from the input feature engineering data.

In S603, the LightGBM model sends the fraud probability value to the Louvain model, the Association Rules model, and the Score model.

In S604, the graph data of the sample data is obtained and sent to the Louvain model.

In S605, the Louvain model obtains the first group and the straightness distance of the first group based on the graph data and the fraud probability value.

In the embodiment of the present invention, the Louvain model can be verified multiple times, for example, one or more times with validation set data and two or more times with test set data.

In S606, the Louvain model sends the first group and the straightness distance of the first group to the scoring model.

In S607, the Association Rules model obtains the second group and the lift of the corresponding rule based on the feature engineering data and the fraud probability value.

In S608, the Association Rules model sends the second group and the lift of the corresponding rule to the scoring model.

In S609, the scoring model obtains the fraud groups and the score of each group according to the fraud probability value, the first group and its straightness distance, and the second group and the lift of the corresponding rule.
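The data flow of S601 to S609 can be summarized as a single orchestration function. In the sketch below, every model is passed in as a plain callable and replaced by a trivial stub; the function name `run_flow`, the parameter names, and the stub return values are assumptions made for illustration and do not represent the actual LightGBM, Louvain, or association rule implementations.

```python
def run_flow(samples, engineer, boosting_tree, build_graph,
             graph_model, rule_model, scoring_model):
    """Wire the models together in the order described by S601 to S609."""
    features = engineer(samples)                              # S601
    probs = boosting_tree(features)                           # S602, S603
    graph = build_graph(samples)                              # S604
    first_groups, distances = graph_model(graph, probs)       # S605, S606
    second_groups, lifts = rule_model(features, probs)        # S607, S608
    return scoring_model(probs, first_groups, distances,
                         second_groups, lifts)                # S609


# Stub callables standing in for the real models (hypothetical return values).
result = run_flow(
    samples=["A", "B"],
    engineer=lambda s: {key: [1.0] for key in s},
    boosting_tree=lambda f: {key: 0.9 for key in f},
    build_graph=lambda s: [("A", "B")],
    graph_model=lambda g, p: ({"g1": ["A", "B"]}, {"g1": 1.0}),
    rule_model=lambda f, p: ({"r1": ["A", "B"]}, {"r1": 1.5}),
    scoring_model=lambda p, fg, d, sg, l: {"g1": d["g1"]},
)
print(result)
```

Passing the models in as callables keeps the flow itself independent of any particular library, so each stage can be replaced or tested separately.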

It should be pointed out that the fraud probability value of a group can be determined from the fraud probability values of its individual data items, and the scoring model can be determined by maximizing the Pearson similarity coefficient between the fraud probability value of the target fraud group and Score, which in turn determines the target fraud groups and the score of each group.

In the embodiment of the present invention, after the fraud groups and the score of each group are obtained, the fraud groups can be sorted by score, and the Top N are selected as the target fraud groups according to the ranking.

It should be noted that N, the total number of samples in the above groups, may depend on the total number of samples (for example, 2 million) and the proportion of fraud samples (for example, two per thousand), giving, for example, N=4000. The target fraud data can be used in any anti-fraud scenario; for example, it can be handed to business personnel to identify, anticipate, and analyze crimes committed by gangs.

It should be clearly understood that the present invention describes how to form and use specific examples, but the principles of the present invention are not limited to any details of these examples. On the contrary, based on the teaching of the disclosure of the present invention, these principles can be applied to many other embodiments.

The following are apparatus embodiments of the present invention, which can be used to carry out the method embodiments of the present invention. In the following description of the apparatus, the parts that are the same as in the foregoing method are not repeated.

Fig. 7 is a schematic structural diagram of a data processing apparatus according to an exemplary embodiment, wherein the apparatus 700 includes:

a first obtaining module 710, configured to obtain the fraud probability value of the data to be detected based on the boosting tree model;

a second obtaining module 720, configured to obtain the first group according to the graph model and the fraud probability value of the data to be detected;

a third obtaining module 730, configured to obtain, from the data to be detected, the second group corresponding to a rule based on the association rule model and the fraud probability value of the data to be detected; and

a determining module 740, configured to determine the target fraud group in the data to be detected based on the fraud probability value of the data to be detected, the first group, and the second group.

图8是根据一示例性实施例示出的一种电子设备的结构示意图。需要说明的是，图8示出的电子设备仅仅是一个示例，不应对本申请实施例的功能和使用范围带来任何限制。Fig. 8 is a schematic structural diagram of an electronic device according to an exemplary embodiment. It should be noted that the electronic device shown in FIG. 8 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present application.

As shown in Fig. 8, the computer system 800 includes a central processing unit (CPU) 801, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage section 808 into a random access memory (RAM) 803. The RAM 803 also stores various programs and data required for the operation of the system 800. The CPU 801, the ROM 802, and the RAM 803 are connected to each other through a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card or a modem. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read from it can be installed into the storage section 808 as needed.

In particular, according to an embodiment of the present disclosure, the process described above with reference to the flowchart can be implemented as a computer software program. For example, the embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 809, and/or installed from the removable medium 811. When the computer program is executed by the central processing unit (CPU) 801, the above-mentioned functions defined in the terminal of the present application are performed.

It should be noted that the computer-readable medium shown in this application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this application, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. In this application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, RF, or any suitable combination of the above.

The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams or flowcharts, and each combination of blocks in the block diagrams or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The units described in the embodiments of the present application can be implemented in software or in hardware. The described units may also be provided in a processor; for example, it may be described as: a processor including a first obtaining module, a second obtaining module, a third obtaining module, and a determining module. The names of these modules do not, under certain circumstances, constitute a limitation on the modules themselves.

The exemplary embodiments of the present invention have been specifically shown and described above. It should be understood that the present invention is not limited to the detailed structures, arrangements, or implementation methods described herein; on the contrary, the present invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims

A data processing method, characterized in that the method comprises:

obtaining a fraud probability value of data to be detected based on a boosting tree model;

obtaining a first group according to a graph model and the fraud probability value of the data to be detected;

obtaining, from the data to be detected, a second group corresponding to a rule based on an association rule model and the fraud probability value of the data to be detected;

determining a target fraud group in the data to be detected based on the fraud probability value of the data to be detected, the first group, and the second group.
The method according to claim 1, characterized in that, before obtaining the first group according to the graph model and the fraud probability value of the data to be detected, the method comprises:

taking each item of data to be detected as a vertex table, extracting identical dimension features among the data to be detected as an edge table, and calculating an association value of the edge table according to the weight of each dimension feature;

generating graph data of the data to be detected according to the vertex table, the edge table, and the association value of the edge table.
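One way to read the graph-construction steps above is sketched below. It is an illustration under stated assumptions, not the claimed method itself: the function name `build_graph_data`, the feature names, and the weights are hypothetical, and summing the weights of all shared dimension features is only one plausible way to compute an edge's association value.

```python
from collections import defaultdict
from itertools import combinations


def build_graph_data(records, weights):
    """Return (vertices, edges) for a list of record dicts.

    records: list of dicts mapping feature name -> hashable value.
    weights: dict mapping feature name -> weight (hypothetical values).
    An edge joins two records that share the same value for a feature; its
    association value is assumed to be the sum of the weights of all shared features.
    """
    vertices = list(range(len(records)))

    # Bucket record indices by (feature, value) so only matching records are paired.
    buckets = defaultdict(list)
    for idx, rec in enumerate(records):
        for feature in weights:
            value = rec.get(feature)
            if value is not None:
                buckets[(feature, value)].append(idx)

    edges = defaultdict(float)
    for (feature, _value), members in buckets.items():
        # Indices were appended in increasing order, so u < v for every pair.
        for u, v in combinations(members, 2):
            edges[(u, v)] += weights[feature]
    return vertices, dict(edges)


records = [
    {"device": "d1", "ip": "10.0.0.1"},
    {"device": "d1", "ip": "10.0.0.1"},
    {"device": "d2", "ip": "10.0.0.1"},
]
weights = {"device": 0.5, "ip": 0.25}
vertices, edges = build_graph_data(records, weights)
```

Records 0 and 1 share both features, so their edge carries both weights; record 2 shares only the IP address with the other two.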
The method according to claim 2, characterized in that obtaining the first group according to the graph model and the fraud probability value of the data to be detected comprises:

obtaining multiple feature groups of the data to be detected based on the graph model;

obtaining, in each of the multiple feature groups, the data to be detected whose fraud probability value exceeds a fraud threshold;

screening out the feature groups in which the proportion of data to be detected whose fraud probability value exceeds the fraud threshold, relative to the data to be detected in the corresponding feature group, exceeds a ratio threshold, the screened-out feature groups being the first group.
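The screening step above can be sketched as a simple ratio filter. The function name `select_first_groups`, the sample probabilities, and both threshold values are assumptions chosen for illustration; how the feature groups themselves are produced by the graph model is left outside the sketch.

```python
def select_first_groups(feature_groups, fraud_prob, fraud_threshold, ratio_threshold):
    """Keep feature groups whose share of high-probability members exceeds ratio_threshold.

    feature_groups: iterable of sets of record ids (output of some graph model).
    fraud_prob: dict mapping record id -> fraud probability value.
    """
    selected = []
    for group in feature_groups:
        if not group:
            continue  # an empty group has no meaningful ratio
        flagged = sum(1 for member in group if fraud_prob[member] > fraud_threshold)
        if flagged / len(group) > ratio_threshold:
            selected.append(group)
    return selected


fraud_prob = {0: 0.9, 1: 0.8, 2: 0.1, 3: 0.2, 4: 0.7}
feature_groups = [{0, 1, 2}, {3, 4}]
first_groups = select_first_groups(
    feature_groups, fraud_prob, fraud_threshold=0.5, ratio_threshold=0.6
)
```

In this example two of three members of the first group exceed the fraud threshold (ratio about 0.67), so it is kept; the second group has a ratio of 0.5 and is dropped.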
The method according to claim 3, characterized in that the method further comprises: obtaining the association rule model;

obtaining sample data;

obtaining multiple rule groups of the sample data based on an initial association rule model;

determining the lift of the rule corresponding to each rule group based on the real results of the sample data in the multiple rule groups;

screening out the rule groups whose lift exceeds a lift threshold;

obtaining the association rule model based on the screened-out rule groups, wherein the association rule model is able to obtain the rule corresponding to each rule group and the lift of the rule.
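The rule-screening steps above can be illustrated with the standard association-rule metric, where the lift of a rule is the fraud rate among the samples matching the rule divided by the overall fraud rate. The sketch assumes binary ground-truth labels; the function names, rule names, labels, and threshold are hypothetical.

```python
def rule_lift(labels, rule_group):
    """Lift of 'rule -> fraud': P(fraud | rule) / P(fraud).

    labels: list of 0/1 ground-truth fraud labels for all samples.
    rule_group: indices of the samples that satisfy the rule.
    """
    base_rate = sum(labels) / len(labels)
    if base_rate == 0 or not rule_group:
        raise ValueError("lift is undefined without fraud samples or rule matches")
    group_rate = sum(labels[i] for i in rule_group) / len(rule_group)
    return group_rate / base_rate


def keep_rules(labels, rule_groups, lift_threshold):
    """Return {rule: lift} for the rules whose lift exceeds lift_threshold."""
    kept = {}
    for rule, group in rule_groups.items():
        lift = rule_lift(labels, group)
        if lift > lift_threshold:
            kept[rule] = lift
    return kept


labels = [1, 1, 0, 0, 0, 0, 1, 0]          # overall fraud rate 3/8
rule_groups = {
    "same_device": [0, 1, 6],              # fraud rate 1.0 within the group
    "night_login": [2, 3, 6, 7],           # fraud rate 0.25 within the group
}
kept = keep_rules(labels, rule_groups, lift_threshold=1.5)
```

A lift above 1 means samples matching the rule are more often fraudulent than samples overall, which is why a threshold on lift is a natural filter for rules.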
The method according to claim 4, characterized in that obtaining, from the data to be detected, the second group corresponding to the rule based on the association rule model and the fraud probability value of the data to be detected comprises:

screening out the data to be detected whose fraud probability value exceeds the fraud threshold;

inputting the data to be detected into the association rule model to obtain the second group corresponding to the rule.
The method according to claim 5, characterized in that determining the target fraud group in the data to be detected based on the fraud probability value of the data to be detected, the first group, and the second group comprises:

obtaining a direct/indirect-degree distance (直间度距离) of the first group based on the first group;

determining a scoring model based on the fraud probability value of the data to be detected;

inputting the fraud probability value, the first group, the direct/indirect-degree distance of the first group, the second group, and the lift of the rule into the scoring model to determine the target fraud group in the data to be detected.
The method according to claim 6, characterized in that obtaining the direct/indirect-degree distance of the first group based on the first group comprises:

obtaining the direct/indirect-degree distance of the first group based on the distance, in the graph data, between each item of data to be detected in the first group and the data to be detected that exceeds the fraud threshold.
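The text does not say how the distance between each item in the first group and the items exceeding the fraud threshold is measured or aggregated. The sketch below shows one plausible reading only: hop count in the graph to the nearest above-threshold member (1 hop being a direct link, more hops an indirect one), averaged over the group. The function names, the adjacency-list format, and the choice of the mean are all assumptions.

```python
from collections import deque


def hops_to_flagged(adjacency, group, flagged):
    """Multi-source BFS restricted to `group`.

    Returns {member: hop count to the nearest flagged member}; members that
    cannot reach any flagged member inside the group are omitted.
    """
    dist = {node: 0 for node in group if node in flagged}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor in group and neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def group_distance(adjacency, group, flagged):
    """Mean hop distance over reachable members (assumed aggregation)."""
    dist = hops_to_flagged(adjacency, group, flagged)
    if not dist:
        return float("inf")  # no above-threshold member in the group
    return sum(dist.values()) / len(dist)


# A chain 0 - 1 - 2 - 3 in which only record 0 exceeds the fraud threshold.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
group = {0, 1, 2, 3}
hops = hops_to_flagged(adjacency, group, flagged={0})
mean_hops = group_distance(adjacency, group, flagged={0})
```

A smaller value indicates that group members sit close to known high-probability records, which is the kind of signal a downstream scoring model could weigh.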
The method according to claim 6, characterized in that determining the scoring model based on the fraud probability value of the data to be detected comprises:

mapping the score of a fraud group obtained from an initial scoring model to each item of data to be detected in the fraud group, to obtain a score for each item of data to be detected in the fraud group;

determining the weights in the initial scoring model based on the score and the fraud probability value of each item of data to be detected in the fraud group;

obtaining the scoring model based on the weights.
A data processing device, characterized in that the device comprises:

a first obtaining module, configured to obtain a fraud probability value of data to be detected based on a boosting tree model;

a second obtaining module, configured to obtain a first group according to a graph model and the fraud probability value of the data to be detected;

a third obtaining module, configured to obtain, from the data to be detected, a second group corresponding to a rule based on an association rule model and the fraud probability value of the data to be detected;

a determining module, configured to determine a target fraud group in the data to be detected based on the fraud probability value of the data to be detected, the first group, and the second group.
An electronic device, characterized in that it comprises:

one or more processors;

a storage device for storing one or more programs;

wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-8.
A computer-readable medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-8.