WO2020042795A1 - Sample attribute evaluation model training method, apparatus, and server - Google Patents

样本属性评估模型训练方法、装置及服务器 (Sample attribute evaluation model training method, apparatus, and server)

Info

Publication number
WO2020042795A1
WO2020042795A1 (PCT/CN2019/096287)
Authority
WO
WIPO (PCT)
Prior art keywords
sample
black
samples
community
attribute evaluation
Prior art date
Application number
PCT/CN2019/096287
Other languages
English (en)
French (fr)
Inventor
王修坤
赵婷婷
刘斌
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Publication of WO2020042795A1 publication Critical patent/WO2020042795A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques

Definitions

  • the embodiments of the present specification relate to the field of Internet technologies, and in particular, to a method, a device, and a server for training a sample attribute evaluation model.
  • the embodiments of the present specification provide a sample attribute evaluation method, apparatus, and server.
  • in a first aspect, an embodiment of the present specification provides a sample attribute evaluation method, including: determining the black sample concentration of each community in a relationship graph corresponding to training samples, where the training samples include black samples and unknown samples; determining, based on the black sample concentration of each community, a white sample sampling probability for each of the unknown samples, and sampling with the white sample sampling probability of each of the unknown samples to obtain white samples; and training the black samples and the white samples based on a semi-supervised machine learning algorithm to obtain a target sample attribute evaluation model.
  • in a second aspect, an embodiment of the present specification provides a sample attribute evaluation model training device, including: a first determining unit, configured to determine the black sample concentration of each community in a relationship graph corresponding to training samples, where the training samples include black samples and unknown samples; a second determining unit, configured to determine a white sample sampling probability for each of the unknown samples based on the black sample concentration of each community, and to sample with the white sample sampling probability of each of the unknown samples to obtain white samples; and a training unit, configured to train the black samples and the white samples based on a semi-supervised machine learning algorithm to obtain a target sample attribute evaluation model.
  • in a third aspect, an embodiment of the present specification provides a server, including a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the program, the steps of any one of the foregoing sample attribute evaluation methods are implemented.
  • an embodiment of the present specification provides a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps of the sample attribute evaluation method according to any one of the above.
  • the training samples include only a small number of black samples with confirmed attributes, while most are unknown samples whose attributes have not been confirmed.
  • even when the number of known black samples is small, the method in this embodiment can mine potential black samples from the unknown samples and then determine the white samples required for model training, meeting the model training requirements so that the trained model can accurately evaluate whether a sample has the black sample attribute.
  • FIG. 1 is a schematic diagram of a sample attribute evaluation application scenario according to an embodiment of the present specification
  • FIG. 2 is a flowchart of a sample attribute evaluation method according to the first aspect of the embodiment of the present specification
  • FIG. 3 is a schematic structural diagram of a training device for a sample attribute evaluation model according to a second aspect of the embodiment of the present specification
  • FIG. 4 is a schematic structural diagram of a sample attribute evaluation server according to a third aspect of the embodiment of the present specification.
  • FIG. 1 is a schematic diagram of a sample attribute evaluation application scenario according to an embodiment of the present specification.
  • the terminal 100 is located on the user side and communicates with the server 200 on the network side.
  • the user can generate real-time events and some business data through the APP or website in the terminal 100.
  • the server 200 collects real-time events generated by each terminal, and can then select training samples.
  • the embodiments of this specification can be applied to risk control scenarios such as risk sample identification or fraud insurance sample identification in insurance claims, and can also be applied to two-class classification scenarios.
  • an embodiment of the present specification provides a sample attribute evaluation method. Please refer to FIG. 2, which includes steps S201-S203.
  • S201 Determine the black sample concentration of each community in the relationship diagram corresponding to the training samples, where the training samples include black samples and unknown samples;
  • S202 Determine a white sample sampling probability of each of the unknown samples based on the black sample concentration of each community, and sample with the white sample sampling probability of each of the unknown samples to obtain a white sample;
  • S203 Training the black samples and the white samples based on a semi-supervised machine learning algorithm to obtain a target sample attribute evaluation model.
  • a training sample is first determined through step S201.
  • the training sample may be service data generated by each terminal.
  • the training samples include black samples whose attributes have already been labeled, as well as unknown samples whose attributes are unknown.
  • for example, in an insurance claims scenario, the training samples are the data of users who applied for claims, where the samples corresponding to insurance fraud users are determined to be black samples; in this scenario there are few black samples with confirmed fraud and a lack of large-scale black sample labeling, which greatly reduces the accuracy of a sample attribute evaluation model, so solving the model training problem in this scenario is very important.
  • the method in this embodiment can combine the community attributes of the samples with a semi-supervised machine learning algorithm to mine potential black samples from a large number of unknown samples, reach the number of black samples required for model training, and filter out white samples with high trust; the purity of the black and white samples is thus ensured during training, so that model training is completed and a highly accurate sample attribute evaluation model is obtained.
  • the black sample concentration of each community in the relationship graph corresponding to the training samples is then determined through step S201.
  • a relationship graph including training samples needs to be constructed in advance.
  • each training sample corresponds to a node
  • the constructed relationship graph may include only the nodes corresponding to the training samples, or may be the relationship graph corresponding to the nodes in the entire network.
  • the construction process of the graph can be to obtain the historical events of each node within a predetermined period of time, based on the historical events, determine the relationship graph according to a preset composition method, and use a preset community discovery algorithm to divide the nodes in the relationship graph into communities.
  • Each node corresponds to the community label to which the node belongs.
  • the preset time period can be specified in advance, and the preset composition method needs to define the following: the definition of the node, the definition of the edge, and the definition of the weight value of the edge.
  • different composition rules can be adopted in different scenarios and different implementations.
  • for example, in an insurance claims scenario, the preset composition method can be: take users as nodes; if two users have had a financial transaction (such as a transfer) within half a year, connect the two users with an edge, and the weight of the edge can be the number of transfers between the two users.
  • one or more preset community discovery algorithms are run on the relationship graph constructed above, so that each point gets a community label of the community to which the node belongs.
  • the preset community discovery method may be a label propagation algorithm (LPA, Label Propagation Algorithm) or a fast unfolding algorithm (FU, Fast Unfolding), among others; there is no limitation in this application.
  • Step1 Each point on the graph uses its own point id as its own label;
  • Step2 Each point gets its neighbor labels from its neighbors
  • Step3 After each point receives labels from all neighbors, it takes the label that appears most often among the received labels as its own label (for a weighted graph, the label with the highest total weight). If several labels are tied for the most occurrences, one of them is selected as its own label;
  • Step4 Output the label on each point as its own community label.
  • Step3 Repeat Step2 until no point's label changes;
  • Step4 Take each community obtained in Step3 as a point, and repeat Step2 until all communities are unchanged;
  • Step5 Output the label on each point as its own community label.
  • the black sample concentration of each community can be calculated.
  • the black sample concentration of each community can be determined in the following three ways:
  • the first way: determine the first proportion of the nodes corresponding to all black samples in each community among the total nodes of that community, and use the first proportion as the black sample concentration of the community.
  • the second way: determine the second proportion of the nodes corresponding to all black samples in each community among the total nodes of the relationship graph, and use the second proportion as the black sample concentration of the community.
  • the third way: determine the third proportion of the nodes corresponding to all black samples in each community among the total nodes of that community and the fourth proportion of the total nodes of that community among the total nodes of the relationship graph, obtain a weighted average of the third proportion and the fourth proportion, and use the weighted average as the black sample concentration of the community.
  • the black sample concentration may be defined as the number of black samples in a community divided by the total number of community nodes.
  • Community A includes 5 nodes in total, and one node is the node corresponding to the black sample.
  • the black sample concentration of community A is 1/5
  • the black sample concentration of all nodes in the community is 1/5.
  • Community A includes a total of 5 nodes, of which one node is the node corresponding to the black sample, and the relationship graph includes 10 nodes.
  • the black sample concentration of community A is 1/10.
  • the black sample concentration of all nodes in the community is 1/10.
  • the black sample concentration of community A can be calculated as K1 * 1/5 + K2 * 5/10, where K1 and K2 represent weighting coefficients, which can be set according to actual needs.
  • the black sample concentration of all nodes in the community is 0.2*K1 + 0.5*K2.
  • the definition method of the black sample concentration can be set according to actual needs, and there is no limitation in this application.
  • after the black sample concentration of each sample has been determined, step S202 determines the white sample sampling probability of each unknown sample based on the black sample concentration of each community, and samples with the white sample sampling probability of each of the unknown samples to obtain white samples.
  • the method of this embodiment incorporates the community attributes when mining potential black samples from unknown samples: even if a small portion of the unknown samples do not exhibit the true characteristics of black samples at the current moment, deep mining based on the community characteristics can genuinely expand the proportion of black samples so that it meets the requirements for model training.
  • the white sample sampling probability of each unknown sample can be set to a fixed value. For example, if there are 100 unknown samples, the white sample sampling probability of each unknown sample can be set to 1/100 during the first training. In the subsequent multiple rounds of training, the black sample concentration of the unknown sample is combined to determine the white sample sampling probability of the unknown sample.
  • the unknown samples can then be sampled for white samples according to their respective white sample sampling probabilities to determine the extracted white samples; then, in step S203, the black samples whose attributes have already been labeled are combined with them, and the black and white samples are trained based on a semi-supervised machine learning algorithm to obtain a sample attribute evaluation model.
  • the specific implementation may include the following steps: training the black samples and the white samples based on the semi-supervised machine learning algorithm to obtain a sample attribute evaluation model; judging whether the sample attribute evaluation model satisfies a preset convergence condition; and, if not, updating the black sample concentration of each community and continuing training until the trained model satisfies the preset convergence condition.
  • the semi-supervised machine learning algorithm used includes a PU (Positive and Unlabeled Learning) machine learning algorithm, a semi-supervised machine learning algorithm in which only some of the training samples used to train the machine learning model are labeled while the remaining training samples are unlabeled; the unlabeled samples are used to assist the learning process over the labeled samples. Among the training samples collected by the modeling party there are only a few labeled black samples, and the remaining samples are unlabeled unknown samples, so the machine learning process operates on labeled positive samples and unlabeled samples.
  • these training samples can be trained based on semi-supervised machine learning algorithms to build a sample attribute evaluation model.
  • semi-supervised machine learning algorithms can often include multiple machine learning strategies.
  • typical strategies include the two-stage strategy and the cost-sensitive strategy. In the two-stage strategy, the algorithm first mines potential reliable negative samples from the unlabeled samples based on the known positive samples and the unlabeled samples, and then, based on the known positive samples and the mined reliable negative samples, transforms the problem into a traditional supervised machine learning process to train the classification model.
  • in the cost-sensitive strategy, the algorithm assumes that the proportion of positive samples among the unlabeled samples is extremely low, treats the unlabeled samples directly as negative samples, and sets a higher cost-sensitive weight for positive samples than for negative samples.
  • for example, a higher cost-sensitive weight is usually set for the loss term corresponding to positive samples in the objective function of a cost-sensitive semi-supervised machine learning algorithm, so that the cost of misclassifying a positive sample is far greater than that of misclassifying a negative sample; the positive samples and the unlabeled samples (treated as negative samples) are then used to learn a cost-sensitive classifier that classifies unknown samples.
  • the training samples may be trained based on a cost-sensitive semi-supervised machine learning algorithm, or the training samples may be trained using a two-stage method. In the specific implementation process, it can be set as required, and there is no limitation in this application.
  • a two-stage semi-supervised machine learning algorithm is mainly introduced in detail.
  • the black samples in the above training sample set are marked as 1, indicating that the sample is insurance data known to be fraudulent, and the white samples are marked as 0, indicating that the insurance data corresponding to the training sample is normal.
  • after a binary classification model is trained on the black samples and the sampled white samples, the sample attribute evaluation model is obtained; the sample attribute evaluation model is then used to evaluate the unknown samples, and each unknown sample is given a black sample score for being marked as a black sample.
  • the black sample score is a value ranging from 0 to 1, indicating the probability that the unknown sample belongs to the black sample.
  • the black samples, white samples, and the corresponding black sample scores can also be defined in other ways, and this application is not limited here.
  • the training samples are trained for multiple rounds in this manner, and the corresponding sample attribute evaluation model is obtained after each round of training; it is necessary to determine whether the sample attribute evaluation model satisfies the preset convergence condition. If the model has converged, the sample attribute evaluation model obtained in that round of training is used as the final target sample attribute evaluation model; if the model has not yet converged, the black sample concentration of each unknown sample is updated and training continues as described above until the trained model reaches the convergence condition.
  • determining whether the model converges can be implemented by the following steps:
  • each unknown sample is evaluated to obtain the current round of attribute evaluation results for each unknown sample, and a total of M current round of attribute evaluation results are obtained, where M is the number of unknown samples;
  • each unknown sample is evaluated based on the sample attribute evaluation model to obtain the current round of attribute evaluation results for each unknown sample, including:
  • each unknown sample is evaluated to obtain a black sample score for each unknown sample; if the black sample score is greater than a preset score, the attribute information of the unknown sample is marked as a black sample, where the attribute evaluation result of each unknown sample includes the attribute information of that unknown sample.
  • a sample attribute evaluation model corresponding to the round is obtained in each round of training, the model is used to obtain the black sample score of each unknown sample, and the unknown samples may be labeled according to the score.
  • the black sample score value is greater than a preset score value
  • attribute information of the unknown sample is marked as a black sample.
  • the preset score is set to 0.8
  • the black sample score of unknown sample 1 is 0.9
  • the attribute information of the unknown sample 1 is marked as a black sample.
  • the black sample score of unknown sample 2 is 0.4, so the attribute information of unknown sample 2 remains unchanged and it is still a sample with unknown attributes.
  • in this way, the attribute evaluation result of each unknown sample in this round of training can be determined, and the evaluation result may include the black sample score and attribute information of the unknown sample in this round of training. If the number of unknown samples is M, then M current-round attribute evaluation results are obtained.
  • the attribute evaluation results corresponding to the unknown samples in the previous round of training, that is, the M previous-round attribute evaluation results, can also be obtained.
  • the attribute information in each of the M previous-round attribute evaluation results is used to determine which unknown samples were marked as black samples in the previous round of training, and the attribute information in each of the M current-round attribute evaluation results is used to determine which unknown samples are marked as black samples in the current round of training. If the unknown samples marked as black samples in the previous round are consistent with the unknown samples marked as black samples in this round, it means that the label of each unknown sample has stopped changing and the model has converged.
  • the black samples in the unknown samples in the previous round of training include unknown samples 1, unknown samples 2, unknown samples 5, and unknown samples 10.
  • the black samples in this round of training also include unknown samples 1, unknown samples 2, unknown samples 5, and unknown samples 10, indicating that the unknown samples have not changed, and the sample attribute evaluation model trained in this round has reached convergence. Use the sample attribute evaluation model obtained in this round of training as the target sample attribute evaluation model.
  • the preset convergence condition for determining whether the model has reached convergence can be set according to actual needs.
  • the above example is only an example of specific implementation, and does not constitute a limitation on this application.
  • it can also be set such that, if the proportion of unknown samples for which the previous-round black sample marking agrees with the current-round black sample marking reaches a preset proportion, this indicates that the labels of the unknown samples have essentially stopped changing and the model has converged.
  • the specific implementation can include the following steps:
  • the unknown samples whose attribute information changes are determined; the black sample concentration of the community corresponding to the unknown samples whose attribute information changes is recalculated.
  • based on the attribute evaluation result of each unknown sample in this round of training and the attribute evaluation result of each unknown sample in the previous round of training, it is possible to locate which unknown samples' attribute information has changed.
  • the change can be from a black sample attribute to an unknown attribute, or from an unknown attribute to a black sample attribute.
  • the communities of the unknown samples whose attributes have changed are located, the black sample concentration of those communities is recalculated, and the white sample sampling probabilities of the corresponding nodes of those communities are updated according to the updated black sample concentration.
  • community A includes nodes corresponding to unknown sample 1, unknown sample 2, and black sample 1.
  • in the previous round of training, the attributes of the nodes in community A had not changed, and the black sample concentration corresponding to each node in community A was 1/3.
  • the unknown sample 1 is marked as a black sample, and the remaining node attributes remain unchanged.
  • the black sample concentration corresponding to each node in the community A is updated to 2/3. In this way, the sampling probability of the white samples corresponding to the unknown samples 1 and 2 is 1/3.
  • the white sample sampling probabilities of the unknown nodes can thus be updated; white sampling based on the white sample sampling probability of each node is then performed again to obtain white samples, and the sampled white samples are combined with the known black samples for the next round of training of the sample attribute evaluation model based on the semi-supervised machine learning algorithm, until the trained sample attribute evaluation model reaches the above-mentioned preset convergence condition.
  • the target sample attribute evaluation model is obtained through training in the foregoing manner, and the model can be used to evaluate the sample attributes of the new sample to determine the evaluation result of the new sample.
  • the evaluation result includes the black sample score and/or attribute information of the new sample. Specifically, in this embodiment, the known black samples and the remaining high-trust white samples left after potential black samples have been filtered out are used for model training.
  • the resulting target sample attribute evaluation model therefore has high evaluation accuracy and can be used to evaluate the sample attributes of newly arriving samples.
  • the evaluation result may include the aforementioned black sample score, which indicates the probability that the new sample belongs to the black sample.
  • the evaluation result may also include attribute information of the new sample, for example, the black sample of the new sample has a score of 0.9, which is greater than a preset score of 0.8, and it is determined that the attribute information of the new sample is a black sample.
  • the relationship graph in the foregoing embodiment may be updated at a preset time interval, the updated relationship graph re-divided into communities, the black sample concentrations of the corresponding communities obtained, and the sample attribute evaluation model retrained, so that the model can be updated at the preset time interval.
  • the method in this embodiment can be applied to an insurance claim scenario.
  • the training sample is the insurance data of the claimant.
  • the black sample is the insurance data of known fraudsters.
  • the fraud evaluation model is obtained through the aforementioned method; after the relevant insurance data of a newly arrived claim applicant is entered into the fraud evaluation model, an evaluation score indicating that the applicant is a fraudster, or the attribute of whether the claim is fraudulent, can be obtained. In this way, the relevant personnel of the insurance company can conduct follow-up checks on suspected fraudsters based on such evaluation results, avoiding unnecessary property losses.
  • an embodiment of the present specification provides a sample attribute evaluation model training apparatus. Please refer to FIG. 3, including:
  • a first determining unit 301 configured to determine a black sample concentration of each community in a relationship diagram corresponding to a training sample, where the training sample includes a black sample and an unknown sample;
  • a second determining unit 302 configured to determine a white sample sampling probability of each of the unknown samples based on the black sample concentration of each community, and sample with the white sample sampling probability of each of the unknown samples to obtain a white sample ;
  • a training unit 303 is configured to train the black sample and the white sample based on a semi-supervised machine learning algorithm to obtain a target sample attribute evaluation model.
  • the training unit 303 is specifically configured to:
  • a sample attribute evaluation model that satisfies the preset convergence condition is used as a target sample attribute evaluation model.
  • the apparatus further includes an evaluation unit, where the evaluation unit is specifically configured to:
  • the evaluation result includes the black sample score and/or attribute information of the new sample.
  • the training sample is insurance data corresponding to a claim applicant
  • the black sample is insurance data corresponding to a fraud insurance person.
  • the first determining unit is specifically configured to:
  • the present invention further provides a server, as shown in FIG. 4, including a memory 404, a processor 402, and a computer program stored on the memory 404 and executable on the processor 402.
  • the processor 402 executes the program, the steps of any method of the sample attribute evaluation method described above are implemented.
  • in FIG. 4, a bus architecture is represented by the bus 400.
  • the bus 400 may include any number of interconnected buses and bridges.
  • the bus 400 links together various circuits, including one or more processors represented by the processor 402 and memory represented by the memory 404.
  • the bus 400 can also link various other circuits such as peripheral devices, voltage regulators, and power management circuits, which are well known in the art, and therefore, they will not be further described herein.
  • the bus interface 406 provides an interface between the bus 400 and the receiver 401 and the transmitter 403.
  • the receiver 401 and the transmitter 403 may be the same element, that is, a transceiver, providing a unit for communicating with various other devices on a transmission medium.
  • the processor 402 is responsible for managing the bus 400 and general processing, and the memory 404 may be used to store data used by the processor 402 when performing operations.
  • the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any one of the sample attribute evaluation methods described above.
  • these computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • these computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this specification provide a sample attribute evaluation method. Training samples are first determined; they include only a small number of black samples with confirmed attributes, while most are unknown samples whose attributes are unconfirmed. Based on the relationship graph corresponding to the training samples, the black sample concentration of each community is determined. By combining the community black sample concentrations with a semi-supervised machine learning algorithm, the method of this embodiment can mine potential black samples from the unknown samples even when the number of black samples is small, and then determine the white samples needed for model training, meeting the model training requirements so that the trained model can accurately evaluate whether a sample has the black sample attribute.

Description

Sample attribute evaluation model training method, apparatus, and server
Technical Field
The embodiments of this specification relate to the field of Internet technologies, and in particular to a sample attribute evaluation model training method, apparatus, and server.
Background
With the rapid development of the Internet, more and more business can be carried out online, such as online payment, online shopping, and online insurance claims. While the Internet brings convenience to people's lives, it also brings risks: malicious actors may commit electronic business fraud and cause losses to other users. For a huge set of business samples, the number of risk black samples whose attributes are confirmed as black is small, and most samples have unknown attributes. Because business fraud data samples are well hidden, in order to improve overall risk control capability there is an urgent need for a scheme that can be trained on a small number of known black samples and still accurately evaluate the attributes of unknown samples.
Summary of the Invention
The embodiments of this specification provide a sample attribute evaluation method, apparatus, and server.
In a first aspect, an embodiment of this specification provides a sample attribute evaluation method, including: determining the black sample concentration of each community in a relationship graph corresponding to training samples, where the training samples include black samples and unknown samples; based on the black sample concentration of each community, determining a white sample sampling probability for each unknown sample, and sampling with the white sample sampling probability of each unknown sample to obtain white samples; and training the black samples and the white samples based on a semi-supervised machine learning algorithm to obtain a target sample attribute evaluation model.
In a second aspect, an embodiment of this specification provides a sample attribute evaluation model training apparatus, including: a first determining unit, configured to determine the black sample concentration of each community in a relationship graph corresponding to training samples, where the training samples include black samples and unknown samples; a second determining unit, configured to determine a white sample sampling probability for each unknown sample based on the black sample concentration of each community, and to sample with the white sample sampling probability of each unknown sample to obtain white samples; and a training unit, configured to train the black samples and the white samples based on a semi-supervised machine learning algorithm to obtain a target sample attribute evaluation model.
In a third aspect, an embodiment of this specification provides a server, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the program, implements the steps of any one of the sample attribute evaluation methods described above.
In a fourth aspect, an embodiment of this specification provides a computer-readable storage medium on which a computer program is stored, and the program, when executed by a processor, implements the steps of any one of the sample attribute evaluation methods described above.
The beneficial effects of the embodiments of this specification are as follows:
In the embodiments of this specification, training samples are determined that include only a small number of black samples with confirmed attributes, while most are unknown samples whose attributes are unconfirmed. Based on the relationship graph corresponding to the training samples, the black sample concentration of each community is determined. By combining the community black sample concentrations with a semi-supervised machine learning algorithm, the method of this embodiment can mine potential black samples from the unknown samples even when the number of known black samples is small, and then determine the white samples needed for model training, meeting the model training requirements so that the trained model can accurately evaluate whether a sample has the black sample attribute.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a sample attribute evaluation application scenario according to an embodiment of this specification;
FIG. 2 is a flowchart of a sample attribute evaluation method according to the first aspect of the embodiments of this specification;
FIG. 3 is a schematic structural diagram of a sample attribute evaluation model training apparatus according to the second aspect of the embodiments of this specification;
FIG. 4 is a schematic structural diagram of a sample attribute evaluation server according to the third aspect of the embodiments of this specification.
Detailed Description of the Embodiments
For a better understanding of the above technical solutions, the technical solutions of the embodiments of this specification are described in detail below with reference to the drawings and specific embodiments. It should be understood that the embodiments of this specification and the specific features therein are detailed explanations of the technical solutions of the embodiments of this specification rather than limitations on them, and that, where there is no conflict, the embodiments of this specification and the technical features therein may be combined with each other.
Please refer to FIG. 1, a schematic diagram of a sample attribute evaluation application scenario according to an embodiment of this specification. The terminal 100 is located on the user side and communicates with the server 200 on the network side. A user can generate real-time events and business data through an APP or website on the terminal 100. The server 200 collects the real-time events generated by the terminals and can then select training samples from them. The embodiments of this specification can be applied to risk control scenarios such as risk sample identification or identification of fraudulent-claim samples in insurance claims, and can also be applied to binary classification scenarios.
In a first aspect, an embodiment of this specification provides a sample attribute evaluation method. Please refer to FIG. 2; the method includes steps S201-S203.
S201: Determine the black sample concentration of each community in a relationship graph corresponding to training samples, where the training samples include black samples and unknown samples;
S202: Based on the black sample concentration of each community, determine a white sample sampling probability for each unknown sample, and sample with the white sample sampling probability of each unknown sample to obtain white samples;
S203: Train the black samples and the white samples based on a semi-supervised machine learning algorithm to obtain a target sample attribute evaluation model.
Specifically, in this embodiment, the training samples are first determined through step S201. As described above, the training samples may be business data generated on the terminal side; they include black samples whose attributes have already been labeled and unknown samples with unknown attributes. For example, in an insurance claims scenario, the training samples are the data of users who applied for claims, where the samples corresponding to users who committed insurance fraud are determined to be black samples. In the insurance claims scenario there are few black samples with confirmed fraud and a lack of large-scale black sample labeling, which greatly reduces the accuracy of a sample attribute evaluation model, so solving the model training problem in this scenario is very important. The method of this embodiment can combine the community attributes of the samples with a semi-supervised machine learning algorithm to mine potential black samples from the large number of unknown samples, reach the number of black samples required for model training, and filter out white samples with high trust, ensuring the purity of the black and white samples during training, so that the model training is completed and a highly accurate sample attribute evaluation model is obtained.
Further, the black sample concentration of each community in the relationship graph corresponding to the training samples is determined through step S201.
Specifically, in this embodiment, a relationship graph including the training samples needs to be constructed in advance. Each training sample corresponds to one node; the constructed relationship graph may include only the nodes corresponding to the training samples, or it may be a relationship graph corresponding to the nodes of the whole network.
The graph construction process may be: obtain the historical events of each node within a predetermined time period, determine the relationship graph from the historical events according to a preset composition method, and divide the nodes in the relationship graph into communities using a preset community discovery algorithm, where each node is given the community label of the community to which it belongs. The preset time period can be specified in advance, and the preset composition method needs to define the following: the definition of nodes, the definition of edges, and the definition of edge weights.
This embodiment does not limit the specific composition rules; different scenarios and different implementations may adopt different composition rules. For example, in an insurance claims scenario, the preset composition method may be: take users as nodes, and if two users have had a financial transaction (such as a transfer) within half a year, connect the two users with an edge whose weight may be the number of transfers between the two users.
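The following is a minimal Python sketch of the composition rule just described, using networkx. The transfer-record format (user_a, user_b, timestamp), the half-year window, and the function name are illustrative assumptions rather than part of the claimed method.

```python
# Sketch: build the relationship graph from pairwise transfer records.
from datetime import datetime, timedelta
import networkx as nx

def build_relation_graph(transfers, now=None, window_days=183):
    """Users are nodes; an edge links two users who transferred money within
    the window, weighted by the number of transfers between them."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=window_days)
    g = nx.Graph()
    for a, b, ts in transfers:
        if ts < cutoff:
            continue                       # outside the predetermined time period
        if g.has_edge(a, b):
            g[a][b]["weight"] += 1         # edge weight = number of transfers
        else:
            g.add_edge(a, b, weight=1)
    return g
```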
Specifically, in this embodiment, one or more preset community discovery algorithms are run on the relationship graph constructed above, so that each node obtains the community label of the community to which it belongs. The preset community discovery method may be the label propagation algorithm (LPA, Label Propagation Algorithm) or the fast unfolding algorithm (FU, Fast Unfolding), among others; this application places no limitation here.
The label propagation algorithm flow is briefly described as follows (a minimal sketch follows the steps below):
Step 1: Each node on the graph uses its own node id as its initial label;
Step 2: Each node obtains the labels of its neighbors from those neighbors;
Step 3: After receiving the labels from all of its neighbors, each node takes the label that appears most often among the received labels as its own label (for a weighted graph, the label with the highest total weight). If several labels are tied for the most occurrences, one of them is chosen as the node's label;
Step 4: Output the label on each node as its community label.
Step 3: Repeat Step 2 until no node's label changes;
Step 4: Treat each community obtained in Step 3 as a node and repeat Step 2 until no community changes;
Step 5: Output the label on each node as its community label.
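Below is a minimal sketch of the weighted label propagation steps above, assuming the networkx graph built earlier; the tie-breaking, iteration order, and function name are simplifications, not prescribed by the patent.

```python
# Sketch: label propagation over a weighted relationship graph.
import random
from collections import defaultdict

def label_propagation(g, max_iter=20, seed=0):
    rng = random.Random(seed)
    labels = {n: n for n in g.nodes}           # Step 1: own id as initial label
    for _ in range(max_iter):
        changed = False
        nodes = list(g.nodes)
        rng.shuffle(nodes)
        for n in nodes:
            if not g[n]:
                continue
            votes = defaultdict(float)          # Step 2: collect neighbour labels
            for nbr, attrs in g[n].items():
                votes[labels[nbr]] += attrs.get("weight", 1.0)
            best = max(votes.values())          # Step 3: adopt the heaviest label
            top = [lab for lab, w in votes.items() if w == best]
            new = rng.choice(top)               # arbitrary choice among ties
            if new != labels[n]:
                labels[n], changed = new, True
        if not changed:                         # repeat until no label changes
            break
    return labels                               # node -> community label
```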
After the relationship graph has been divided into communities, the black sample concentration of each community can be calculated. The ways of determining the black sample concentration of each community include, but are not limited to, the following three:
First way: determine the first proportion of the nodes corresponding to all black samples in each community among the total nodes of that community, and use the first proportion as the black sample concentration of the community.
Second way: determine the second proportion of the nodes corresponding to all black samples in each community among the total nodes of the relationship graph, and use the second proportion as the black sample concentration of the community.
Third way: determine the third proportion of the nodes corresponding to all black samples in each community among the total nodes of that community, and the fourth proportion of the total nodes of that community among the total nodes of the relationship graph, obtain a weighted average of the third proportion and the fourth proportion, and use the weighted average as the black sample concentration of the community.
Specifically, in this embodiment, using the first way, the black sample concentration can be defined as the number of black samples in a community divided by the total number of nodes in the community. For example, community A includes 5 nodes in total, one of which is the node corresponding to a black sample; the black sample concentration of community A can then be calculated as 1/5, and the black sample concentration of every node in the community is 1/5.
Of course, the definition may also use the scale of the whole relationship graph, i.e. the second way above. For example, community A includes 5 nodes in total, one of which is the node corresponding to a black sample, and the relationship graph includes 10 nodes; the black sample concentration of community A can then be calculated as 1/10, and the black sample concentration of every node in the community is 1/10.
Of course, the setting may also combine the two dimensions of community scale and the proportion of black samples in the community, i.e. the third way above. For example, community A includes 5 nodes in total, one of which is the node corresponding to a black sample, and the relationship graph includes 10 nodes; the black sample concentration of community A can then be calculated as K1*1/5 + K2*5/10, where K1 and K2 are weighting coefficients that can be set according to actual needs, and the black sample concentration of every node in the community is 0.2*K1 + 0.5*K2. In specific implementations, the way the black sample concentration is defined can be set according to actual needs; this application places no limitation here.
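A minimal sketch of the three concentration definitions is given below. The inputs (a communities mapping, a set of black nodes, the total node count) and the parameter names are assumptions for illustration only.

```python
# Sketch: the three ways of computing a community's black sample concentration.
def black_concentration(communities, black_nodes, total_nodes,
                        k1=0.5, k2=0.5, mode="first"):
    conc = {}
    for cid, members in communities.items():
        blacks = sum(1 for n in members if n in black_nodes)
        if mode == "first":        # black nodes / nodes in the community
            conc[cid] = blacks / len(members)
        elif mode == "second":     # black nodes / nodes in the whole graph
            conc[cid] = blacks / total_nodes
        else:                      # weighted average of the two ratios (third way)
            conc[cid] = (k1 * (blacks / len(members))
                         + k2 * (len(members) / total_nodes))
    return conc
```

With community A = 5 nodes (1 black) in a 10-node graph, the three modes reproduce the 1/5, 1/10, and K1*1/5 + K2*5/10 values from the examples above.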
After the black sample concentration of each sample has been determined, step S202 determines, based on the black sample concentration of each community, a white sample sampling probability for each unknown sample, and samples with the white sample sampling probability of each unknown sample to obtain white samples.
When mining potential black samples from the unknown samples, the method of this embodiment incorporates the community attributes: even if a small portion of the unknown samples do not exhibit the true characteristics of black samples at the current moment, deep mining that uses the community characteristics can genuinely expand the proportion of black samples and meet the model training requirements. Specifically, in this embodiment, for each unknown sample in the training samples, the white sample sampling probability of the sample can be determined from its black sample concentration. For example, if unknown sample 1 is in community A and the black sample concentration of community A is 1/5, then the black sample concentration of unknown sample 1 is 1/5 and its white sample sampling probability is P1 = 1 - 1/5 = 4/5. Further, in the initial first round of training, the white sample sampling probability of each unknown sample may be set to a fixed value. For example, with 100 unknown samples, the white sample sampling probability of each unknown sample may be set to 1/100 in the first round; in subsequent rounds of training, the white sample sampling probability of an unknown sample is determined from its black sample concentration.
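The following is a minimal sketch of this sampling step: each unknown sample's white sample sampling probability is taken as 1 minus its community's black sample concentration (a uniform 1/N in the first round), and the sample is kept as a white sample with that probability. The function and argument names are illustrative assumptions.

```python
# Sketch: draw white samples from the unknown samples by sampling probability.
import random

def sample_white(unknown_samples, node_community, concentration,
                 first_round=False, seed=0):
    rng = random.Random(seed)
    n = len(unknown_samples)
    whites = []
    for s in unknown_samples:
        if first_round:
            p = 1.0 / n                                   # fixed value, e.g. 1/100
        else:
            p = 1.0 - concentration[node_community[s]]    # e.g. 1 - 1/5 = 4/5
        if rng.random() < p:
            whites.append(s)
    return whites
```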
In this way, after the white sample sampling probability of each unknown sample has been determined, the unknown samples can be sampled according to their respective white sample sampling probabilities to determine the extracted white samples; then, in step S203, the black samples whose attributes have already been labeled are combined with them, and the black samples and white samples are trained based on the semi-supervised machine learning algorithm to obtain a sample attribute evaluation model. A specific implementation may include the following steps:
training the black samples and the white samples based on the semi-supervised machine learning algorithm to obtain a sample attribute evaluation model;
judging whether the sample attribute evaluation model satisfies a preset convergence condition;
if not, updating the black sample concentration of each community and continuing training based on the updated black sample concentration of each community and the semi-supervised machine learning algorithm, until a sample attribute evaluation model obtained by training satisfies the preset convergence condition, and using the sample attribute evaluation model that satisfies the preset convergence condition as the target sample attribute evaluation model.
In this embodiment, the semi-supervised machine learning algorithm used includes a PU (Positive and Unlabeled Learning) machine learning algorithm. This is a semi-supervised machine learning algorithm in which, among the training samples used to train the machine learning model, only some of the training samples are labeled and the remaining training samples are unlabeled; the unlabeled samples are used to assist the learning process over the labeled samples. It applies to the situation where, among the training samples collected by the modeling party, only a small number of black samples are labeled and the remaining samples are unlabeled unknown samples, i.e. a machine learning process over labeled positive samples and unlabeled samples.
After the black samples and white samples have been constructed, these training samples can be trained based on the semi-supervised machine learning algorithm to build the sample attribute evaluation model. Semi-supervised machine learning algorithms can often include several machine learning strategies. For example, typical strategies include the two-stage strategy and the cost-sensitive strategy. In the two-stage strategy, the algorithm first mines potential reliable negative samples from the unlabeled samples based on the known positive samples and the unlabeled samples, and then, based on the known positive samples and the mined reliable negative samples, converts the problem into a traditional supervised machine learning process to train the classification model.
For the cost-sensitive strategy, the algorithm assumes that the proportion of positive samples among the unlabeled samples is extremely low, treats the unlabeled samples directly as negative samples, and sets a cost-sensitive weight for positive samples that is higher than that for negative samples. For example, a higher cost-sensitive weight is usually set for the loss term corresponding to positive samples in the objective function of the cost-sensitive semi-supervised machine learning algorithm. By giving positive samples a higher cost-sensitive weight, the cost of the trained classifier misclassifying a positive sample is far greater than the cost of misclassifying a negative sample; in this way a cost-sensitive classifier can be learned directly from the positive samples and the unlabeled samples (treated as negative samples) to classify unknown samples. In this embodiment, the training samples may be trained based on either the cost-sensitive semi-supervised machine learning algorithm or the two-stage strategy; in a specific implementation this can be set as needed, and this application places no limitation here.
This embodiment mainly describes the two-stage semi-supervised machine learning algorithm in detail. Taking the insurance claims scenario as an example, suppose the black samples in the above training sample set are labeled 1, indicating that the sample is insurance data known to be fraudulent, and the white samples are labeled 0, indicating that the insurance data corresponding to the training sample is normal. After a binary classification model is trained on the black samples and the white samples drawn by the white sample sampling probabilities, the sample attribute evaluation model is obtained; this sample attribute evaluation model is then used to evaluate the unknown samples, giving each unknown sample a black sample score for being labeled a black sample. The black sample score is a value in the range 0 to 1 indicating the probability that the unknown sample is a black sample. Of course, the black samples, white samples, and the corresponding black sample scores can also be defined in other ways; this application places no limitation here.
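Below is a minimal sketch of one training round of the two-stage scheme: black samples labeled 1 and sampled white samples labeled 0 are fed to a binary classifier, which then scores every unknown sample. GradientBoostingClassifier is a stand-in; the patent does not prescribe a particular classifier, and the feature matrices are assumed inputs.

```python
# Sketch: one round of the two-stage PU-style training and scoring.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_round(black_x, white_x, unknown_x, threshold=0.8):
    x = np.vstack([black_x, white_x])
    y = np.concatenate([np.ones(len(black_x)), np.zeros(len(white_x))])
    model = GradientBoostingClassifier().fit(x, y)
    scores = model.predict_proba(unknown_x)[:, 1]   # black sample score in [0, 1]
    labels = (scores > threshold).astype(int)       # mark as black above the preset score
    return model, scores, labels
```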
The training samples are trained in this manner for multiple rounds, and a corresponding sample attribute evaluation model is obtained after each round of training. It is necessary to judge whether the sample attribute evaluation model satisfies the preset convergence condition: if the model has converged, the sample attribute evaluation model obtained in that round of training is used as the final target sample attribute evaluation model; if the model has not yet converged, the black sample concentration of each unknown sample is updated and training continues in the manner described above until the trained model reaches the convergence condition.
In this embodiment, judging whether the model has converged can be implemented through the following steps:
evaluating each unknown sample based on the sample attribute evaluation model to obtain the current-round attribute evaluation result of each unknown sample, obtaining M current-round attribute evaluation results in total, where M is the number of unknown samples;
judging, based on the M current-round attribute evaluation results and the M previous-round attribute evaluation results, whether the sample attribute evaluation model satisfies the preset convergence condition.
Here, evaluating each unknown sample based on the sample attribute evaluation model to obtain the current-round attribute evaluation result of each unknown sample includes:
evaluating each unknown sample based on the sample attribute evaluation model to obtain the black sample score of each unknown sample, and, if the black sample score is greater than a preset score, marking the attribute information of that unknown sample as a black sample, where the current-round attribute evaluation result of each unknown sample includes the attribute information of that unknown sample.
Specifically, in this embodiment, each round of training yields the sample attribute evaluation model corresponding to that round; the model is used to obtain the black sample score of each unknown sample, and the unknown sample can be marked according to the score. Specifically, if the black sample score is greater than the preset score, the attribute information of the unknown sample is marked as a black sample. For example, with the preset score set to 0.8, the black sample score of unknown sample 1 is 0.9, so the attribute information of unknown sample 1 is marked as a black sample; the black sample score of unknown sample 2 is 0.4, so the attribute information of unknown sample 2 remains unchanged and it is still a sample with unknown attributes. In this way, the attribute evaluation result of each unknown sample in that round of training can be determined, and the evaluation result may include the black sample score and the attribute information of the unknown sample in that round of training. With M unknown samples, M current-round attribute evaluation results are obtained.
Further, the attribute evaluation results corresponding to the unknown samples in the previous round of training, i.e. the M previous-round attribute evaluation results, can also be obtained. From the M current-round attribute evaluation results and the M previous-round attribute evaluation results, whether the sample attribute evaluation model satisfies the preset convergence condition can be judged, which may specifically be implemented through the following step:
judging whether the attribute information in the current-round attribute evaluation result of each unknown sample is consistent with the attribute information in the previous-round attribute evaluation result of that unknown sample; if so, this indicates that the current-round sample attribute evaluation model satisfies the preset convergence condition.
Specifically, in this embodiment, the attribute information in each of the M previous-round attribute evaluation results is used to determine which unknown samples were marked as black samples in the previous round of training, and the attribute information in each of the M current-round attribute evaluation results is used to determine which unknown samples are marked as black samples in the current round of training. If the unknown samples marked as black samples in the previous round are consistent with the unknown samples marked as black samples in the current round, it means the label of every unknown sample has stopped changing and the model has converged. For example, the black samples among the unknown samples in the previous round of training include unknown sample 1, unknown sample 2, unknown sample 5, and unknown sample 10, and the black samples in the current round of training also include unknown sample 1, unknown sample 2, unknown sample 5, and unknown sample 10; this indicates that the unknown samples have not changed and the sample attribute evaluation model trained in this round has converged. The sample attribute evaluation model obtained in this round of training is used as the target sample attribute evaluation model.
Further, the preset convergence condition used to decide whether the model has converged can be set according to actual needs; the above example is only one specific implementation and does not limit this application. For example, it may also be set such that if the proportion of unknown samples for which the previous-round black sample marking agrees with the current-round black sample marking reaches a preset proportion, this indicates that the labels of the unknown samples have essentially stopped changing and the model has converged.
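A minimal sketch of the convergence test is shown below: compare which unknown samples were marked black in this round against the previous round; identical sets (or an agreement ratio above a preset value) mean the model has converged. The representation of labels as 0/1 lists indexed by unknown sample is an assumption.

```python
# Sketch: check whether black-sample markings stopped changing between rounds.
def converged(prev_labels, curr_labels, ratio=1.0):
    prev_black = {i for i, lab in enumerate(prev_labels) if lab == 1}
    curr_black = {i for i, lab in enumerate(curr_labels) if lab == 1}
    if not prev_black and not curr_black:
        return True
    agree = len(prev_black & curr_black)
    return agree >= ratio * max(len(prev_black), len(curr_black))
```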
Further, if it is determined that the sample attribute evaluation model obtained in the current round of training does not satisfy the preset convergence condition, the model has not yet converged and another round of training is needed. Before the next round of training, because the marked black samples have changed relative to the previous round, the black sample concentration of the communities needs to be updated according to the marked black samples, and the black sample concentration of each unknown sample updated accordingly, which may specifically include the following steps:
determining, based on the M current-round attribute evaluation results and the M previous-round attribute evaluation results, the unknown samples whose attribute information has changed; and recalculating the black sample concentration of the communities corresponding to the unknown samples whose attribute information has changed.
Specifically, in this embodiment, based on the attribute evaluation result of each unknown sample in the current round of training and the attribute evaluation result of each unknown sample in the previous round of training, the unknown samples whose attribute information has changed can be located; the change may be from a black sample attribute to an unknown attribute or from an unknown attribute to a black sample attribute. The communities of the unknown samples whose attributes have changed are then located, the black sample concentration of those communities is recalculated, and the white sample sampling probabilities of the corresponding nodes of those communities are updated according to the updated black sample concentration. For example, community A includes the nodes corresponding to unknown sample 1, unknown sample 2, and black sample 1. In the previous round of training the attributes of the nodes in community A had not changed, and the black sample concentration corresponding to each node in community A was 1/3. In this round of training unknown sample 1 is marked as a black sample while the attributes of the other nodes are unchanged, so the black sample concentration corresponding to each node in community A is updated to 2/3, and the white sample sampling probabilities corresponding to unknown sample 1 and unknown sample 2 are both 1/3.
In this way, the white sample sampling probabilities of the unknown nodes can be updated, white sampling is then performed again according to the white sample sampling probability of each node to obtain white samples, and the sampled white samples are combined with the known black samples for the next round of training of the sample attribute evaluation model based on the semi-supervised machine learning algorithm, until the trained sample attribute evaluation model reaches the preset convergence condition described above.
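A minimal sketch of the update step follows: locate the unknown samples whose attributes changed between rounds and recompute the black sample concentration (using the first definition here) only for their communities. The inputs and names are assumptions.

```python
# Sketch: recompute concentration only for communities touched by label changes.
def update_concentration(changed_samples, node_community, communities,
                         black_nodes, concentration):
    touched = {node_community[s] for s in changed_samples}
    for cid in touched:
        members = communities[cid]
        blacks = sum(1 for n in members if n in black_nodes)
        concentration[cid] = blacks / len(members)   # e.g. 1/3 -> 2/3 in the example
    return concentration
```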
Further, once the target sample attribute evaluation model has been obtained through training in the above manner, the model can be used to evaluate the sample attributes of new samples and determine the evaluation result of a new sample, where the evaluation result includes the black sample score and/or the attribute information of the new sample. Specifically, in this embodiment, the known black samples and the remaining high-trust white samples left after the potential black samples have been filtered out are used for model training, so the resulting target sample attribute evaluation model has high evaluation accuracy and can evaluate the sample attributes of new samples. The evaluation result may include the aforementioned black sample score, indicating the probability that the new sample is a black sample; it may also include the attribute information of the new sample, for example, the black sample score of the new sample is 0.9, which is greater than the preset score of 0.8, so the attribute information of the new sample is determined to be a black sample. From this evaluation result, the relevant personnel can learn the attribute of the new sample in time and carry out timely risk control.
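A short sketch of applying the converged target model to a new sample is given below; `model` is the classifier returned by the training sketch above, the new sample is an assumed 1-D numpy feature vector, and 0.8 is the illustrative preset score from the text.

```python
# Sketch: score a new sample with the target sample attribute evaluation model.
def evaluate_new_sample(model, new_x, threshold=0.8):
    score = model.predict_proba(new_x.reshape(1, -1))[0, 1]   # black sample score
    attribute = "black" if score > threshold else "unknown"
    return score, attribute
```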
Further, in this embodiment, because the relationships between nodes change over time, the relationship graph in the foregoing embodiments can be updated at a preset time interval, the updated relationship graph can be re-divided into communities, the black sample concentrations of the corresponding communities obtained, and the sample attribute evaluation model retrained, so that the model can be updated at the preset time interval.
The method of this embodiment can be applied to an insurance claims scenario: the training samples are the insurance data of claim applicants, the black samples are the insurance data of known fraudsters, and a fraud evaluation model is obtained in the manner described above. After the relevant insurance data of a new claim applicant is entered into the fraud evaluation model, an evaluation score indicating that the new applicant is a fraudster, or the attribute of whether the claim is fraudulent, can be obtained. In this way, the relevant personnel of the insurance company can carry out follow-up checks on suspected fraudsters based on such evaluation results, avoiding unnecessary property losses.
In a second aspect, based on the same inventive concept, an embodiment of this specification provides a sample attribute evaluation model training apparatus. Please refer to FIG. 3; the apparatus includes:
a first determining unit 301, configured to determine the black sample concentration of each community in a relationship graph corresponding to training samples, where the training samples include black samples and unknown samples;
a second determining unit 302, configured to determine a white sample sampling probability for each unknown sample based on the black sample concentration of each community, and to sample with the white sample sampling probability of each unknown sample to obtain white samples;
a training unit 303, configured to train the black samples and the white samples based on a semi-supervised machine learning algorithm to obtain a target sample attribute evaluation model.
In an optional implementation, the training unit 303 is specifically configured to:
train the black samples and the white samples based on the semi-supervised machine learning algorithm to obtain a sample attribute evaluation model;
judge whether the sample attribute evaluation model satisfies a preset convergence condition;
if not, update the black sample concentration of each community, and continue training based on the updated black sample concentration of each community and the semi-supervised machine learning algorithm until a sample attribute evaluation model obtained by training satisfies the preset convergence condition, and use the sample attribute evaluation model that satisfies the preset convergence condition as the target sample attribute evaluation model.
In an optional implementation, the training unit 303 is specifically configured to:
evaluate each unknown sample based on the sample attribute evaluation model to obtain the current-round attribute evaluation result of each unknown sample, obtaining M current-round attribute evaluation results in total, where M is the number of unknown samples;
judge, based on the M current-round attribute evaluation results and the M previous-round attribute evaluation results, whether the sample attribute evaluation model satisfies the preset convergence condition.
In an optional implementation, the training unit 303 is specifically configured to:
evaluate each unknown sample based on the sample attribute evaluation model to obtain the black sample score of each unknown sample, and, if the black sample score is greater than a preset score, mark the attribute information of that unknown sample as a black sample, where the current-round attribute evaluation result of each unknown sample includes the attribute information of that unknown sample.
In an optional implementation, the training unit 303 is specifically configured to:
judge whether the attribute information in the current-round attribute evaluation result of each unknown sample is consistent with the attribute information in the previous-round attribute evaluation result of that unknown sample; if so, this indicates that the current-round sample attribute evaluation model satisfies the preset convergence condition.
In an optional implementation, the training unit 303 is specifically configured to:
determine, based on the M current-round attribute evaluation results and the M previous-round attribute evaluation results, the unknown samples whose attribute information has changed;
recalculate the black sample concentration of the communities corresponding to the unknown samples whose attribute information has changed.
In an optional implementation, the apparatus further includes an evaluation unit, and the evaluation unit is specifically configured to:
after the sample attribute evaluation model that satisfies the preset convergence condition is used as the target sample attribute evaluation model, evaluate a new sample according to the target sample attribute evaluation model and determine the evaluation result of the new sample, where the evaluation result includes the black sample score and/or attribute information of the new sample.
In an optional implementation, the training samples are the insurance data corresponding to claim applicants, and the black samples are the insurance data corresponding to fraudsters.
In an optional implementation, the first determining unit is specifically configured to:
determine the first proportion of the nodes corresponding to all black samples in each community among the total nodes of that community, and use the first proportion as the black sample concentration of the community; or
determine the second proportion of the nodes corresponding to all black samples in each community among the total nodes of the relationship graph, and use the second proportion as the black sample concentration of the community; or
determine the third proportion of the nodes corresponding to all black samples in each community among the total nodes of that community, and the fourth proportion of the total nodes of that community among the total nodes of the relationship graph, obtain a weighted average of the third proportion and the fourth proportion, and use the weighted average as the black sample concentration of the community.
In a third aspect, based on the same inventive concept as the sample attribute evaluation method in the foregoing embodiments, the present invention further provides a server, as shown in FIG. 4, including a memory 404, a processor 402, and a computer program stored on the memory 404 and executable on the processor 402, where the processor 402, when executing the program, implements the steps of any one of the sample attribute evaluation methods described above.
In FIG. 4, a bus architecture is represented by the bus 400. The bus 400 may include any number of interconnected buses and bridges; it links together various circuits, including one or more processors represented by the processor 402 and memory represented by the memory 404. The bus 400 may also link together various other circuits such as peripheral devices, voltage regulators, and power management circuits, which are well known in the art and therefore are not described further here. The bus interface 406 provides an interface between the bus 400 and the receiver 401 and the transmitter 403. The receiver 401 and the transmitter 403 may be the same element, i.e. a transceiver, providing a unit for communicating with various other apparatuses over a transmission medium. The processor 402 is responsible for managing the bus 400 and general processing, while the memory 404 may be used to store data used by the processor 402 when performing operations.
In a fourth aspect, based on the same inventive concept as the sample attribute evaluation in the foregoing embodiments, the present invention further provides a computer-readable storage medium on which a computer program is stored, and the program, when executed by a processor, implements the steps of any one of the sample attribute evaluation methods described above.
This specification is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to the embodiments of this specification. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of this specification have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of this specification.
Obviously, those skilled in the art can make various changes and variations to this specification without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of this specification and their technical equivalents, this specification is also intended to include them.

Claims (19)

  1. A sample attribute model training method, comprising:
    determining the black sample concentration of each community in a relationship graph corresponding to training samples, wherein the training samples comprise black samples and unknown samples;
    based on the black sample concentration of each community, determining a white sample sampling probability for each of the unknown samples, and sampling with the white sample sampling probability of each of the unknown samples to obtain white samples;
    training the black samples and the white samples based on a semi-supervised machine learning algorithm to obtain a target sample attribute evaluation model.
  2. The method according to claim 1, wherein determining the black sample concentration of each community in the relationship graph corresponding to the training samples comprises:
    determining a first proportion of the nodes corresponding to all black samples in each community among the total nodes of that community, and using the first proportion as the black sample concentration of the community; or
    determining a second proportion of the nodes corresponding to all black samples in each community among the total nodes of the relationship graph, and using the second proportion as the black sample concentration of the community; or
    determining a third proportion of the nodes corresponding to all black samples in each community among the total nodes of that community, and a fourth proportion of the total nodes of that community among the total nodes of the relationship graph, obtaining a weighted average of the third proportion and the fourth proportion, and using the weighted average as the black sample concentration of the community.
  3. The method according to claim 1, wherein training the black samples and the white samples based on the semi-supervised machine learning algorithm to obtain the target sample attribute evaluation model comprises:
    training the black samples and the white samples based on the semi-supervised machine learning algorithm to obtain a sample attribute evaluation model;
    judging whether the sample attribute evaluation model satisfies a preset convergence condition;
    if not, updating the black sample concentration of each community, and continuing training based on the updated black sample concentration of each community and the semi-supervised machine learning algorithm until a sample attribute evaluation model obtained by training satisfies the preset convergence condition, and using the sample attribute evaluation model that satisfies the preset convergence condition as the target sample attribute evaluation model.
  4. The method according to claim 3, wherein judging whether the sample attribute evaluation model satisfies the preset convergence condition comprises:
    evaluating each of the unknown samples based on the sample attribute evaluation model to obtain a current-round attribute evaluation result of each of the unknown samples, obtaining M current-round attribute evaluation results in total, wherein M is the number of unknown samples;
    judging, based on the M current-round attribute evaluation results and M previous-round attribute evaluation results, whether the sample attribute evaluation model satisfies the preset convergence condition.
  5. The method according to claim 4, wherein evaluating each of the unknown samples based on the sample attribute evaluation model to obtain the current-round attribute evaluation result of each of the unknown samples comprises:
    evaluating each of the unknown samples based on the sample attribute evaluation model to obtain a black sample score of each of the unknown samples, and, if the black sample score is greater than a preset score, marking attribute information of that unknown sample as a black sample, wherein the current-round attribute evaluation result of each of the unknown samples comprises the attribute information of that unknown sample.
  6. The method according to claim 5, wherein judging, based on the M current-round attribute evaluation results and the M previous-round attribute evaluation results, whether the sample attribute evaluation model satisfies the preset convergence condition comprises:
    judging whether the attribute information in the current-round attribute evaluation result of each unknown sample is consistent with the attribute information in the previous-round attribute evaluation result of that unknown sample, and, if so, indicating that the current-round sample attribute evaluation model satisfies the preset convergence condition.
  7. The method according to claim 5, wherein updating the black sample concentration of each community comprises:
    determining, based on the M current-round attribute evaluation results and the M previous-round attribute evaluation results, unknown samples whose attribute information has changed;
    recalculating the black sample concentration of the communities corresponding to the unknown samples whose attribute information has changed.
  8. The method according to any one of claims 1-7, wherein the training samples are insurance data corresponding to claim applicants, and the black samples are insurance data corresponding to fraudsters.
  9. A sample attribute evaluation method, comprising:
    evaluating a new sample with a target sample attribute evaluation model trained by the method according to any one of claims 1-7, and determining an evaluation result of the new sample, wherein the evaluation result comprises a black sample score and/or attribute information of the new sample.
  10. A sample attribute evaluation model training apparatus, comprising:
    a first determining unit, configured to determine the black sample concentration of each community in a relationship graph corresponding to training samples, wherein the training samples comprise black samples and unknown samples;
    a second determining unit, configured to determine a white sample sampling probability for each of the unknown samples based on the black sample concentration of each community, and to sample with the white sample sampling probability of each of the unknown samples to obtain white samples;
    a training unit, configured to train the black samples and the white samples based on a semi-supervised machine learning algorithm to obtain a target sample attribute evaluation model.
  11. The apparatus according to claim 10, wherein the first determining unit is specifically configured to:
    determine a first proportion of the nodes corresponding to all black samples in each community among the total nodes of that community, and use the first proportion as the black sample concentration of the community; or
    determine a second proportion of the nodes corresponding to all black samples in each community among the total nodes of the relationship graph, and use the second proportion as the black sample concentration of the community; or
    determine a third proportion of the nodes corresponding to all black samples in each community among the total nodes of that community, and a fourth proportion of the total nodes of that community among the total nodes of the relationship graph, obtain a weighted average of the third proportion and the fourth proportion, and use the weighted average as the black sample concentration of the community.
  12. The apparatus according to claim 10, wherein the training unit is specifically configured to:
    train the black samples and the white samples based on the semi-supervised machine learning algorithm to obtain a sample attribute evaluation model;
    judge whether the sample attribute evaluation model satisfies a preset convergence condition;
    if not, update the black sample concentration of each community, and continue training based on the updated black sample concentration of each community and the semi-supervised machine learning algorithm until a sample attribute evaluation model obtained by training satisfies the preset convergence condition, and use the sample attribute evaluation model that satisfies the preset convergence condition as the target sample attribute evaluation model.
  13. The apparatus according to claim 12, wherein the training unit is specifically configured to:
    evaluate each of the unknown samples based on the sample attribute evaluation model to obtain a current-round attribute evaluation result of each of the unknown samples, obtaining M current-round attribute evaluation results in total, wherein M is the number of unknown samples;
    judge, based on the M current-round attribute evaluation results and M previous-round attribute evaluation results, whether the sample attribute evaluation model satisfies the preset convergence condition.
  14. The apparatus according to claim 13, wherein the training unit is specifically configured to:
    evaluate each of the unknown samples based on the sample attribute evaluation model to obtain a black sample score of each of the unknown samples, and, if the black sample score is greater than a preset score, mark attribute information of that unknown sample as a black sample, wherein the current-round attribute evaluation result of each of the unknown samples comprises the attribute information of that unknown sample.
  15. The apparatus according to claim 14, wherein the training unit is specifically configured to:
    judge whether the attribute information in the current-round attribute evaluation result of each unknown sample is consistent with the attribute information in the previous-round attribute evaluation result of that unknown sample, and, if so, indicate that the current-round sample attribute evaluation model satisfies the preset convergence condition.
  16. The apparatus according to claim 14, wherein the training unit is specifically configured to:
    determine, based on the M current-round attribute evaluation results and the M previous-round attribute evaluation results, unknown samples whose attribute information has changed;
    recalculate the black sample concentration of the communities corresponding to the unknown samples whose attribute information has changed.
  17. A sample attribute evaluation apparatus, comprising:
    an evaluation unit, configured to evaluate a new sample with a target sample attribute evaluation model trained by the apparatus according to any one of claims 10-16, and to determine an evaluation result of the new sample, wherein the evaluation result comprises a black sample score and/or attribute information of the new sample.
  18. A server, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1-9.
  19. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1-9.
PCT/CN2019/096287 2018-08-31 2019-07-17 样本属性评估模型训练方法、装置及服务器 WO2020042795A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811015607.1A CN109325525A (zh) 2018-08-31 2018-08-31 样本属性评估模型训练方法、装置及服务器
CN201811015607.1 2018-08-31

Publications (1)

Publication Number Publication Date
WO2020042795A1 true WO2020042795A1 (zh) 2020-03-05

Family

ID=65263715

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/096287 WO2020042795A1 (zh) 2018-08-31 2019-07-17 样本属性评估模型训练方法、装置及服务器

Country Status (3)

Country Link
CN (1) CN109325525A (zh)
TW (1) TWI726341B (zh)
WO (1) WO2020042795A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709833A (zh) * 2020-06-16 2020-09-25 中国银行股份有限公司 用户信用的评估方法及装置
CN111931912A (zh) * 2020-08-07 2020-11-13 北京推想科技有限公司 网络模型的训练方法及装置,电子设备及存储介质
CN112231929A (zh) * 2020-11-02 2021-01-15 北京空间飞行器总体设计部 一种基于轨道参数的评估场景大样本生成方法
CN113343051A (zh) * 2021-06-04 2021-09-03 全球能源互联网研究院有限公司 一种异常sql检测模型构建方法及检测方法
CN113779150A (zh) * 2021-09-14 2021-12-10 杭州数梦工场科技有限公司 一种数据质量评估方法及装置
CN116579651A (zh) * 2023-05-11 2023-08-11 中国矿业报社 一种基于半监督学习的矿业项目评价方法

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325525A (zh) * 2018-08-31 2019-02-12 阿里巴巴集团控股有限公司 样本属性评估模型训练方法、装置及服务器
CN110020670B (zh) * 2019-03-07 2023-07-18 创新先进技术有限公司 一种模型迭代方法、装置及设备
CN110311902B (zh) * 2019-06-21 2022-04-22 北京奇艺世纪科技有限公司 一种异常行为的识别方法、装置及电子设备
CN110335140B (zh) * 2019-06-27 2021-09-24 上海淇馥信息技术有限公司 基于社交关系预测贷款黑中介的方法、装置、电子设备
CN110807643A (zh) * 2019-10-11 2020-02-18 支付宝(杭州)信息技术有限公司 一种用户信任评估方法、装置及设备
US11775822B2 (en) * 2020-05-28 2023-10-03 Macronix International Co., Ltd. Classification model training using diverse training source and inference engine using same
CN111881289B (zh) * 2020-06-10 2023-09-08 北京启明星辰信息安全技术有限公司 分类模型的训练方法、数据风险类别的检测方法及装置
JP7062747B1 (ja) * 2020-12-25 2022-05-06 楽天グループ株式会社 情報処理装置、情報処理方法およびプログラム
TWI771098B (zh) * 2021-07-08 2022-07-11 國立陽明交通大學 路側單元之雷達系統之狀態之錯誤診斷系統及方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7983490B1 (en) * 2007-12-20 2011-07-19 Thomas Cecil Minter Adaptive Bayes pattern recognition
CN106960154A (zh) * 2017-03-30 2017-07-18 兴华永恒(北京)科技有限责任公司 一种基于决策树模型的恶意程序动态识别方法
CN107798390A (zh) * 2017-11-22 2018-03-13 阿里巴巴集团控股有限公司 一种机器学习模型的训练方法、装置以及电子设备
CN109325525A (zh) * 2018-08-31 2019-02-12 阿里巴巴集团控股有限公司 样本属性评估模型训练方法、装置及服务器

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9053391B2 (en) * 2011-04-12 2015-06-09 Sharp Laboratories Of America, Inc. Supervised and semi-supervised online boosting algorithm in machine learning framework
US8838688B2 (en) * 2011-05-31 2014-09-16 International Business Machines Corporation Inferring user interests using social network correlation and attribute correlation
US9754215B2 (en) * 2012-12-17 2017-09-05 Sinoeast Concept Limited Question classification and feature mapping in a deep question answering system
CN105468742B (zh) * 2015-11-25 2018-11-20 小米科技有限责任公司 恶意订单识别方法及装置
CN107273454B (zh) * 2017-05-31 2020-11-03 北京京东尚科信息技术有限公司 用户数据分类方法、装置、服务器和计算机可读存储介质
CN107368892B (zh) * 2017-06-07 2020-06-16 无锡小天鹅电器有限公司 基于机器学习的模型训练方法和装置
CN107730262B (zh) * 2017-10-23 2021-09-24 创新先进技术有限公司 一种欺诈识别方法和装置
CN108334647A (zh) * 2018-04-12 2018-07-27 阿里巴巴集团控股有限公司 保险欺诈识别的数据处理方法、装置、设备及服务器

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7983490B1 (en) * 2007-12-20 2011-07-19 Thomas Cecil Minter Adaptive Bayes pattern recognition
CN106960154A (zh) * 2017-03-30 2017-07-18 兴华永恒(北京)科技有限责任公司 一种基于决策树模型的恶意程序动态识别方法
CN107798390A (zh) * 2017-11-22 2018-03-13 阿里巴巴集团控股有限公司 一种机器学习模型的训练方法、装置以及电子设备
CN109325525A (zh) * 2018-08-31 2019-02-12 阿里巴巴集团控股有限公司 样本属性评估模型训练方法、装置及服务器

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709833A (zh) * 2020-06-16 2020-09-25 中国银行股份有限公司 用户信用的评估方法及装置
CN111709833B (zh) * 2020-06-16 2023-10-31 中国银行股份有限公司 用户信用的评估方法及装置
CN111931912A (zh) * 2020-08-07 2020-11-13 北京推想科技有限公司 网络模型的训练方法及装置,电子设备及存储介质
CN112231929A (zh) * 2020-11-02 2021-01-15 北京空间飞行器总体设计部 一种基于轨道参数的评估场景大样本生成方法
CN112231929B (zh) * 2020-11-02 2024-04-02 北京空间飞行器总体设计部 一种基于轨道参数的评估场景大样本生成方法
CN113343051A (zh) * 2021-06-04 2021-09-03 全球能源互联网研究院有限公司 一种异常sql检测模型构建方法及检测方法
CN113343051B (zh) * 2021-06-04 2024-04-16 全球能源互联网研究院有限公司 一种异常sql检测模型构建方法及检测方法
CN113779150A (zh) * 2021-09-14 2021-12-10 杭州数梦工场科技有限公司 一种数据质量评估方法及装置
CN116579651A (zh) * 2023-05-11 2023-08-11 中国矿业报社 一种基于半监督学习的矿业项目评价方法
CN116579651B (zh) * 2023-05-11 2023-11-10 中国矿业报社 一种基于半监督学习的矿业项目评价方法

Also Published As

Publication number Publication date
CN109325525A (zh) 2019-02-12
TWI726341B (zh) 2021-05-01
TW202011285A (zh) 2020-03-16

Similar Documents

Publication Publication Date Title
WO2020042795A1 (zh) 样本属性评估模型训练方法、装置及服务器
TWI712981B (zh) 風險辨識模型訓練方法、裝置及伺服器
AU2021200434B2 (en) Optimizing Neural Networks For Risk Assessment
WO2023065545A1 (zh) 风险预测方法、装置、设备及存储介质
CN107633265B (zh) 用于优化信用评估模型的数据处理方法及装置
US10346782B2 (en) Adaptive augmented decision engine
TW201923624A (zh) 一種資料樣本標籤處理方法及裝置
WO2019218748A1 (zh) 一种保险业务风险预测的处理方法、装置及处理设备
CN113722493B (zh) 文本分类的数据处理方法、设备、存储介质
CN112580733B (zh) 分类模型的训练方法、装置、设备以及存储介质
KR102144126B1 (ko) 기업을 위한 정보 제공 장치 및 방법
WO2020253038A1 (zh) 一种模型构建方法及装置
CN113221104B (zh) 用户异常行为的检测方法及用户行为重构模型的训练方法
WO2019019346A1 (zh) 资产配置策略获取方法、装置、计算机设备和存储介质
WO2023045691A1 (zh) 对象识别方法、装置、电子设备及存储介质
CN112257959A (zh) 用户风险预测方法、装置、电子设备及存储介质
CN113240177B (zh) 训练预测模型的方法、预测方法、装置、电子设备及介质
KR102502271B1 (ko) 인공지능을 기반으로 한 특허 평가 방법
CN114037052A (zh) 检测模型的训练方法、装置、电子设备及存储介质
US20210279824A1 (en) Property Valuation Model and Visualization
CN115952438B (zh) 社交平台用户属性预测方法、***、移动设备及存储介质
CN116307078A (zh) 账户标签预测方法、装置、存储介质及电子设备
KR102519878B1 (ko) 금융기관 신용공여 사업에서의 인공지능 기반 리스크 관리 솔루션을 제공하기 위한 장치, 방법 및 명령을 기록한 기록 매체
CN112364258B (zh) 基于图谱的推荐方法、***、存储介质及电子设备
CN115099366A (zh) 分类预测方法、装置和电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19855817

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19855817

Country of ref document: EP

Kind code of ref document: A1