WO2019184119A1 - Risk model training method and apparatus, risk identification method and apparatus, device, and medium - Google Patents

Info

Publication number
WO2019184119A1
Authority
WO
WIPO (PCT)
Prior art keywords
training data
original
risk
target
risk model
Prior art date
Application number
PCT/CN2018/094183
Other languages
French (fr)
Chinese (zh)
Inventor
金戈
徐亮
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019184119A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/08 Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G06Q10/083 Shipping
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/40 Business processes related to the transportation industry

Definitions

  • The present application relates to the field of data prediction, and in particular, to a risk model training method and apparatus, a risk identification method and apparatus, a device, and a medium.
  • the risk model based on the transportation industry is mainly used to identify the risk of transport objects, especially for training and identifying the crime risk of transport objects.
  • The model factors used by existing transportation-industry risk models have little effect on the model.
  • Existing risk models include model factors such as travel time, travel location, gender, date of birth, and document type. These model factors are few in number and carry little risk-related information, so a risk model trained only on them has low recognition efficiency and low identification accuracy.
  • The embodiments of the present application provide a risk model training method and apparatus, a risk identification method and apparatus, a device, and a medium, so as to solve the problem that current risk models have low recognition efficiency and low accuracy.
  • the embodiment of the present application provides a risk model training method, including:
  • the historical travel data is marked with risk values to obtain the original training data
  • the decision tree algorithm is used to train the target training data in the training set to obtain the original risk model.
  • the original risk model is tested using a test set to obtain a target risk model.
  • the embodiment of the present application provides a risk model training apparatus, including:
  • the original training data acquisition module is configured to perform risk value labeling on historical travel data to obtain original training data
  • a target training data acquiring module configured to perform peer analysis and port drift analysis on the original training data, and obtain target training data
  • a target training data dividing module configured to split the target training data according to a preset time to obtain a training set and a test set
  • the original risk model acquisition module is configured to train the target training data in the training set by using a decision tree algorithm to obtain the original risk model
  • the target risk model acquisition module is used to test the original risk model by using the test set to obtain the target risk model.
  • An embodiment of the present application provides a computer device including a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer readable instructions:
  • the historical travel data is marked with risk values to obtain the original training data
  • the decision tree algorithm is used to train the target training data in the training set to obtain the original risk model.
  • the original risk model is tested using a test set to obtain a target risk model.
  • Embodiments of the present application provide one or more non-volatile readable storage media storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • the historical travel data is marked with risk values to obtain the original training data
  • the decision tree algorithm is used to train the target training data in the training set to obtain the original risk model.
  • the original risk model is tested using a test set to obtain a target risk model.
  • the embodiment of the present application provides a risk identification method, including:
  • the target risk model is a model obtained by using the risk model training method.
  • the embodiment of the present application provides a risk identification apparatus, including:
  • a to-be-identified data acquisition module configured to obtain the to-be-identified travel data
  • a risk identification result obtaining module configured to input the to-be-identified travel data into the target risk model for identification, and obtain a risk identification result
  • the target risk model is a model obtained by using the risk model training method.
  • An embodiment of the present application provides a computer device including a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer readable instructions:
  • the target risk model is a model obtained by using the risk model training method.
  • Embodiments of the present application provide one or more non-volatile readable storage media storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • the target risk model is a model obtained by using the risk model training method.
  • FIG. 1 is a flowchart of a risk model training method provided in Embodiment 1 of the present application.
  • FIG. 2 is a specific schematic view of step S12 of Figure 1;
  • FIG. 3 is a specific schematic view of step S121 of Figure 2;
  • FIG. 4 is a specific schematic view of step S122 of Figure 2;
  • FIG. 5 is a specific schematic view of step S14 of Figure 1;
  • FIG. 6 is a schematic block diagram of a risk model training apparatus provided in Embodiment 2 of the present application.
  • FIG. 7 is a flowchart of a risk identification method provided in Embodiment 3 of the present application.
  • FIG. 8 is a schematic block diagram of a risk identification device provided in Embodiment 4 of the present application.
  • FIG. 9 is a schematic diagram of a computer device provided in Embodiment 6 of the present application.
  • FIG. 1 shows a flow chart of a risk model training method in this embodiment.
  • The risk model training method can be applied to computer equipment of a judicial institution or other institutions, so that the trained risk model can be used to identify transport objects (such as passengers) on a means of transport. This can effectively assist the business party in analyzing the risk level of a transport object so as to ensure the safety of the other transport objects on the means of transport.
  • the risk model training method includes the following steps:
  • S11 Perform risk value labeling on historical travel data to obtain original training data.
  • the historical travel data is the travel data of the transport object obtained from the business party.
  • the historical travel data includes, but is not limited to, travel time, gender, age, check status, and travel location.
  • the original training data is the training data after the risk value is marked on the historical travel data.
  • the historical travel data includes historical travel data of a low risk object and historical travel data of a high risk object.
  • The risk value includes a high risk value and a low risk value. That is, the historical travel data of low risk objects is labeled with the low risk value and the historical travel data of high risk objects is labeled with the high risk value to obtain the original training data. Each piece of original training data includes historical travel data and its corresponding risk value.
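  • The labeling in step S11 can be illustrated with a minimal sketch. The record fields, the numeric codes for the two risk values, and the `high_risk_ids` lookup are hypothetical names chosen for illustration; the source only states that each piece of original training data pairs historical travel data with a risk value.

```python
from dataclasses import dataclass

HIGH_RISK = 1  # assumed numeric code for the high risk value
LOW_RISK = 0   # assumed numeric code for the low risk value


@dataclass
class TravelRecord:
    # Hypothetical subset of the fields listed for historical travel data.
    traveler_id: str
    travel_month: int
    location: str


def label_records(records, high_risk_ids):
    """Pair each historical travel record with its risk value."""
    return [
        (rec, HIGH_RISK if rec.traveler_id in high_risk_ids else LOW_RISK)
        for rec in records
    ]


records = [TravelRecord("A", 5, "P1"), TravelRecord("B", 6, "P2")]
original_training_data = label_records(records, high_risk_ids={"A"})
```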
  • S12 Perform peer analysis and port drift analysis on the original training data to obtain target training data.
  • the target training data is the data required for the model training.
  • Peer analysis refers to a specialized analysis of the behavioral characteristics of groups that act simultaneously with known high-risk subjects.
  • Port drift analysis is an analysis of whether a transport object will change the travel location within a certain period of time.
  • peer analysis and port drift analysis are performed on the original training data to acquire the feature factors (ie, target training data) required for model training, and provide technical support for subsequent model training.
  • S13 Split the target training data according to a preset time to obtain a training set and a test set.
  • The training set is the sample data set used for learning.
  • A classifier is built by fitting parameters; that is, the target training data in the training set is used to train the machine learning model and determine its parameters.
  • The test set is used to test the discriminative ability of the trained machine learning model, such as its recognition rate.
  • The preset time is a time set in advance for dividing the target training data.
  • The preset time may be set, without limitation, according to historical experience, or by counting trips by travel time for the transport objects in the original training data and selecting the time intervals whose trip counts rank in the top n (n is a positive integer).
  • For example, if statistics show that transport objects travel most often from May to August, so that these months rank first among all travel times, then travel times in May, June, July, and August are taken as the preset time. Further, to ensure the predictive ability of the risk model over time, the target training data with travel times in May and June is selected as the training set, and the target training data with travel times in July and August is used as the test set.
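  • The time-based split in step S13 can be sketched as follows. This is a minimal illustration that assumes each row carries a `travel_month` field (a hypothetical name) and that the preset time is chosen as the n busiest months; the May/June versus July/August division follows the example in the text.

```python
from collections import Counter


def busiest_months(rows, n):
    """Return the n months with the most trips, in calendar order."""
    counts = Counter(row["travel_month"] for row in rows)
    return sorted(month for month, _ in counts.most_common(n))


def split_by_travel_month(rows, train_months, test_months):
    """Earlier months form the training set, later months the test set."""
    train = [row for row in rows if row["travel_month"] in train_months]
    test = [row for row in rows if row["travel_month"] in test_months]
    return train, test


# Toy data: one row per trip.
rows = [{"travel_month": m} for m in (5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 3)]
preset_months = busiest_months(rows, n=4)
training_set, test_set = split_by_travel_month(
    rows, train_months=preset_months[:2], test_months=preset_months[2:]
)
```

  • Splitting by time rather than at random means the model is always evaluated on trips that occur after the trips it was trained on, which is what the text refers to as ensuring predictive ability over time.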
  • A decision tree is a tree structure used for classification, in which each internal node represents a test on an attribute (i.e., a dimension feature), each edge represents a test result, and each leaf node represents a class or a class distribution.
  • The input to decision tree construction is a set of examples with category labels, and the result of construction is a binary tree or a multiway tree.
  • An edge is a branch corresponding to one outcome of the logical judgment.
  • the decision tree algorithm can make feasible and effective results for large data sources in a relatively short period of time, which can improve the accuracy of the risk model, and the decision tree only needs to be constructed once and reused, which improves the efficiency of the risk model.
  • The target risk model is the original risk model once testing with the target training data in the test set shows that its accuracy reaches the preset accuracy.
  • the original risk model is tested by using the target training data in the test set to obtain the corresponding accuracy; if the accuracy reaches the preset accuracy, the original risk model is taken as the target risk model.
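  • The test in step S15 can be sketched as a simple accuracy check. The model is represented by any callable that maps features to a risk value; `toy_model`, the `s_value` field, and the threshold of 0.7 are hypothetical. The source does not say what happens when the preset accuracy is not reached, so the sketch simply returns `None` in that case.

```python
def accuracy(model, test_rows):
    """Share of test rows whose predicted risk value matches the label."""
    correct = sum(1 for features, label in test_rows if model(features) == label)
    return correct / len(test_rows)


def select_target_model(model, test_rows, preset_accuracy):
    """Return the model as the target risk model if it is accurate enough."""
    acc = accuracy(model, test_rows)
    return (model if acc >= preset_accuracy else None), acc


def toy_model(features):
    # Hypothetical stand-in for a trained original risk model.
    return 1 if features["s_value"] > 1 else 0


test_rows = [
    ({"s_value": 3.0}, 1),
    ({"s_value": 0.5}, 0),
    ({"s_value": 2.0}, 0),
    ({"s_value": 4.0}, 1),
]
target_model, acc = select_target_model(toy_model, test_rows, preset_accuracy=0.7)
```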
  • The original training data is obtained by performing risk value labeling on the historical travel data, and peer analysis and port drift analysis are then performed on the original training data to acquire the feature factors required for model training, that is, the target training data. The target training data is then split according to the preset time to obtain the training set and the test set, which ensures the predictive ability of the model over time.
  • the decision tree algorithm is used to train the target training data in the training set to obtain the original risk model.
  • The decision tree algorithm can produce feasible and effective prediction results for large data sources in a relatively short time, improving the accuracy of the risk model, and the decision tree only needs to be built once and can then be used repeatedly, improving the recognition efficiency of the risk model.
  • the original risk model is tested by using the test set to obtain the target risk model, and the accuracy of the risk model is further improved, so that the auxiliary recognition effect of the target risk model is better.
  • In step S12, peer analysis and port drift analysis are performed on the original training data to obtain target training data, which specifically includes the following steps:
  • S121 Perform peer analysis on the original training data to obtain peer characteristics.
  • The peer feature is a feature obtained by peer analysis of the original training data. Since known high-risk objects may have accomplices, and accomplices act at the same time, peer analysis can lock onto high-risk user groups, effectively assisting the business party in apprehending a large number of accomplices of high-risk objects.
  • S122 Perform port drift analysis on the original training data to obtain port drift characteristics.
  • The port drift feature is a feature obtained by port drift analysis of the original training data. Statistics on the historical travel data of high-risk objects lead to the conclusion that high-risk objects generally do not change their travel location, so the port drift feature obtained by port drift analysis of the original training data can be used as a feature factor of the risk model. For example, a suspect will frequently commit crimes at the same place within a certain period of time. Port drift analysis can therefore effectively assist the business party in determining whether a transport object is high risk.
  • the peer characteristics obtained by peer analysis and the port drift characteristics obtained by port drift analysis are added as feature factors to the model training to obtain intermediate training data required for model training.
  • the peer feature and the port drift feature are added as feature factors to the risk model training, so that the recognition effect of the subsequent risk model based on the target training data acquisition is better.
  • S124 Perform missing value processing and discrete variable encoding on the intermediate training data to obtain target training data.
  • Missing value processing works as follows: if the missing value of a piece of intermediate training data is large, the data is discarded directly; if the missing value is small, the missing entries are filled with the median.
  • The missing value of the intermediate training data refers to the ratio of the number of feature factors with missing attribute values for a transport object to the total number of feature factors for that transport object. For example, if the missing value of the feature factors (gender or age) of a transport object in the intermediate training data is greater than a preset threshold, the data is discarded directly; if the missing value is not greater than the preset threshold, each missing feature factor is filled with the median of that feature factor over all the intermediate training data.
  • For example, if the attribute value of the age feature of a transport object is missing, it is filled with the median age of all transport objects in the intermediate training data.
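  • The missing value rule above can be sketched as follows. This is a minimal illustration: the field names and the threshold of 0.5 are assumptions, and the median is computed from the observed values of all rows before any row is discarded, which is one reading of "the median of all the intermediate training data".

```python
from statistics import median


def handle_missing(rows, features, max_missing_ratio=0.5):
    """Drop rows with too many missing features; median-fill the rest."""
    # Median of each feature over all rows that have a value for it.
    medians = {}
    for feature in features:
        observed = [row[feature] for row in rows if row.get(feature) is not None]
        if observed:
            medians[feature] = median(observed)

    cleaned = []
    for row in rows:
        missing = [f for f in features if row.get(f) is None]
        if len(missing) / len(features) > max_missing_ratio:
            continue  # missing ratio too large: discard the row
        filled = dict(row)
        for feature in missing:
            if feature in medians:
                filled[feature] = medians[feature]
        cleaned.append(filled)
    return cleaned


features = ["age", "gender", "trips"]
rows = [
    {"age": 30, "gender": 0, "trips": 4},
    {"age": None, "gender": 1, "trips": 2},
    {"age": None, "gender": None, "trips": None},
    {"age": 50, "gender": 1, "trips": 6},
]
cleaned = handle_missing(rows, features)
```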
  • Discrete variable coding refers to encoding variables to make them easy to calculate.
  • the coding of the discrete variable gender is 0 (male) and 1 (female).
  • the target training data is obtained by performing missing value processing and discrete variable encoding on the intermediate training data to facilitate calculation and improve the efficiency of model training.
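  • Discrete variable encoding can be sketched as a lookup from category to integer code. The gender coding (0 for male, 1 for female) is the one given in the text; the helper name `encode_discrete` and the row layout are hypothetical.

```python
def encode_discrete(rows, feature, codes):
    """Return copies of the rows with `feature` replaced by its integer code."""
    return [{**row, feature: codes[row[feature]]} for row in rows]


GENDER_CODES = {"male": 0, "female": 1}  # coding stated in the text

rows = [{"gender": "male", "age": 30}, {"gender": "female", "age": 41}]
encoded = encode_discrete(rows, "gender", GENDER_CODES)
```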
  • Further, the target training data obtained by discrete variable encoding is also subjected to abnormal value processing. An abnormal value (outlier) is a value of any feature (such as age) in the target training data that lies outside the standard range, that is, greater than or less than the standard range.
  • Abnormal value processing of the encoded target training data specifically includes: identifying whether the value of any feature in the target training data is an abnormal value and, if so, converting the attribute value of the feature into the value of the corresponding quantile, so that the target risk model subsequently trained on the target training data is fault-tolerant.
  • Specifically, one way of processing outliers (values that are too large or too small) is as follows: if the attribute value of a variable (gender or age) of a sample (i.e., target training data) is greater than the 99th percentile of that variable, the attribute value is set to the 99th percentile; similarly, if the attribute value is less than the 1st percentile of the variable, it is set to the 1st percentile.
  • A quantile is a numerical point that divides the probability distribution range of a random variable into several equal parts. Commonly used quantiles include the median (i.e., the 2-quantile), quartiles, and percentiles. That is, after all the data of the population (i.e., the target training data) are arranged in ascending order, the quantiles are the values of the variable at each of the equal-division positions.
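  • The percentile capping described above can be sketched with the standard library. The function name is hypothetical, and the choice of the "inclusive" interpolation method is an assumption, since the source does not say how the percentiles are computed.

```python
from statistics import quantiles


def clip_to_percentiles(values, lower_pct=1, upper_pct=99):
    """Cap values below the lower percentile and above the upper percentile."""
    # quantiles(..., n=100) returns the 99 cut points between percentiles.
    cuts = quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(value, low), high) for value in values]


ages = list(range(101))  # toy data: 0, 1, ..., 100
clipped = clip_to_percentiles(ages)
```

  • With this toy data the 1st and 99th percentiles are 1 and 99, so only the two extreme values are changed.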
  • Peer analysis is performed on the original training data to obtain peer characteristics, so that high-risk user groups can be locked onto, effectively assisting the business party in apprehending a large number of accomplices of high-risk objects.
  • Since the travel location of a high-risk object generally does not change, statistics are collected on the historical travel data of high-risk objects (i.e., the original training data) to obtain the port drift characteristic of the high-risk objects.
  • The peer characteristics obtained by peer analysis and the port drift characteristics obtained by port drift analysis are added as feature factors to the model training to obtain intermediate training data.
  • Missing value processing and discrete variable encoding are performed on the intermediate training data to obtain the target training data, which facilitates calculation and improves the efficiency of risk model training.
  • In step S121, peer analysis is performed on the original training data to obtain peer characteristics, which specifically includes the following steps:
  • All historical travel data marked with a high risk value (i.e., the original training data corresponding to the high risk value) is selected from the original training data, and the historical travel time in that data is counted, providing technical support for obtaining peer characteristics based on the historical travel time.
  • S1212 Divide the historical travel time into intervals to obtain peer characteristics.
  • The historical travel time is divided into intervals to identify the time periods during which high-risk objects travel frequently. For example, if the historical travel time of a high-risk object is concentrated in April and May, then April to May is used as the peer feature of the high-risk object, providing technical support for subsequently modeling the peer feature as a feature factor.
  • All historical travel data of high-risk objects is first selected from the original training data, and the travel time in that data is counted. The travel time is then divided into intervals, that is, statistics determine in which time periods high-risk objects act frequently, providing technical support for modeling the peer feature as a feature factor.
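  • Steps S1211 and S1212 can be sketched as follows. This is a minimal illustration, not the patented procedure: months are used as the intervals, a month counts as "frequent" when it holds at least a share `min_share` of high-risk trips, and the field names `travel_month`, `risk`, and `peer` are all assumptions.

```python
from collections import Counter


def peer_months(labeled_rows, min_share=0.3):
    """Months in which high-risk objects travel frequently."""
    months = [row["travel_month"] for row in labeled_rows if row["risk"] == 1]
    if not months:
        return []
    counts = Counter(months)
    return sorted(m for m, c in counts.items() if c / len(months) >= min_share)


def add_peer_feature(rows, months):
    """Flag each row whose travel month falls in a high-risk period."""
    return [{**row, "peer": int(row["travel_month"] in months)} for row in rows]


labeled_rows = (
    [{"travel_month": m, "risk": 1} for m in (4, 4, 5, 5, 5, 9)]
    + [{"travel_month": m, "risk": 0} for m in (1, 4, 7)]
)
frequent = peer_months(labeled_rows)
with_peer = add_peer_feature(labeled_rows, frequent)
```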
  • In step S122, port drift analysis is performed on the original training data to obtain the port drift feature, which specifically includes the following steps:
  • S1221 Count the number of trips and the number of place changes of the original training data of all high-risk values within a preset time.
  • the preset time is the same as the travel time of the peer feature, so that the peer feature and the port drift feature are correlated, and the accuracy of the model recognition is improved.
  • The S value of the high-risk object is the port drift feature, which is used as the decision threshold for identifying high-risk users.
  • The S value calculated as the port drift feature of high-risk objects is generally greater than 1, which indicates that high-risk objects generally do not change their travel location.
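  • The formula for the S value is not reproduced in this text. The sketch below therefore rests on an explicit assumption: S is taken to be the number of trips divided by the number of location changes within the preset time, which is consistent with the statement that S is generally greater than 1 for objects that rarely change location. The function names and the guard against zero changes are likewise hypothetical.

```python
def count_location_changes(locations):
    """Number of consecutive trips that depart from a different location."""
    return sum(1 for prev, cur in zip(locations, locations[1:]) if prev != cur)


def s_value(locations):
    """Assumed port drift measure: trips per location change."""
    trips = len(locations)
    changes = count_location_changes(locations)
    # Assumption: treat zero changes as one to avoid division by zero.
    return trips / max(changes, 1)


stable = s_value(["P1", "P1", "P1", "P2", "P2"])   # 5 trips, 1 change
drifting = s_value(["P1", "P2", "P3", "P4"])       # 4 trips, 3 changes
```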
  • In step S14, the decision tree algorithm is used to train the target training data in the training set to obtain the original risk model, which specifically includes the following steps:
  • The hierarchical parameter is the maximum number of layers the decision tree may grow, that is, the condition for stopping the splitting of the decision tree. It keeps the decision tree from growing without limit, which prevents the model from overfitting, yields feasible and effective prediction results for large data sources in a relatively short time, and improves the accuracy of model recognition.
  • S142 The target training data in the training set is trained by using the CART algorithm, and the original risk model is obtained when the growth layer of the decision tree reaches the hierarchical parameter.
  • The CART (Classification And Regression Tree) algorithm uses binary recursive partitioning to divide the current sample set into two sub-sample sets, so that every non-leaf node of the generated decision tree has exactly two branches. Since the decision tree generated by the CART algorithm is a simple binary tree, the CART algorithm is applicable to scenarios where the sample features have a yes or no value. Specifically, training the target training data in the training set with the CART algorithm corresponds to the growth process of the decision tree. The CART algorithm normally includes a growth process and a pruning process; in the present embodiment, tree growth is restricted by the initialized hierarchical parameter, so the pruning process of the CART algorithm is not required.
  • In step S142, the CART algorithm is used to train the target training data in the training set to obtain the original risk model, which specifically includes the following steps:
  • The CART algorithm calculates the Gini coefficient corresponding to each dimension feature using the formulas $\mathrm{Gini}(D) = 1 - \sum_{k} p_k^2$ and $\mathrm{Gini}(D, a) = \sum_{v} \frac{|D^v|}{|D|}\,\mathrm{Gini}(D^v)$, where $D$ is the training set, $a$ is the dimension feature, such as the peer feature and the port drift feature in this embodiment, $p_k$ is the probability that the target training data in the training set belongs to the k-th dimension feature, and $D^v$ represents the set of all samples in $D$ whose value of the dimension feature $a$ is $a^v$.
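  • The Gini calculation can be sketched as follows, assuming the standard CART definitions: Gini(D) is one minus the sum of squared shares, and the Gini coefficient of a feature is the size-weighted average of Gini over the subsets of D that share each attribute value. In this sketch the share p_k is interpreted as the fraction of samples carrying the k-th risk value, which is the usual reading in CART; the field names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2 over the label shares in D."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_index(rows, feature, label="risk"):
    """Gini(D, a) = sum_v |D^v| / |D| * Gini(D^v) for feature a."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[label])
    n = len(rows)
    return sum(len(group) / n * gini(group) for group in groups.values())


rows = (
    [{"peer": 1, "risk": r} for r in (1, 1, 1, 0)]
    + [{"peer": 0, "risk": r} for r in (0, 0, 0, 0)]
)
overall = gini([row["risk"] for row in rows])
by_peer = gini_index(rows, "peer")
```

  • A lower value of `gini_index` means the feature separates the risk values more cleanly, which is why the feature with the minimum Gini coefficient is chosen for splitting.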
  • The dimension feature with the minimum Gini coefficient and its corresponding attribute value are selected as the optimal feature and the optimal segmentation point (i.e., the optimal attribute value), which serve as the root node of the decision tree.
  • S1423 The step of calculating the Gini coefficient corresponding to the dimension feature is repeatedly performed based on the root node of the decision tree, and the original risk model is obtained when the growth layer of the decision tree reaches the condition of the hierarchical parameter.
  • The root node of the decision tree divides the target training data into N parts, where N depends on the number of attribute values of the root node. The step of calculating the Gini coefficient corresponding to the dimension feature (step S1421) is then repeated to calculate the Gini coefficients of the remaining dimension features under the root node, until the number of growth layers of the decision tree reaches the hierarchical parameter, at which point the growth process stops and the original risk model is obtained.
  • The hierarchical parameter corresponding to the decision tree algorithm is initialized first, so that the decision tree does not grow without limit. This prevents the model from overfitting, yields feasible and effective results for large data sources in a relatively short time, and improves the accuracy of the model.
  • the CART algorithm is used to train the target training data in the training set, that is, the growth process of the decision tree.
  • As the decision tree grows, the Gini coefficient of each dimension feature is calculated, and the dimension feature with the minimum Gini coefficient and its corresponding attribute value are selected as the optimal feature and the optimal segmentation point, which serve as the root node for the growth of the decision tree. Iteration then continues until the number of growth layers of the decision tree reaches the hierarchical parameter, at which point growth stops and the original risk model is obtained.
  • Risk values are marked on the historical travel data to obtain the original training data, and peer analysis is then performed on the original training data to obtain the peer characteristics. Peer analysis can lock onto high-risk user groups, effectively assisting the business party in apprehending a large number of accomplices of high-risk objects.
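  • The growth process of steps S141 to S1423 can be sketched in plain Python. This is an illustrative sketch under assumptions, not the patented implementation: it uses binary threshold splits (value <= threshold), stops at `max_depth` (standing in for the hierarchical parameter) without pruning, and labels each leaf with the majority risk value. All function and field names are hypothetical.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, features):
    """Return (weighted_gini, feature, threshold) with minimum Gini, or None."""
    best = None
    for feature in features:
        for value in sorted({x[feature] for x, _ in rows}):
            left = [y for x, y in rows if x[feature] <= value]
            right = [y for x, y in rows if x[feature] > value]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, feature, value)
    return best


def grow(rows, features, max_depth, depth=0):
    """Grow the tree until it is pure or reaches max_depth layers."""
    labels = [y for _, y in rows]
    can_split = depth < max_depth and gini(labels) > 0
    split = best_split(rows, features) if can_split else None
    if split is None:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    _, feature, value = split
    left = [(x, y) for x, y in rows if x[feature] <= value]
    right = [(x, y) for x, y in rows if x[feature] > value]
    return {
        "feature": feature,
        "value": value,
        "left": grow(left, features, max_depth, depth + 1),
        "right": grow(right, features, max_depth, depth + 1),
    }


def predict(tree, x):
    while "leaf" not in tree:
        branch = "left" if x[tree["feature"]] <= tree["value"] else "right"
        tree = tree[branch]
    return tree["leaf"]


def tree_depth(tree):
    if "leaf" in tree:
        return 0
    return 1 + max(tree_depth(tree["left"]), tree_depth(tree["right"]))


train = [
    ({"peer": 1, "s_value": 4.0}, 1),
    ({"peer": 1, "s_value": 3.0}, 1),
    ({"peer": 1, "s_value": 0.8}, 0),
    ({"peer": 0, "s_value": 3.5}, 0),
    ({"peer": 0, "s_value": 0.5}, 0),
    ({"peer": 0, "s_value": 0.9}, 0),
]
model = grow(train, ["peer", "s_value"], max_depth=2)
```

  • Calling `predict(model, {"peer": 1, "s_value": 5.0})` walks from the root to a leaf and returns the risk value stored there. Limiting depth up front is what lets this sketch skip the pruning stage, mirroring the reasoning in the text.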
  • the port drift characteristics of high-risk objects are obtained, that is, high-risk objects generally do not change the travel location.
  • The peer characteristics obtained by peer analysis and the port drift characteristics obtained by port drift analysis are added as feature factors to the model training to obtain the intermediate training data, so that the risk model trained on this data has a better recognition effect.
  • Missing value processing and discrete variable encoding are performed on the intermediate training data to obtain the target training data, which facilitates calculation and improves the efficiency of risk model training.
  • the target training data is split according to the preset time, and the training set and the test set are obtained, which ensures the prediction ability of the model in time.
  • the decision tree algorithm is used to train the target training data in the training set to obtain the original risk model.
  • The decision tree algorithm can produce feasible and effective results for large data sources in a relatively short time, improving the accuracy of the risk model, and the decision tree only needs to be constructed once and can then be used repeatedly, improving the recognition efficiency of the risk model.
  • the original risk model is tested by using the test set to obtain the target risk model, and the accuracy of the risk model is further improved, so that the auxiliary effect of the target risk model is better.
  • Fig. 6 is a block diagram showing the principle of the risk model training device corresponding to the risk model training method of the first embodiment.
  • the risk model training device includes an original training data acquisition module 11, a target training data acquisition module 12, a target training data division module 13, an original risk model acquisition module 14, and a target risk model acquisition module 15.
  • The functions implemented by the original training data acquisition module 11, the target training data acquisition module 12, the target training data division module 13, the original risk model acquisition module 14, and the target risk model acquisition module 15 correspond one-to-one to the steps of the risk model training method in Embodiment 1; to avoid redundancy, they are not described in detail in this embodiment.
  • the original training data obtaining module 11 is configured to perform risk value labeling on historical travel data to obtain original training data.
  • the target training data obtaining module 12 is configured to perform peer analysis and port drift analysis on the original training data to obtain target training data.
  • the target training data dividing module 13 is configured to split the target training data according to a preset time to obtain a training set and a test set.
  • the original risk model obtaining module 14 is configured to use the decision tree algorithm to train the target training data in the training set to obtain the original risk model.
  • the target risk model obtaining module 15 is configured to test the original risk model by using the test set to obtain the target risk model.
  • the target training data acquisition module 12 includes a peer feature acquisition unit 121, a port drift feature acquisition unit 122, an intermediate training data acquisition unit 123, and a target training data acquisition unit 124.
  • the peer feature acquiring unit 121 is configured to perform peer analysis on the original training data to acquire peer features.
  • the port drift feature acquiring unit 122 is configured to perform port drift analysis on the original training data to obtain a port drift feature.
  • the intermediate training data acquiring unit 123 is configured to acquire intermediate training data based on the peer feature and the port drift feature.
  • the target training data obtaining unit 124 is configured to perform missing value processing and discrete variable encoding on the intermediate training data to obtain target training data.
  • the peer feature acquisition unit 121 includes a historical travel time acquisition unit 1211 and a peer feature acquisition unit 1212.
  • the historical travel time obtaining unit 1211 is configured to acquire historical travel time corresponding to the original training data of all high risk values.
  • the peer feature acquisition unit 1212 is configured to divide the historical travel time into sections and acquire the peer features.
  • the port drift feature acquisition unit 122 includes an original training data statistics unit 1221 and a port drift feature acquisition unit 1222.
  • the original training data statistic unit 1221 is configured to count the number of trips and the number of location changes of the original training data of all high risk values within a preset time.
  • the original risk model acquisition module 14 includes an algorithm parameter initial unit 141 and an original risk model acquisition unit 142.
  • the algorithm parameter initial unit 141 is configured to initialize a hierarchical parameter corresponding to the decision tree algorithm.
  • the original risk model obtaining unit 142 is configured to train the target training data in the training set by using the CART algorithm, and obtain the original risk model when the number of growing layers of the decision tree reaches the hierarchical parameter.
  • the target training data includes features of at least two dimensions.
  • the original risk model acquisition unit 142 includes a Gini coefficient acquisition sub-unit 1421, a root node acquisition sub-unit 1422, and an original risk model acquisition sub-unit 1423.
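  • the Gini coefficient acquisition sub-unit 1421 is configured to calculate the Gini coefficient corresponding to each dimension feature using the formulas $Gini(D) = 1 - \sum_{k} p_k^2$ and $Gini(D, A) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)$, where
  • $D$ is the training set,
  • $A$ is the dimension feature, with $D_1$ and $D_2$ the subsets of $D$ obtained by splitting on $A$, and
  • $p_k$ is the probability corresponding to the dimension feature.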
  • the root node obtaining sub-unit 1422 is configured to select a dimension feature corresponding to the minimum Gini coefficient as a root node of the decision tree.
  • the original risk model obtaining sub-unit 1423 is configured to repeatedly perform the step of calculating the Gini coefficient corresponding to the dimension feature based on the root node of the decision tree, and obtain the original risk model when the number of growth layers of the decision tree reaches the condition of the hierarchical parameter.
  • Fig. 7 is a flow chart showing the risk identification method in this embodiment.
  • the risk identification method can be applied to computer devices of judicial institutions or other institutions to check the travel data of a transport object, so as to assist the business side in analyzing the risk level of the transport object.
  • the risk identification method includes the following steps:
  • S21: Obtain the travel data to be identified.
  • the travel data to be identified refers to behavior data of the transport object, collected in real time, that is used to identify whether a risk exists.
  • the travel data to be identified includes, but is not limited to, the travel time, travel location, and inspection situation of the transport object, as well as basic characteristics of the transport object itself (for example, gender and age).
  • the inspection situation refers to whether the transport object was checked for risk before the current risk identification.
  • S22: Input the travel data to be identified into the target risk model for identification, and obtain the risk identification result.
  • the target risk model is a model obtained by using the risk model training method of Embodiment 1; using the target risk model to identify the travel data to be identified makes the risk identification result more accurate.
  • the travel data to be identified is input into the target risk model, which evaluates it and outputs the risk identification result. Specifically, after acquiring the travel data to be identified of a transport object A, the computer device runs the decision process of the target risk model on that data and outputs the identification result.
  • in this embodiment, the travel data to be identified is obtained and input into the target risk model for identification to obtain the risk identification result. This makes the identification of the travel data more accurate and helps the business side quickly pinpoint high-risk users so that timely action can be taken.
  • Fig. 8 is a block diagram showing the principle of the risk identification device corresponding to the risk identification method in the third embodiment.
  • the risk identification device includes a to-be-identified travel data acquisition module 21 and a risk identification result acquisition module 22.
  • the functions implemented by the to-be-identified travel data acquisition module 21 and the risk identification result acquisition module 22 correspond one-to-one to the steps of the risk identification method in Embodiment 3; to avoid redundancy, they are not described in detail in this embodiment.
  • the to-be-identified travel data acquisition module 21 is configured to obtain the travel data to be identified.
  • the risk identification result acquisition module 22 is configured to input the travel data to be identified into the target risk model for identification, and obtain the risk identification result;
  • the target risk model is a model obtained by using the risk model training method in Embodiment 1.
  • this embodiment provides one or more non-volatile readable storage media storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to implement the risk model training method in Embodiment 1. To avoid repetition, details are not described here again.
  • or, when the computer readable instructions are executed by one or more processors, the one or more processors implement the risk identification method in Embodiment 3. To avoid repetition, details are not described here again.
  • or, when the computer readable instructions are executed by one or more processors, the one or more processors implement the functions of the modules/units in the risk identification device of Embodiment 4. To avoid repetition, details are not described here again.
  • FIG. 9 is a schematic diagram of a computer device according to an embodiment of the present application.
  • computer device 90 of this embodiment includes a processor 91, a memory 92, and computer readable instructions 93 stored in memory 92 and executable on processor 91.
  • when the processor 91 executes the computer readable instructions 93, the steps of the risk model training method in Embodiment 1 are implemented. To avoid repetition, details are not described here.
  • or, when the processor 91 executes the computer readable instructions 93, the functions of the modules/units of the risk model training device in Embodiment 2 are implemented. To avoid repetition, details are not described here.
  • or, when the processor 91 executes the computer readable instructions 93, the steps of the risk identification method in Embodiment 3 are implemented. To avoid repetition, details are not described here.
  • or, when the processor 91 executes the computer readable instructions 93, the functions of the modules/units of the risk identification device in Embodiment 4 are implemented. To avoid repetition, details are not described here.
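The root node selection described for sub-units 1421 and 1422 (compute a Gini coefficient for each dimension feature, then take the feature with the minimum value as the root) can be sketched as follows. This is only an illustrative sketch that assumes the standard CART definitions of the Gini index; the feature names `peer_flag` and `drift_flag`, the toy rows, and the helper names are hypothetical and do not come from the application.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(rows, labels, feature, value):
    """Weighted Gini of the binary split (feature == value) vs (feature != value)."""
    left = [y for x, y in zip(rows, labels) if x[feature] == value]
    right = [y for x, y in zip(rows, labels) if x[feature] != value]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_root(rows, labels):
    """Return (feature, score) for the feature with the minimum split Gini."""
    scores = {}
    for feature in rows[0]:
        values = {row[feature] for row in rows}
        scores[feature] = min(gini_split(rows, labels, feature, v) for v in values)
    feature = min(scores, key=scores.get)
    return feature, scores[feature]

# Hypothetical toy data: label 1 = high risk value, 0 = low risk value.
rows = [
    {"peer_flag": 1, "drift_flag": 0},
    {"peer_flag": 1, "drift_flag": 1},
    {"peer_flag": 0, "drift_flag": 0},
    {"peer_flag": 0, "drift_flag": 1},
]
labels = [1, 1, 0, 0]
root_feature, root_score = best_root(rows, labels)
```

In this toy data `peer_flag` separates the two classes perfectly, so it would be chosen as the root; repeating the same selection on each child until the depth limit is reached gives the layer-bounded growth described for sub-unit 1423.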

Landscapes

  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Engineering & Computer Science (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Tourism & Hospitality (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Physics & Mathematics (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Development Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Educational Administration (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Disclosed in the present application are a risk model training method and apparatus, a risk identification method and apparatus, a device, and a medium. The risk model training method comprises: labeling historical travel data with risk values to obtain original training data; performing peer (simultaneous action) analysis and port drift analysis on the original training data to obtain target training data; splitting the target training data according to a preset time to obtain a training set and a test set; training on the target training data in the training set by means of a decision tree algorithm to obtain an original risk model; and testing the original risk model by means of the test set to obtain a target risk model. The risk model training method addresses the low identification efficiency and poor accuracy of existing risk models.

Description

Risk model training method, risk identification method, apparatus, device, and medium
This patent application is based on, and claims priority to, Chinese invention patent application No. 201810250156.3, filed on March 26, 2018 and entitled "Risk Model Training Method, Risk Identification Method, Apparatus, Device, and Medium".
Technical Field
The present application relates to the field of data prediction, and in particular to a risk model training method, a risk identification method, an apparatus, a device, and a medium.
Background
At present, risk models in the transportation industry are mainly used to identify the risk of transport objects, in particular to train on and identify the crime risk of transport objects. The factors of existing transportation risk models have little influence on the model. For example, an existing risk model includes model factors such as the transport object's travel time, travel location, gender, date of birth, and document type. These model factors are few in number and carry little risk-related information, so a risk model trained only on them has low identification efficiency and low identification accuracy.
Summary
Embodiments of the present application provide a risk model training method, a risk identification method, an apparatus, a device, and a medium, to solve the problem that current risk models have low identification efficiency and low accuracy.
An embodiment of the present application provides a risk model training method, including:
labeling historical travel data with risk values to obtain original training data;
performing peer analysis and port drift analysis on the original training data to obtain target training data;
splitting the target training data according to a preset time to obtain a training set and a test set;
training on the target training data in the training set using a decision tree algorithm to obtain an original risk model;
testing the original risk model using the test set to obtain a target risk model.
An embodiment of the present application provides a risk model training apparatus, including:
an original training data acquisition module, configured to label historical travel data with risk values to obtain original training data;
a target training data acquisition module, configured to perform peer analysis and port drift analysis on the original training data to obtain target training data;
a target training data division module, configured to split the target training data according to a preset time to obtain a training set and a test set;
an original risk model acquisition module, configured to train on the target training data in the training set using a decision tree algorithm to obtain an original risk model;
a target risk model acquisition module, configured to test the original risk model using the test set to obtain a target risk model.
An embodiment of the present application provides a computer device, including a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer readable instructions:
labeling historical travel data with risk values to obtain original training data;
performing peer analysis and port drift analysis on the original training data to obtain target training data;
splitting the target training data according to a preset time to obtain a training set and a test set;
training on the target training data in the training set using a decision tree algorithm to obtain an original risk model;
testing the original risk model using the test set to obtain a target risk model.
An embodiment of the present application provides one or more non-volatile readable storage media storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
labeling historical travel data with risk values to obtain original training data;
performing peer analysis and port drift analysis on the original training data to obtain target training data;
splitting the target training data according to a preset time to obtain a training set and a test set;
training on the target training data in the training set using a decision tree algorithm to obtain an original risk model;
testing the original risk model using the test set to obtain a target risk model.
An embodiment of the present application provides a risk identification method, including:
obtaining travel data to be identified;
inputting the travel data to be identified into the target risk model for identification to obtain a risk identification result;
wherein the target risk model is a model obtained using the risk model training method.
An embodiment of the present application provides a risk identification apparatus, including:
a to-be-identified travel data acquisition module, configured to obtain travel data to be identified;
a risk identification result acquisition module, configured to input the travel data to be identified into the target risk model for identification to obtain a risk identification result;
wherein the target risk model is a model obtained using the risk model training method.
An embodiment of the present application provides a computer device, including a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer readable instructions:
obtaining travel data to be identified;
inputting the travel data to be identified into the target risk model for identification to obtain a risk identification result;
wherein the target risk model is a model obtained using the risk model training method.
An embodiment of the present application provides one or more non-volatile readable storage media storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining travel data to be identified;
inputting the travel data to be identified into the target risk model for identification to obtain a risk identification result;
wherein the target risk model is a model obtained using the risk model training method.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
Brief Description of the Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of the risk model training method provided in Embodiment 1 of the present application;
FIG. 2 is a detailed schematic diagram of step S12 in FIG. 1;
FIG. 3 is a detailed schematic diagram of step S121 in FIG. 2;
FIG. 4 is a detailed schematic diagram of step S122 in FIG. 2;
FIG. 5 is a detailed schematic diagram of step S14 in FIG. 1;
FIG. 6 is a schematic block diagram of the risk model training apparatus provided in Embodiment 2 of the present application;
FIG. 7 is a flowchart of the risk identification method provided in Embodiment 3 of the present application;
FIG. 8 is a schematic block diagram of the risk identification apparatus provided in Embodiment 4 of the present application;
FIG. 9 is a schematic diagram of the computer device provided in Embodiment 6 of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
Embodiment 1
FIG. 1 shows a flowchart of the risk model training method in this embodiment. The risk model training method can be applied to computer devices of judicial institutions or other institutions, so that the trained risk model can be used to identify transport objects (such as passengers) on a means of transport. This effectively assists the business side in analyzing the risk level of a transport object and helps ensure the safety of the other transport objects on the means of transport. As shown in FIG. 1, the risk model training method includes the following steps:
S11: Label historical travel data with risk values to obtain original training data.
Here, historical travel data is the travel data of transport objects obtained from the business side. It includes, but is not limited to, travel time, gender, age, inspection situation, and travel location. Original training data is the training data obtained by labeling the historical travel data with risk values. In this embodiment, the historical travel data includes the historical travel data of low-risk objects and of high-risk objects. The risk values include a high risk value and a low risk value: the historical travel data of low-risk objects is labeled with the low risk value and that of high-risk objects with the high risk value, yielding the original training data. Each item of original training data consists of historical travel data and its corresponding risk value.
S12: Perform peer analysis and port drift analysis on the original training data to obtain target training data.
Here, target training data is the data required for model training. Peer analysis is a dedicated analysis of the behavioral characteristics of groups that act at the same time as known high-risk objects. Port drift analysis analyzes whether a transport object changes its travel location within a certain period of time. In this embodiment, peer analysis and port drift analysis are performed on the original training data to obtain the feature factors required for model training (that is, the target training data), providing technical support for subsequent model training.
S13: Split the target training data according to a preset time to obtain a training set and a test set.
Here, the training set is the data set of learning samples; a classifier is built by fitting parameters, that is, the target training data in the training set is used to train a machine learning model and determine its parameters. The test set is used to test the discriminative ability of the trained machine learning model, for example its recognition rate. The preset time is a time set in advance for dividing the target training data. In this embodiment, the preset time may be set, without limitation, according to historical experience, or by collecting statistics on the travel times of the transport objects in the original training data and selecting the time intervals whose numbers of trips rank in the top n (n being a positive integer). For example, if travel-time statistics on the historical data show that the numbers of trips in May through August rank highest among all travel times, then May, June, July, and August are selected as the preset time. Further, to ensure the predictive ability of the risk model over time, the target training data with travel times in May and June is selected as the training set, and the target training data with travel times in July and August is used as the test set.
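The time-based split in step S13 can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the `travel_date` field, the year 2017, and the function name `split_by_month` are hypothetical, while the month choices (May and June for training, July and August for testing) follow the example in the text.

```python
from datetime import date

def split_by_month(records, train_months, test_months):
    """Split records into a training set and a test set by travel month."""
    train = [r for r in records if r["travel_date"].month in train_months]
    test = [r for r in records if r["travel_date"].month in test_months]
    return train, test

# Hypothetical target training data; only the travel date matters here.
records = [
    {"id": 1, "travel_date": date(2017, 5, 3)},
    {"id": 2, "travel_date": date(2017, 6, 21)},
    {"id": 3, "travel_date": date(2017, 7, 9)},
    {"id": 4, "travel_date": date(2017, 8, 30)},
    {"id": 5, "travel_date": date(2017, 2, 14)},  # outside the preset time, so unused
]

train_set, test_set = split_by_month(records, {5, 6}, {7, 8})
```

Because the test months come strictly after the training months, the test accuracy reflects how the model behaves on later, unseen travel data, which is the point of splitting by time rather than at random.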
S14: Train on the target training data in the training set using a decision tree algorithm to obtain an original risk model.
Here, a decision tree (also called a judgment tree) is a tree structure used for classification. Each internal node represents a test on an attribute (that is, a dimension feature), each edge represents a test outcome, and each leaf node represents a class or a class distribution. The input for constructing a decision tree is a set of examples with class labels, and the result is a binary tree or a multiway tree. An internal (non-leaf) node of a binary tree is generally expressed as a logical test, for example of the form $a = a_j$, where $a$ is a feature factor and $a_j$ (an attribute value) ranges over the values of that feature factor; the edges of the tree are the branch outcomes of the logical test. A decision tree algorithm can produce feasible and effective results on large data sources in a relatively short time, which can improve the accuracy of the risk model. Moreover, a decision tree needs to be built only once and can then be used repeatedly, which improves the efficiency of the risk model.
S15: Test the original risk model using the test set to obtain a target risk model.
Here, the target risk model is the original risk model once testing it with the target training data in the test set shows that its accuracy reaches a preset accuracy. Specifically, the original risk model is tested with the target training data in the test set to obtain the corresponding accuracy; if the accuracy reaches the preset accuracy, the original risk model is taken as the target risk model.
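The acceptance check in step S15 can be sketched as follows. This is a rough sketch only: the model is represented as a plain callable, and the names `accuracy`, `accept_model`, `toy_model`, the `location_changes` feature, and the threshold 0.7 are all hypothetical. The text does not say what happens when the preset accuracy is not reached, so the sketch simply returns `None` in that case.

```python
def accuracy(model, test_set):
    """Fraction of test samples whose predicted risk value matches the label."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)

def accept_model(model, test_set, preset_accuracy):
    """Return (model, accuracy) if the preset accuracy is reached, else (None, accuracy)."""
    acc = accuracy(model, test_set)
    return (model if acc >= preset_accuracy else None), acc

# Hypothetical stand-in for an original risk model: 1 = high risk, 0 = low risk.
def toy_model(features):
    return 1 if features["location_changes"] == 0 else 0

test_set = [
    ({"location_changes": 0}, 1),
    ({"location_changes": 2}, 0),
    ({"location_changes": 0}, 0),  # misclassified by toy_model
    ({"location_changes": 1}, 0),
]

target_model, acc = accept_model(toy_model, test_set, preset_accuracy=0.7)
```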
In this embodiment, historical travel data is labeled with risk values to obtain original training data, and peer analysis and port drift analysis are performed on the original training data to obtain the feature factors required for model training, that is, the target training data. The target training data is then split according to a preset time into a training set and a test set, which ensures the predictive ability of the model over time. A decision tree algorithm is used to train on the target training data in the training set to obtain an original risk model. The decision tree algorithm can produce feasible and effective predictions on large data sources in a relatively short time, which improves the accuracy of the risk model, and a decision tree needs to be built only once and can then be used repeatedly, which improves the identification efficiency of the risk model. Finally, the original risk model is tested with the test set to obtain the target risk model, further improving the accuracy of the risk model so that the target risk model provides better assistance in identification.
In a specific implementation, as shown in FIG. 2, step S12, that is, performing peer analysis and port drift analysis on the original training data to obtain target training data, specifically includes the following steps:
S121: Perform peer analysis on the original training data to obtain peer features.
Here, peer features are the features obtained by performing peer analysis on the original training data. Because known high-risk objects have accomplices, and accomplices act at the same time, peer analysis can pin down high-risk user groups and effectively help the business side catch a large number of accomplices of high-risk objects.
S122: Perform port drift analysis on the original training data to obtain port drift features.
Here, port drift features are the features obtained by performing port drift analysis on the original training data. Statistics on the historical travel data of high-risk objects lead to the conclusion that high-risk objects generally do not change their travel location, so the port drift features obtained by performing port drift analysis on the original training data can serve as feature factors of the risk model. For example, a suspect may commit offenses frequently at the same location within a certain period of time; port drift analysis can therefore effectively help the business side determine whether a transport object is high-risk.
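One way to turn port drift analysis into numeric features is to count, for each transport object, the number of trips and the number of changes of travel location within a preset time window. The sketch below assumes that reading; the trip layout (date, location), the location labels, the window, and the function name `port_drift_counts` are hypothetical.

```python
from datetime import date

def port_drift_counts(trips, start, end):
    """Return (number of trips, number of location changes) within [start, end].

    trips is a list of (travel_date, location) pairs for one transport object.
    A location change is counted whenever two consecutive trips, in date
    order, use different locations.
    """
    in_window = sorted((d, loc) for d, loc in trips if start <= d <= end)
    changes = sum(
        1 for (_, a), (_, b) in zip(in_window, in_window[1:]) if a != b
    )
    return len(in_window), changes

# Hypothetical trips of one transport object.
trips = [
    (date(2017, 5, 1), "port_A"),
    (date(2017, 5, 8), "port_A"),
    (date(2017, 5, 15), "port_B"),
    (date(2017, 9, 1), "port_C"),  # outside the window
]

n_trips, n_changes = port_drift_counts(trips, date(2017, 5, 1), date(2017, 6, 30))
```

A traveler with many trips and few location changes matches the pattern the text attributes to high-risk objects, so the two counts (or their ratio) could be passed to the model as feature factors.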
S123: Obtain intermediate training data based on the peer features and the port drift features.
Specifically, the peer features obtained by peer analysis and the port drift features obtained by port drift analysis are added to model training as feature factors, yielding the intermediate training data required for model training. In this embodiment, adding the peer features and the port drift features as feature factors makes the risk model subsequently obtained from the target training data identify risk more effectively.
S124: Perform missing value processing and discrete variable encoding on the intermediate training data to obtain target training data.
Here, missing value processing works as follows: if the missing value of an item of intermediate training data is large, that item is discarded directly; if it is small, the median is filled in. The missing value of intermediate training data is the ratio of the number of feature factors whose attribute values are missing for a given transport object to the total number of feature factors for that transport object. For example, if the missing value of a transport object in the intermediate training data (for feature factors such as gender or age) is greater than a preset threshold, that data is discarded directly; if it is not greater than the preset threshold, each missing attribute value is filled with the median of that feature factor over all the intermediate training data. For instance, if the attribute value of the age feature of a transport object is missing, it is filled with the median age of all transport objects in the intermediate training data.
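The missing value rule above can be sketched as follows. This is a minimal sketch under assumptions: records are dictionaries in which `None` marks a missing attribute value, and the feature names, the threshold 0.5, and the function name `handle_missing` are hypothetical.

```python
from statistics import median

def handle_missing(records, features, threshold):
    """Drop records whose missing ratio exceeds threshold; fill the rest with medians."""
    # Median of each feature over all records that have a value for it.
    medians = {
        f: median(r[f] for r in records if r.get(f) is not None)
        for f in features
    }
    cleaned = []
    for r in records:
        missing_ratio = sum(r.get(f) is None for f in features) / len(features)
        if missing_ratio > threshold:
            continue  # too many missing feature factors: discard the record
        cleaned.append(
            {f: r[f] if r.get(f) is not None else medians[f] for f in features}
        )
    return cleaned

features = ["age", "gender", "trips"]
records = [
    {"age": 30, "gender": 0, "trips": 4},
    {"age": None, "gender": 1, "trips": 2},     # 1/3 missing: age is filled
    {"age": None, "gender": None, "trips": 6},  # 2/3 missing: discarded
    {"age": 50, "gender": 1, "trips": 8},
]

cleaned = handle_missing(records, features, threshold=0.5)
```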
离散变量编码是指对变量进行编码,使其容易计算。例如对于离散变量性别的编码为0(男)和1(女)。本实施例中,通过对中间训练数据进行缺失值处理和离散变量编码,获取目标训练数据,以方便计算,提高模型训练的效率。Discrete variable coding refers to encoding variables to make them easy to calculate. For example, the coding of the discrete variable gender is 0 (male) and 1 (female). In this embodiment, the target training data is obtained by performing missing value processing and discrete variable encoding on the intermediate training data to facilitate calculation and improve the efficiency of model training.
Further, after discrete variable encoding is performed on the intermediate training data, outlier processing is also performed on the encoded target training data. A value of any feature (such as age) in the target training data is an outlier if it lies outside the standard range, that is, above or below it. In this embodiment, outlier processing on the encoded target training data specifically includes: identifying whether the value of any feature in the target training data is an outlier and, if so, converting the attribute value of that feature into the value of the corresponding quantile, so that the target risk model subsequently trained on the target training data is fault-tolerant. For example, to handle outliers (values that are too large or too small), if the attribute value of a variable (such as gender or age) of a sample (that is, a piece of target training data) is greater than the 99th percentile of that variable, the attribute value is forcibly set to the 99th percentile; similarly, if the attribute value is less than the 1st percentile of that variable, it is forcibly set to the 1st percentile. A quantile, also called a quantile point, is a numerical point that divides the probability distribution range of a random variable into several equal parts; commonly used quantiles include the median (the 2-quantile), quartiles, and percentiles. In other words, the quantiles are the variable values located at the equal-division positions after all the data of the population (that is, the target training data) are sorted in ascending order.
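The percentile clipping described above can be illustrated as follows. This is a sketch under stated assumptions: the function name `clip_to_percentiles` is hypothetical, and the percentiles are computed with the standard library's `statistics.quantiles` using its "inclusive" interpolation, which the source does not specify.

```python
from statistics import quantiles

def clip_to_percentiles(values, lower_pct=1, upper_pct=99):
    """Force values below the 1st percentile up to it and values above
    the 99th percentile down to it."""
    # n=100 yields 99 cut points; cuts[i - 1] is the i-th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, lo), hi) for v in values]

# Hypothetical feature column with values 0..100.
ages = list(range(101))
clipped = clip_to_percentiles(ages)
```

Values inside the 1st to 99th percentile band are left unchanged; only the extremes are pulled back to the band's edges.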
In this embodiment, peer analysis is first performed on the original training data to obtain the peer feature, so that high-risk user groups can be locked onto through peer analysis, effectively helping the business party catch a large number of associates of high-risk objects. Because high-risk objects generally do not change their travel locations, the historical travel data of high-risk objects (that is, the original training data) is then analyzed statistically to obtain their port drift feature. Next, the peer feature obtained through peer analysis and the port drift feature obtained through port drift analysis are added to model training as feature factors to obtain the intermediate training data. Finally, missing value processing and discrete variable encoding are performed on the intermediate training data to obtain the target training data, which simplifies calculation and improves the efficiency of risk model training.
在一具体实施方式中,如图3所示,步骤S121中,即对原始训练数据进行同行分析,获取同行特征,具体包括如下步骤:In a specific implementation, as shown in FIG. 3, in step S121, peer analysis is performed on the original training data to obtain peer characteristics, which specifically includes the following steps:
S1211:获取所有高风险值的原始训练数据对应的历史出行时间。S1211: Obtain historical travel time corresponding to the original training data of all high risk values.
Specifically, all historical travel data labeled with a high risk value (that is, the original training data corresponding to high risk values) is selected from the original training data, and the historical travel times in that data are collected, providing technical support for subsequently obtaining the peer feature from those historical travel times.
S1212:对历史出行时间进行区间划分,获取同行特征。S1212: Interval division of historical travel time to obtain peer characteristics.
具体地,对历史出行时间进行区间划分,即统计高风险对象在哪一段时间内频繁出行。例如,若某高风险对象的历史出行时间集中在4月和5月,则将4-5月作为该高风险对象的同行特征,为后续将同行特征作为特征因子进行建模提供技术支持。Specifically, the historical travel time is divided into sections, that is, the time period during which the high-risk object is frequently traveled. For example, if the historical travel time of a high-risk object is concentrated in April and May, then April-May is used as the peer feature of the high-risk object, and technical support is provided for subsequent modeling of the peer feature as a feature factor.
In this embodiment, all historical travel data labeled as belonging to high-risk objects is first selected from the original training data, and the travel times in that historical travel data are collected. The historical travel times are then divided into intervals, that is, the periods in which high-risk objects travel frequently are determined, providing technical support for subsequently modeling with the peer feature as a feature factor.
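One way to picture the interval division of steps S1211 and S1212 is to bucket the travel dates of high-risk objects by month and keep the months that account for a large share of trips. The sketch below is illustrative only: the monthly granularity, the function name `frequent_months`, and the 30% cut-off are assumptions, since the source does not say how "frequent" is decided.

```python
from collections import Counter
from datetime import date

def frequent_months(travel_dates, min_share=0.3):
    """Return the months that hold at least `min_share` (assumed cut-off)
    of all trips, in ascending order."""
    counts = Counter(d.month for d in travel_dates)
    total = sum(counts.values())
    return sorted(m for m, c in counts.items() if c / total >= min_share)

# Hypothetical historical travel times of one high-risk object.
dates = [
    date(2017, 4, 3), date(2017, 4, 18),
    date(2017, 5, 2), date(2017, 5, 20), date(2017, 5, 27),
    date(2017, 11, 9),
]
peer_feature = frequent_months(dates)
```

With these dates, trips cluster in April and May, which matches the April to May example given in the text.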
在一具体实施方式中,如图4所示,步骤S122中,即对原始训练数据进行口岸漂移分析,获取口岸漂移特征,具体包括如下步骤:In a specific implementation, as shown in FIG. 4, in step S122, the port drift analysis is performed on the original training data to obtain the port drift feature, which specifically includes the following steps:
S1221:统计所有高风险值的原始训练数据在预设时间内的出行次数和地点变更次数。S1221: Count the number of trips and the number of place changes of the original training data of all high-risk values within a preset time.
Specifically, the historical travel data of all transport objects labeled with a high risk value (that is, high-risk objects) is selected from the original training data, and the number of trips and the number of location changes of these high-risk objects within the preset time are counted. In this embodiment, the preset time is the same as the travel time of the peer feature, so that the peer feature and the port drift feature are associated with each other, improving the accuracy of model recognition.
S1222:采用公式S=Y/X对出行次数和地点变更次数进行计算,获取口岸漂移特征。其中,Y为出行地点变更次数,X为出行次数。S1222: Calculate the number of trips and the number of locations change using the formula S=Y/X to obtain the port drift feature. Where Y is the number of travel location changes and X is the number of trips.
Specifically, the port drift feature is calculated by the formula S=Y/X, where X is the total number of trips of the transport object within the preset time and Y is the number of travel location changes of the transport object within the preset time. The S value calculated for a high-risk object is its port drift feature, which serves as a decision threshold for identifying high-risk users. In this embodiment, the S values obtained by calculating the port drift feature of high-risk objects are generally greater than 1, from which it is concluded that high-risk objects generally do not change their travel locations.
In this embodiment, the historical travel data of all transport objects labeled with a high risk value (that is, high-risk objects) is selected from the original training data, and the number of trips and the number of location changes of the high-risk objects within the preset time are counted, so that the port drift feature of the high-risk objects can be calculated with the formula S=Y/X. This captures the characteristic that high-risk objects generally do not change their travel locations and helps the business party determine whether a transport object is a high-risk object.
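A small sketch of the S=Y/X calculation follows. The source does not define how a location change is counted, so this example assumes that a change occurs whenever two consecutive trips use different ports; the function name `port_drift` and the port labels are hypothetical. Under this particular counting rule S cannot exceed 1, so the typical values quoted in the text presumably rest on a different counting convention.

```python
def port_drift(ports):
    """Port drift feature S = Y / X for one transport object.

    X is the number of trips in the preset time window and Y is the
    number of location changes (assumed: consecutive trips that use
    different ports).
    """
    x = len(ports)
    if x == 0:
        return 0.0  # no trips in the window, nothing to measure
    y = sum(1 for prev, cur in zip(ports, ports[1:]) if prev != cur)
    return y / x

# Hypothetical sequence of ports used within the preset time.
s = port_drift(["A", "A", "B", "A", "A"])
```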
在一具体实施方式中,如图5所示,步骤S14中,即采用决策树算法对训练集中的目标训练数据进行训练,获取原始风险模型,具体包括如下步骤:In a specific implementation, as shown in FIG. 5, in step S14, the decision tree algorithm is used to train the target training data in the training set to obtain the original risk model, which specifically includes the following steps:
S141:初始化决策树算法对应的层级参数。S141: Initialize a hierarchical parameter corresponding to the decision tree algorithm.
The level parameter is the maximum number of layers the decision tree may grow, that is, it initializes the condition under which the decision tree stops splitting. This keeps the tree from growing without bound and prevents the model from overfitting, so that feasible and effective predictions can be made on large data sources in a relatively short time, improving the accuracy of model recognition.
S142:采用CART算法对训练集中的目标训练数据进行训练,在决策树的生长层数达到层级参数时,获取原始风险模型。S142: The target training data in the training set is trained by using the CART algorithm, and the original risk model is obtained when the growth layer of the decision tree reaches the hierarchical parameter.
The CART (Classification And Regression Tree) algorithm learns by binary recursive partitioning: it always splits the current sample set into two sub-sample sets, so that every non-leaf node of the resulting decision tree has exactly two branches. Because the decision tree generated by the CART algorithm is a structurally simple binary tree, the CART algorithm is suited to scenarios in which sample features take yes-or-no values. Specifically, training on the target training data in the training set with the CART algorithm is the process of growing the decision tree. The CART algorithm normally comprises a growing process and a pruning process, but in this embodiment the growth of the tree is limited by the initialized level parameter, so the pruning process of the CART algorithm is not needed.
在一具体实施方式中,步骤S142中,即采用CART算法对训练集中的目标训练数据进行训练,在决策树的生长层数达到层级参数时,获取原始风险模型,具体包括如下步骤:In a specific implementation, in step S142, the CART algorithm is used to train the target training data in the training set. When the number of growing layers of the decision tree reaches the hierarchical parameter, the original risk model is obtained, which specifically includes the following steps:
S1421: Calculate the Gini coefficient corresponding to each dimension feature using the formulas

$$\mathrm{Gini}(D) = 1 - \sum_{k} P_k^{2}$$

and

$$\mathrm{Gini}(D, \alpha) = \sum_{v} \frac{|D^{v}|}{|D|}\,\mathrm{Gini}(D^{v}),$$

where $D$ is the training set, $\alpha$ is the dimension feature, and $P_k$ is the probability of the dimension feature.
Specifically, the CART algorithm uses the two formulas above to calculate the Gini coefficient corresponding to a dimension feature, where $D$ is the training set; $\alpha$ is a dimension feature, such as the peer feature or the port drift feature in this embodiment; $P_k$ is the probability that the target training data in the training set belongs to the k-th dimension feature; and $D^{v}$ is the set of all samples in $D$ whose value on the dimension feature $\alpha$ is $\alpha^{v}$.
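The Gini calculation of step S1421 can be illustrated with a short sketch. It assumes the usual CART reading in which $P_k$ is the share of samples carrying the k-th risk label and the per-feature value is the size-weighted Gini of the subsets $D^{v}$; the function names, feature names, and sample data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum over k of P_k squared."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(rows, labels, feature):
    """Gini(D, a) = sum over v of |D^v| / |D| * Gini(D^v)."""
    groups = {}
    for row, label in zip(rows, labels):
        # D^v: the labels of all samples whose value on `feature` is v.
        groups.setdefault(row[feature], []).append(label)
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())

# Hypothetical encoded samples with two dimension features.
rows = [
    {"peer": 1, "drift": 0},
    {"peer": 1, "drift": 1},
    {"peer": 0, "drift": 0},
    {"peer": 0, "drift": 1},
]
labels = ["high", "high", "low", "low"]

scores = {f: gini_index(rows, labels, f) for f in ("peer", "drift")}
best_feature = min(scores, key=scores.get)
```

In this toy data the "peer" feature separates the labels perfectly, so it has the smaller Gini coefficient and would be chosen first.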
S1422:选取最小基尼系数对应的维度特征作为决策树的根节点。S1422: Select the dimension feature corresponding to the minimum Gini coefficient as the root node of the decision tree.
具体地,选取最小基尼系数对应的维度特征和对应的属性值(如性别对应的属性值为“男”和“女”)作为最优的特征和最优切分点(即最优属性值)作为决策树生长的根节点。Specifically, the dimension feature corresponding to the minimum Gini coefficient and the corresponding attribute value (for example, the attribute values corresponding to the gender are “male” and “female”) are selected as the optimal feature and the optimal segmentation point (ie, the optimal attribute value). As the root node of the decision tree growth.
S1423:基于决策树的根节点,重复执行计算维度特征所对应的基尼系数的步骤,直至决策树的生长层数达到层级参数的条件时,获取原始风险模型。S1423: The step of calculating the Gini coefficient corresponding to the dimension feature is repeatedly performed based on the root node of the decision tree, and the original risk model is obtained when the growth layer of the decision tree reaches the condition of the hierarchical parameter.
具体地,基于决策树根节点会将目标训练数据分为N部分,N取决于根节点的属性值的数量,然后重复执行计算维度特征所对应的基尼系数的步骤即步骤S1421,计算剩余维度特征在根节点作用下的基尼系数,直至决策树的生长层数达到层级参数的条件时,停止决策树的生长过程,获取原始风险模型。Specifically, the root node of the decision tree divides the target training data into N parts, N depends on the number of attribute values of the root node, and then repeats the step of calculating the Gini coefficient corresponding to the dimension feature, that is, step S1421, and calculates the remaining dimension features. The Gini coefficient under the action of the root node stops the growth process of the decision tree and obtains the original risk model until the growth layer of the decision tree reaches the condition of the hierarchical parameter.
In this embodiment, the level parameter corresponding to the decision tree algorithm is initialized first, which keeps the decision tree from growing without bound and prevents the model from overfitting, so that feasible and effective results can be obtained on large data sources in a relatively short time and the accuracy of the model is improved. The CART algorithm is then used to train on the target training data in the training set, which is the process of growing the decision tree. As the tree grows, the Gini coefficient of each dimension feature is calculated, and the dimension feature and attribute value with the minimum Gini coefficient are selected as the optimal feature and optimal split point to form the root node of the decision tree. The process is then iterated until the number of layers of the decision tree reaches the level parameter, at which point growth stops and the original risk model is obtained.
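The growth procedure of steps S1421 to S1423 can be sketched as a small recursive routine. This is a simplified illustration, not the applicant's code: it assumes binary splits of the form "feature equals value", represents the tree as nested dictionaries, and uses `max_depth` as a stand-in for the level parameter; all names and the sample data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of risk labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (score, feature, value) with the minimum weighted Gini,
    or None when no split separates the samples."""
    best = None
    for feature in rows[0]:
        for value in sorted({r[feature] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[feature] == value]
            right = [l for r, l in zip(rows, labels) if r[feature] != value]
            if not left or not right:
                continue
            score = (len(left) * gini(left)
                     + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, feature, value)
    return best

def grow(rows, labels, max_depth, depth=0):
    """Grow the tree until `max_depth` (the level parameter) is reached."""
    split = None
    if depth < max_depth and len(set(labels)) > 1:
        split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    _, feature, value = split
    yes = [(r, l) for r, l in zip(rows, labels) if r[feature] == value]
    no = [(r, l) for r, l in zip(rows, labels) if r[feature] != value]
    return {
        "feature": feature,
        "value": value,
        "yes": grow([r for r, _ in yes], [l for _, l in yes], max_depth, depth + 1),
        "no": grow([r for r, _ in no], [l for _, l in no], max_depth, depth + 1),
    }

rows = [
    {"peer": 1, "drift": 0},
    {"peer": 1, "drift": 1},
    {"peer": 0, "drift": 0},
    {"peer": 0, "drift": 1},
]
labels = ["high", "high", "low", "low"]
tree = grow(rows, labels, max_depth=2)
```

Because the depth limit stops growth early, no separate pruning pass is included, which mirrors the reasoning given for omitting the pruning process.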
In this embodiment, risk values are labeled on the historical travel data to obtain the original training data, and peer analysis is performed on the original training data to obtain the peer feature; peer analysis makes it possible to lock onto high-risk user groups and effectively helps the business party catch a large number of associates of high-risk objects. The historical travel data of high-risk objects is analyzed statistically to obtain their port drift feature, namely that high-risk objects generally do not change their travel locations. The peer feature obtained through peer analysis and the port drift feature obtained through port drift analysis are then added to model training as feature factors to obtain the intermediate training data, so that the risk model obtained from this data recognizes risk more effectively. Missing value processing and discrete variable encoding are performed on the intermediate training data to obtain the target training data, which simplifies calculation and improves the efficiency of risk model training. The target training data is then split according to a preset time to obtain a training set and a test set, which ensures the predictive ability of the model over time. A decision tree algorithm is used to train on the target training data in the training set to obtain the original risk model. The decision tree algorithm can produce feasible and effective results on large data sources in a relatively short time, improving the accuracy of the risk model, and the decision tree only needs to be built once and can be used repeatedly, improving the recognition efficiency of the risk model. Finally, the original risk model is tested with the test set to obtain the target risk model, which further improves the accuracy of the risk model and makes the target risk model a more effective aid.
It should be understood that the sequence numbers of the steps in the above embodiment do not imply an order of execution. The execution order of each process should be determined by its function and internal logic and should not limit the implementation of the embodiments of this application in any way.
实施例2Example 2
FIG. 6 shows a schematic block diagram of a risk model training device in one-to-one correspondence with the risk model training method in Embodiment 1. As shown in FIG. 6, the risk model training device includes an original training data acquisition module 11, a target training data acquisition module 12, a target training data division module 13, an original risk model acquisition module 14, and a target risk model acquisition module 15. The functions implemented by the original training data acquisition module 11, the target training data acquisition module 12, the target training data division module 13, the original risk model acquisition module 14, and the target risk model acquisition module 15 correspond one-to-one to the steps of the risk model training method in Embodiment 1; to avoid repetition, they are not detailed one by one in this embodiment.
原始训练数据获取模块11,用于对历史出行数据进行风险值标注,获取原始训练数据。The original training data obtaining module 11 is configured to perform risk value labeling on historical travel data to obtain original training data.
目标训练数据获取模块12,用于对原始训练数据进行同行分析和口岸漂移分析,获取目标训练数据。The target training data obtaining module 12 is configured to perform peer analysis and port drift analysis on the original training data to obtain target training data.
目标训练数据划分模块13,用于按照预设时间对目标训练数据进行拆分,获取训练集和测试集。The target training data dividing module 13 is configured to split the target training data according to a preset time to obtain a training set and a test set.
原始风险模型获取模块14,用于采用决策树算法对训练集中的目标训练数据进行训练,获取原始风险模型。The original risk model obtaining module 14 is configured to use the decision tree algorithm to train the target training data in the training set to obtain the original risk model.
目标风险模型获取模块15,用于采用测试集对原始风险模型进行测试,获取目标风险模型。The target risk model obtaining module 15 is configured to test the original risk model by using the test set to obtain the target risk model.
优选地,目标训练数据获取模块12包括同行特征获取单元121、口岸漂移特征获取单元122、中间训练数据获取单元123和目标训练数据获取单元124。Preferably, the target training data acquisition module 12 includes a peer feature acquisition unit 121, a port drift feature acquisition unit 122, an intermediate training data acquisition unit 123, and a target training data acquisition unit 124.
同行特征获取单元121,用于对原始训练数据进行同行分析,获取同行特征。The peer feature acquiring unit 121 is configured to perform peer analysis on the original training data to acquire peer features.
口岸漂移特征获取单元122,用于对原始训练数据进行口岸漂移分析,获取口岸漂移特征。The port drift feature acquiring unit 122 is configured to perform port drift analysis on the original training data to obtain a port drift feature.
中间训练数据获取单元123,用于基于同行特征和口岸漂移特征,获取中间训练数据。The intermediate training data acquiring unit 123 is configured to acquire intermediate training data based on the peer feature and the port drift feature.
目标训练数据获取单元124,用于对中间训练数据进行缺失值处理和离散变量编码,获取目标训练数据。The target training data obtaining unit 124 is configured to perform missing value processing and discrete variable encoding on the intermediate training data to obtain target training data.
优选地,同行特征获取单元121包括历史出行时间获取单元1211和同行特征获取单元1212。Preferably, the peer feature acquisition unit 121 includes a historical travel time acquisition unit 1211 and a peer feature acquisition unit 1212.
历史出行时间获取单元1211,用于获取所有高风险值的原始训练数据对应的历史出行时间。The historical travel time obtaining unit 1211 is configured to acquire historical travel time corresponding to the original training data of all high risk values.
同行特征获取单元1212,用于对历史出行时间进行区间划分,获取同行特征。The peer feature acquisition unit 1212 is configured to divide the historical travel time into sections and acquire the peer features.
Preferably, the port drift feature acquisition unit 122 includes an original training data statistics unit 1221 and a port drift feature acquisition unit 1222.
The original training data statistics unit 1221 is configured to count the number of trips and the number of location changes, within a preset time, of the original training data of all high risk values.
The port drift feature acquisition unit 1222 is configured to calculate the port drift feature from the number of trips and the number of location changes using the formula S=Y/X, where Y is the number of travel location changes and X is the number of trips.
优选地,原始风险模型获取模块14包括算法参数初始单元141和原始风险模型获取单元142。Preferably, the original risk model acquisition module 14 includes an algorithm parameter initial unit 141 and an original risk model acquisition unit 142.
算法参数初始单元141,用于初始化决策树算法对应的层级参数。The algorithm parameter initial unit 141 is configured to initialize a hierarchical parameter corresponding to the decision tree algorithm.
原始风险模型获取单元142,用于采用CART算法对训练集中的目标训练数据进行训练,在决策树的生长层数达到层级参数时,获取原始风险模型。The original risk model obtaining unit 142 is configured to train the target training data in the training set by using the CART algorithm, and obtain the original risk model when the number of growing layers of the decision tree reaches the hierarchical parameter.
优选地,目标训练数据包括至少两个维度特征。Preferably, the target training data includes at least two dimensional features.
原始风险模型获取单元142包括基尼系数获取子单元1421、根节点获取子单元1422和原始风险模型获取子单元1423。The original risk model acquisition unit 142 includes a Gini coefficient acquisition sub-unit 1421, a root node acquisition sub-unit 1422, and an original risk model acquisition sub-unit 1423.
The Gini coefficient acquisition sub-unit 1421 is configured to calculate the Gini coefficient corresponding to each dimension feature using the formulas

$$\mathrm{Gini}(D) = 1 - \sum_{k} P_k^{2}$$

and

$$\mathrm{Gini}(D, \alpha) = \sum_{v} \frac{|D^{v}|}{|D|}\,\mathrm{Gini}(D^{v}),$$

where $D$ is the training set, $\alpha$ is the dimension feature, and $P_k$ is the probability of the dimension feature.
根节点获取子单元1422,用于选取最小基尼系数对应的维度特征作为决策树的根节点。The root node obtaining sub-unit 1422 is configured to select a dimension feature corresponding to the minimum Gini coefficient as a root node of the decision tree.
原始风险模型获取子单元1423,用于基于决策树的根节点,重复执行计算维度特征所对应的基尼系数的步骤,直至决策树的生长层数达到层级参数的条件时,获取原始风险模型。The original risk model obtaining sub-unit 1423 is configured to repeatedly perform the step of calculating the Gini coefficient corresponding to the dimension feature based on the root node of the decision tree, and obtain the original risk model when the number of growth layers of the decision tree reaches the condition of the hierarchical parameter.
实施例3Example 3
FIG. 7 shows a flowchart of the risk identification method in this embodiment. The risk identification method can be applied on computer devices of judicial institutions or other institutions to check the historical travel data of transport objects, helping the business party analyze the risk level of a transport object. As shown in FIG. 7, the risk identification method includes the following steps:
S21: Acquire the travel data to be identified.
The travel data to be identified is behavior data collected in real time while a transport object is traveling and is used to identify whether the transport object poses a risk. It includes, but is not limited to, the travel time, travel location, and inspection status of the transport object, as well as the basic characteristics of the transport object itself (for example, gender and age). Specifically, the inspection status indicates whether the object has already been inspected for risk before risk identification is performed on the transport object.
S22: Input the travel data to be identified into the target risk model for identification, and obtain the risk identification result.
The target risk model is the model obtained with the risk model training method of Embodiment 1; identifying the travel data to be identified with this model makes the risk identification result more accurate.
In this embodiment, the travel data to be identified is input into the target risk model, which makes a decision on the input data and outputs the risk identification result. Specifically, after acquiring the travel data to be identified of transport object A, the computer device passes that data through the target risk model, which makes a decision and outputs the identification result.
In the risk identification method provided by this embodiment, the travel data to be identified is acquired and input into the target risk model for identification to obtain the risk identification result. This makes identification of the travel data more accurate and helps the business party quickly lock onto high-risk users so that measures can be taken in time.
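Steps S21 and S22 amount to walking a record down the trained decision tree until a leaf is reached. The sketch below assumes, purely for illustration, that the target risk model is held as nested dictionaries with "feature", "value", "yes", and "no" keys and that leaves are risk labels; this representation, the function name `identify`, and the sample tree are hypothetical and not taken from the source.

```python
def identify(tree, record):
    """Follow the decision nodes until a leaf (a risk label) is reached."""
    node = tree
    while isinstance(node, dict):
        matches = record[node["feature"]] == node["value"]
        node = node["yes"] if matches else node["no"]
    return node

# Hypothetical trained model: peer feature first, then port drift.
tree = {
    "feature": "peer", "value": 1,
    "yes": "high",
    "no": {"feature": "drift", "value": 0, "yes": "high", "no": "low"},
}

# S21: travel data to be identified; S22: run it through the model.
result = identify(tree, {"peer": 0, "drift": 1})
```

Because the tree is built once during training, identification only needs this lookup, which is consistent with the text's point that the decision tree can be reused after a single build.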
实施例4Example 4
FIG. 8 shows a schematic block diagram of a risk identification device in one-to-one correspondence with the risk identification method in Embodiment 3. As shown in FIG. 8, the risk identification device includes a to-be-identified travel data acquisition module 21 and a risk identification result acquisition module 22. The functions implemented by the to-be-identified travel data acquisition module 21 and the risk identification result acquisition module 22 correspond one-to-one to the steps of the risk identification method in Embodiment 3; to avoid repetition, they are not detailed one by one in this embodiment.
The to-be-identified travel data acquisition module 21 is configured to acquire the travel data to be identified.
The risk identification result acquisition module 22 is configured to input the travel data to be identified into the target risk model for identification and obtain the risk identification result;
其中,目标风险模型是采用实施例1中的风险模型训练方法获取的模型。The target risk model is a model obtained by using the risk model training method in Embodiment 1.
实施例5Example 5
本实施例提供一个或多个存储有计算机可读指令的非易失性可读存储介质,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行时实现实施例1中风险模型训练方法,为避免重复,这里不再赘述。The embodiment provides one or more non-volatile readable storage media having computer readable instructions that, when executed by one or more processors, cause the one or more processors to execute The risk model training method in Embodiment 1 is implemented. To avoid repetition, details are not described herein again.
或者,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行时实现实施例2中风险模型训练装置中各模块/单元的功能,为避免重复,这里不再赘述;Alternatively, when the computer readable instructions are executed by one or more processors, such that the one or more processors execute to implement the functions of the modules/units in the risk model training device of Embodiment 2, in order to avoid duplication, I will not repeat them here;
或者,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行时实现实施例3中风险识别方法,为避免重复,这里不再赘述;Alternatively, when the computer readable instructions are executed by one or more processors, when the one or more processors are executed, the risk identification method in Embodiment 3 is implemented. To avoid repetition, no further details are provided herein;
或者,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行时实现实施例4中风险识别装置中各模块/单元的功能,为避免重复,这里不再赘述。Alternatively, the computer readable instructions are executed by one or more processors such that when executed by the one or more processors, the functions of the modules/units in the risk identification device of Embodiment 4 are implemented, to avoid repetition, here No longer.
实施例6Example 6
FIG. 9 is a schematic diagram of a computer device provided by an embodiment of this application. As shown in FIG. 9, the computer device 90 of this embodiment includes a processor 91, a memory 92, and computer readable instructions 93 stored in the memory 92 and executable on the processor 91. When the processor 91 executes the computer readable instructions 93, the steps of the risk model training method in Embodiment 1 are implemented; to avoid repetition, they are not described again here. Alternatively, when the processor 91 executes the computer readable instructions 93, the functions of the modules/units of the risk model training device in Embodiment 2 are implemented; to avoid repetition, they are not described again here. Alternatively, when the processor 91 executes the computer readable instructions 93, the steps of the risk identification method in Embodiment 3 are implemented; to avoid repetition, they are not described again here. Alternatively, when the processor 91 executes the computer readable instructions 93, the functions of the modules/units of the risk identification device in Embodiment 4 are implemented; to avoid repetition, they are not described again here.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the above division into functional units and modules is only an example. In practical applications, the above functions may be assigned to different functional units or modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.
The above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the scope of protection of the present application.

Claims (20)

  1. A risk model training method, comprising:
    performing risk value labeling on historical travel data to obtain original training data;
    performing co-travel (peer) analysis and port drift analysis on the original training data to obtain target training data;
    splitting the target training data according to a preset time to obtain a training set and a test set;
    training on the target training data in the training set using a decision tree algorithm to obtain an original risk model; and
    testing the original risk model using the test set to obtain a target risk model.
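The time-based split in claim 1 can be illustrated with a minimal sketch. The claim only says the data is split "according to a preset time"; treating that as a single cutoff date, with earlier records used for training and later ones for testing, is an assumption made here, as are the field names `travel_date` and `risk` and the sample records.

```python
from datetime import date

def split_by_time(records, cutoff):
    """Split records into (training set, test set) around a cutoff date.

    Records dated before the cutoff go to the training set; records dated
    on or after it go to the test set. `travel_date` is a hypothetical
    field name, not taken from the patent.
    """
    train = [r for r in records if r["travel_date"] < cutoff]
    test = [r for r in records if r["travel_date"] >= cutoff]
    return train, test

# Made-up sample records for illustration only.
records = [
    {"travel_date": date(2017, 11, 3), "risk": 0},
    {"travel_date": date(2017, 12, 20), "risk": 1},
    {"travel_date": date(2018, 1, 15), "risk": 0},
    {"travel_date": date(2018, 2, 8), "risk": 1},
]
train_set, test_set = split_by_time(records, cutoff=date(2018, 1, 1))
```

Splitting by time rather than at random keeps every test record later than every training record, so the test measures how the model behaves on data it could not have seen.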
  2. The risk model training method according to claim 1, wherein performing co-travel analysis and port drift analysis on the original training data to obtain target training data comprises:
    performing co-travel analysis on the original training data to obtain a co-travel feature;
    performing port drift analysis on the original training data to obtain a port drift feature;
    obtaining intermediate training data based on the co-travel feature and the port drift feature; and
    performing missing value processing and discrete variable encoding on the intermediate training data to obtain the target training data.
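The last step of claim 2 names missing value processing and discrete variable encoding without fixing a technique. The sketch below assumes mode imputation and integer coding, which are only one common choice; the column names and sample rows are hypothetical.

```python
from collections import Counter

def fill_missing(rows, column):
    """Replace None in `column` with the most frequent observed value
    (mode imputation, an assumed strategy)."""
    observed = [r[column] for r in rows if r[column] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [{**r, column: mode if r[column] is None else r[column]} for r in rows]

def encode_discrete(rows, column):
    """Map each distinct category in `column` to an integer code
    (codes follow the sorted order of the categories)."""
    codes = {v: i for i, v in enumerate(sorted({r[column] for r in rows}))}
    return [{**r, column: codes[r[column]]} for r in rows], codes

# Made-up intermediate training data for illustration only.
rows = [
    {"port": "A", "drift": 0.5},
    {"port": None, "drift": 0.0},
    {"port": "B", "drift": 1.0},
    {"port": "A", "drift": 0.25},
]
filled = fill_missing(rows, "port")
encoded, codes = encode_discrete(filled, "port")
```

Imputation runs before encoding so that the missing marker never receives a category code of its own.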
  3. The risk model training method according to claim 2, wherein performing co-travel analysis on the original training data comprises:
    obtaining the historical travel times corresponding to all original training data with a high risk value; and
    dividing the historical travel times into intervals to obtain the co-travel feature.
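One possible reading of the interval division in claim 3 is sketched below: travel times are bucketed into fixed-width intervals, and a record is flagged when its interval also contains a high-risk record. The 30-minute width, the binary flag, and the field names `time` and `risk` are all assumptions; the claim does not specify them.

```python
from datetime import datetime

EPOCH = datetime(1970, 1, 1)

def time_bucket(ts, width_minutes=30):
    """Return the index of the fixed-width interval that contains `ts`."""
    return int((ts - EPOCH).total_seconds() // (width_minutes * 60))

def co_travel_feature(records, width_minutes=30):
    """Return 1 for each record whose travel time falls in an interval that
    contains a high-risk record (risk == 1), else 0. High-risk records
    themselves are therefore always flagged 1."""
    high_risk_buckets = {
        time_bucket(r["time"], width_minutes) for r in records if r["risk"] == 1
    }
    return [
        1 if time_bucket(r["time"], width_minutes) in high_risk_buckets else 0
        for r in records
    ]

# Made-up records for illustration only.
records = [
    {"time": datetime(2018, 1, 5, 9, 10), "risk": 1},
    {"time": datetime(2018, 1, 5, 9, 25), "risk": 0},
    {"time": datetime(2018, 1, 5, 14, 40), "risk": 0},
]
features = co_travel_feature(records)
```

Here the second record shares the 09:00 to 09:30 interval with the high-risk record and is flagged, while the afternoon record is not.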
  4. The risk model training method according to claim 2, wherein performing port drift analysis on the original training data to obtain the port drift feature comprises:
    counting, for all original training data with a high risk value, the number of trips and the number of location changes within a preset time; and
    calculating the port drift feature from the number of trips and the number of location changes using the formula S = Y/X, where Y is the number of travel location changes and X is the number of trips.
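The formula S = Y/X in claim 4 can be worked through on a small example. The sketch assumes that a "location change" means two consecutive trips using different ports and that a traveller with no trips gets a drift of 0.0; neither detail is stated in the claim.

```python
def port_drift(ports):
    """Compute S = Y / X for one traveller's time-ordered list of ports.

    X is the number of trips and Y the number of location changes, counted
    here as consecutive trips through different ports (an assumed definition).
    """
    x = len(ports)
    if x == 0:
        return 0.0  # assumed convention when there are no trips
    y = sum(1 for prev, cur in zip(ports, ports[1:]) if prev != cur)
    return y / x

# Four trips with two changes (A -> B, B -> A) give S = 2 / 4.
s = port_drift(["A", "A", "B", "A"])
```

A traveller who always uses the same port scores 0, and the score rises as the ports used vary more from trip to trip.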
  5. The risk model training method according to claim 1, wherein training on the target training data in the training set using a decision tree algorithm to obtain an original risk model comprises:
    initializing a level parameter corresponding to the decision tree algorithm;
    training on the target training data in the training set using the CART algorithm, and obtaining the original risk model when the number of grown levels of the decision tree reaches the level parameter;
    wherein the target training data comprises at least two dimension features; and
    wherein training on the target training data in the training set using the CART algorithm, and obtaining the original risk model when the number of grown levels of the decision tree reaches the level parameter, comprises:
    calculating a Gini coefficient corresponding to each dimension feature using the formulas
    Figure PCTCN2018094183-appb-100001
    and
    Figure PCTCN2018094183-appb-100002
    where D is the training set, α is the dimension feature, and P_k is the probability of the dimension feature;
    selecting the dimension feature corresponding to the minimum Gini coefficient as the root node of the decision tree; and
    based on the root node of the decision tree, repeating the step of calculating the Gini coefficients corresponding to the dimension features until the number of grown levels of the decision tree reaches the level parameter, and obtaining the original risk model.
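The root selection in claim 5 can be sketched using the textbook CART definitions, Gini(D) = 1 - Σ p_k² for a label set and the size-weighted sum of subset Gini values for a feature. The patent's own formulas survive here only as figure references, so matching them to these definitions is an assumption. The feature names, the `risk` label, and the sample rows are hypothetical, and the features are taken to be binary so that partitioning by value coincides with CART's binary split.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_for_feature(rows, feature, label="risk"):
    """Weighted Gini impurity after partitioning rows by the feature's value."""
    n = len(rows)
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r[label])
    return sum(len(g) / n * gini(g) for g in groups.values())

def pick_root(rows, features, label="risk"):
    """Select the feature with the minimum Gini coefficient as the root."""
    return min(features, key=lambda f: gini_for_feature(rows, f, label))

# Made-up training rows for illustration only.
rows = [
    {"co_travel": 1, "high_drift": 1, "risk": 1},
    {"co_travel": 1, "high_drift": 0, "risk": 1},
    {"co_travel": 0, "high_drift": 1, "risk": 0},
    {"co_travel": 0, "high_drift": 0, "risk": 0},
]
root = pick_root(rows, ["co_travel", "high_drift"])
```

In this toy data `co_travel` separates the labels perfectly (Gini 0) while `high_drift` does not (Gini 0.5), so `co_travel` becomes the root. Growing the full tree would repeat the same selection on each child subset until the depth reaches the level parameter.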
  6. A risk identification method, comprising:
    obtaining travel data to be identified; and
    inputting the travel data to be identified into the target risk model for identification to obtain a risk identification result;
    wherein the target risk model is a model obtained by the risk model training method according to any one of claims 1-5.
  7. A risk model training apparatus, comprising:
    an original training data acquisition module, configured to perform risk value labeling on historical travel data to obtain original training data;
    a target training data acquisition module, configured to perform co-travel (peer) analysis and port drift analysis on the original training data to obtain target training data;
    a target training data division module, configured to split the target training data according to a preset time to obtain a training set and a test set;
    an original risk model acquisition module, configured to train on the target training data in the training set using a decision tree algorithm to obtain an original risk model; and
    a target risk model acquisition module, configured to test the original risk model using the test set to obtain a target risk model.
  8. A risk identification apparatus, comprising:
    a to-be-identified travel data acquisition module, configured to obtain travel data to be identified; and
    a risk identification result acquisition module, configured to input the travel data to be identified into the target risk model for identification to obtain a risk identification result;
    wherein the target risk model is a model obtained by the risk model training method according to any one of claims 1-5.
  9. A computer device, comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer readable instructions, implements the following steps:
    performing risk value labeling on historical travel data to obtain original training data;
    performing co-travel (peer) analysis and port drift analysis on the original training data to obtain target training data;
    splitting the target training data according to a preset time to obtain a training set and a test set;
    training on the target training data in the training set using a decision tree algorithm to obtain an original risk model; and
    testing the original risk model using the test set to obtain a target risk model.
  10. The computer device according to claim 9, wherein performing co-travel analysis and port drift analysis on the original training data to obtain target training data comprises:
    performing co-travel analysis on the original training data to obtain a co-travel feature;
    performing port drift analysis on the original training data to obtain a port drift feature;
    obtaining intermediate training data based on the co-travel feature and the port drift feature; and
    performing missing value processing and discrete variable encoding on the intermediate training data to obtain the target training data.
  11. The computer device according to claim 10, wherein performing co-travel analysis on the original training data comprises:
    obtaining the historical travel times corresponding to all original training data with a high risk value; and
    dividing the historical travel times into intervals to obtain the co-travel feature.
  12. The computer device according to claim 10, wherein performing port drift analysis on the original training data to obtain the port drift feature comprises:
    counting, for all original training data with a high risk value, the number of trips and the number of location changes within a preset time; and
    calculating the port drift feature from the number of trips and the number of location changes using the formula S = Y/X, where Y is the number of travel location changes and X is the number of trips.
  13. The computer device according to claim 9, wherein training on the target training data in the training set using a decision tree algorithm to obtain an original risk model comprises:
    initializing a level parameter corresponding to the decision tree algorithm;
    training on the target training data in the training set using the CART algorithm, and obtaining the original risk model when the number of grown levels of the decision tree reaches the level parameter;
    wherein the target training data comprises at least two dimension features; and
    wherein training on the target training data in the training set using the CART algorithm, and obtaining the original risk model when the number of grown levels of the decision tree reaches the level parameter, comprises:
    calculating a Gini coefficient corresponding to each dimension feature using the formulas
    Figure PCTCN2018094183-appb-100003
    and
    Figure PCTCN2018094183-appb-100004
    where D is the training set, α is the dimension feature, and P_k is the probability of the dimension feature;
    selecting the dimension feature corresponding to the minimum Gini coefficient as the root node of the decision tree; and
    based on the root node of the decision tree, repeating the step of calculating the Gini coefficients corresponding to the dimension features until the number of grown levels of the decision tree reaches the level parameter, and obtaining the original risk model.
  14. A computer device, comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer readable instructions, implements the following steps:
    obtaining travel data to be identified; and
    inputting the travel data to be identified into the target risk model for identification to obtain a risk identification result;
    wherein the target risk model is a model obtained by the risk model training method according to any one of claims 1-5.
  15. One or more non-volatile readable storage media storing computer readable instructions, wherein the computer readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    performing risk value labeling on historical travel data to obtain original training data;
    performing co-travel (peer) analysis and port drift analysis on the original training data to obtain target training data;
    splitting the target training data according to a preset time to obtain a training set and a test set;
    training on the target training data in the training set using a decision tree algorithm to obtain an original risk model; and
    testing the original risk model using the test set to obtain a target risk model.
  16. The non-volatile readable storage medium according to claim 15, wherein performing co-travel analysis and port drift analysis on the original training data to obtain target training data comprises:
    performing co-travel analysis on the original training data to obtain a co-travel feature;
    performing port drift analysis on the original training data to obtain a port drift feature;
    obtaining intermediate training data based on the co-travel feature and the port drift feature; and
    performing missing value processing and discrete variable encoding on the intermediate training data to obtain the target training data.
  17. The non-volatile readable storage medium according to claim 16, wherein performing co-travel analysis on the original training data comprises:
    obtaining the historical travel times corresponding to all original training data with a high risk value; and
    dividing the historical travel times into intervals to obtain the co-travel feature.
  18. The non-volatile readable storage medium according to claim 16, wherein performing port drift analysis on the original training data to obtain the port drift feature comprises:
    counting, for all original training data with a high risk value, the number of trips and the number of location changes within a preset time; and
    calculating the port drift feature from the number of trips and the number of location changes using the formula S = Y/X, where Y is the number of travel location changes and X is the number of trips.
  19. The non-volatile readable storage medium according to claim 15, wherein training on the target training data in the training set using a decision tree algorithm to obtain an original risk model comprises:
    initializing a level parameter corresponding to the decision tree algorithm;
    training on the target training data in the training set using the CART algorithm, and obtaining the original risk model when the number of grown levels of the decision tree reaches the level parameter;
    wherein the target training data comprises at least two dimension features; and
    wherein training on the target training data in the training set using the CART algorithm, and obtaining the original risk model when the number of grown levels of the decision tree reaches the level parameter, comprises:
    calculating a Gini coefficient corresponding to each dimension feature using the formulas
    Figure PCTCN2018094183-appb-100005
    and
    Figure PCTCN2018094183-appb-100006
    where D is the training set, α is the dimension feature, and P_k is the probability of the dimension feature;
    selecting the dimension feature corresponding to the minimum Gini coefficient as the root node of the decision tree; and
    based on the root node of the decision tree, repeating the step of calculating the Gini coefficients corresponding to the dimension features until the number of grown levels of the decision tree reaches the level parameter, and obtaining the original risk model.
  20. One or more non-volatile readable storage media storing computer readable instructions, wherein the computer readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    obtaining travel data to be identified; and
    inputting the travel data to be identified into the target risk model for identification to obtain a risk identification result;
    wherein the target risk model is a model obtained by the risk model training method according to any one of claims 1-5.
PCT/CN2018/094183 2018-03-26 2018-07-03 Risk model training method and apparatus, risk identification method and apparatus, device, and medium WO2019184119A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810250156.3 2018-03-26
CN201810250156.3A CN108549954B (en) 2018-03-26 2018-03-26 Risk model training method, risk identification device, risk identification equipment and risk identification medium

Publications (1)

Publication Number Publication Date
WO2019184119A1 true WO2019184119A1 (en) 2019-10-03

Family

ID=63516935

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/094183 WO2019184119A1 (en) 2018-03-26 2018-07-03 Risk model training method and apparatus, risk identification method and apparatus, device, and medium

Country Status (2)

Country Link
CN (1) CN108549954B (en)
WO (1) WO2019184119A1 (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126434A (en) * 2019-11-19 2020-05-08 山东省科学院激光研究所 Automatic microseism first arrival time picking method and system based on random forest
CN111695824A (en) * 2020-06-16 2020-09-22 深圳前海微众银行股份有限公司 Risk tail end client analysis method, device, equipment and computer storage medium
CN112184241A (en) * 2020-09-27 2021-01-05 ***股份有限公司 Identity authentication method and device
CN112508698A (en) * 2021-02-07 2021-03-16 北京淇瑀信息科技有限公司 User policy triggering method and device and electronic equipment
CN112749924A (en) * 2021-02-01 2021-05-04 深圳无域科技技术有限公司 Wind control model training method, system, equipment and computer readable medium
CN113673866A (en) * 2021-08-20 2021-11-19 上海寻梦信息技术有限公司 Crop decision method, model training method and related equipment
CN113837635A (en) * 2021-09-29 2021-12-24 支付宝(杭州)信息技术有限公司 Risk detection processing method, device and equipment
CN115346665A (en) * 2022-10-19 2022-11-15 南昌大学第二附属医院 Method, system and equipment for constructing retinopathy incidence risk prediction model
CN116579448A (en) * 2022-12-26 2023-08-11 北京码牛科技股份有限公司 Personnel contamination risk prediction method, system, intelligent terminal and storage medium

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222994A (en) * 2018-11-23 2020-06-02 泰康保险集团股份有限公司 Client risk assessment method, device, medium and electronic equipment
CN110033092B (en) * 2019-01-31 2020-06-02 阿里巴巴集团控股有限公司 Data label generation method, data label training device, event recognition method and event recognition device
CN110517154A (en) * 2019-07-23 2019-11-29 平安科技(深圳)有限公司 Data model training method, system and computer equipment
CN110399927B (en) * 2019-07-26 2022-02-01 玖壹叁陆零医学科技南京有限公司 Recognition model training method, target recognition method and device
CN111160733B (en) * 2019-12-16 2024-03-29 北京淇瑀信息科技有限公司 Risk control method and device based on biased sample and electronic equipment
CN111160797A (en) * 2019-12-31 2020-05-15 深圳市分期乐网络科技有限公司 Wind control model construction method and device, storage medium and terminal
CN111310784B (en) * 2020-01-14 2021-07-20 支付宝(杭州)信息技术有限公司 Resource data processing method and device
CN111400663B (en) * 2020-03-17 2022-06-14 深圳前海微众银行股份有限公司 Model training method, device, equipment and computer readable storage medium
CN113159175B (en) * 2021-04-21 2023-06-06 平安科技(深圳)有限公司 Data prediction method, device, equipment and storage medium
CN113139876B (en) * 2021-04-22 2024-06-18 平安壹钱包电子商务有限公司 Risk model training method, risk model training device, computer equipment and readable storage medium
CN113313417B (en) * 2021-06-23 2024-01-26 北京鼎泰智源科技有限公司 Method and device for classifying complaint risk signals based on decision tree model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101169716A (en) * 2007-11-30 2008-04-30 清华大学 Emulated procedure information modeling and maintenance method based on product structural tree
CN101226615A (en) * 2008-02-03 2008-07-23 北京航空航天大学 Business events process synergic modeling method based on role authority control
CN107222865A (en) * 2017-04-28 2017-09-29 北京大学 The communication swindle real-time detection method and system recognized based on suspicious actions

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150294246A1 (en) * 2014-04-10 2015-10-15 International Business Machines Corporation Selecting optimal training data set for service contract prediction
CN106127380A (en) * 2016-06-22 2016-11-16 北京拓明科技有限公司 A kind of big data risk analysis method
CN106503863A (en) * 2016-11-10 2017-03-15 北京红马传媒文化发展有限公司 Based on the Forecasting Methodology of the age characteristicss of decision-tree model, system and terminal
CN107730087A (en) * 2017-09-20 2018-02-23 平安科技(深圳)有限公司 Forecast model training method, data monitoring method, device, equipment and medium
CN107742193B (en) * 2017-11-28 2019-08-27 江苏大学 A kind of driving Risk Forecast Method based on time-varying state transition probability Markov chain

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126434A (en) * 2019-11-19 2020-05-08 山东省科学院激光研究所 Automatic microseism first arrival time picking method and system based on random forest
CN111695824A (en) * 2020-06-16 2020-09-22 深圳前海微众银行股份有限公司 Risk tail end client analysis method, device, equipment and computer storage medium
CN111695824B (en) * 2020-06-16 2024-03-29 深圳前海微众银行股份有限公司 Method, device, equipment and computer storage medium for analyzing risk tail end customer
CN112184241B (en) * 2020-09-27 2024-02-20 ***股份有限公司 Identity authentication method and device
CN112184241A (en) * 2020-09-27 2021-01-05 ***股份有限公司 Identity authentication method and device
CN112749924A (en) * 2021-02-01 2021-05-04 深圳无域科技技术有限公司 Wind control model training method, system, equipment and computer readable medium
CN112508698A (en) * 2021-02-07 2021-03-16 北京淇瑀信息科技有限公司 User policy triggering method and device and electronic equipment
CN112508698B (en) * 2021-02-07 2024-04-26 北京淇瑀信息科技有限公司 User policy triggering method and device and electronic equipment
CN113673866A (en) * 2021-08-20 2021-11-19 上海寻梦信息技术有限公司 Crop decision method, model training method and related equipment
CN113837635A (en) * 2021-09-29 2021-12-24 支付宝(杭州)信息技术有限公司 Risk detection processing method, device and equipment
CN115346665B (en) * 2022-10-19 2023-03-10 南昌大学第二附属医院 Method, system and equipment for constructing retinopathy incidence risk prediction model
CN115346665A (en) * 2022-10-19 2022-11-15 南昌大学第二附属医院 Method, system and equipment for constructing retinopathy incidence risk prediction model
CN116579448A (en) * 2022-12-26 2023-08-11 北京码牛科技股份有限公司 Personnel contamination risk prediction method, system, intelligent terminal and storage medium

Also Published As

Publication number Publication date
CN108549954B (en) 2022-08-02
CN108549954A (en) 2018-09-18

Similar Documents

Publication Publication Date Title
WO2019184119A1 (en) Risk model training method and apparatus, risk identification method and apparatus, device, and medium
WO2021032219A2 (en) Method and system for disease classification coding based on deep learning, and device and medium
WO2020220544A1 (en) Unbalanced data classification model training method and apparatus, and device and storage medium
CN107292330B (en) Iterative label noise identification algorithm based on double information of supervised learning and semi-supervised learning
CN113011319A (en) Multi-scale fire target identification method and system
CN113326377B (en) Name disambiguation method and system based on enterprise association relationship
CN115699208A (en) Artificial Intelligence (AI) method for cleaning data to train AI models
CN113268370B (en) Root cause alarm analysis method, system, equipment and storage medium
CN105306296A (en) Data filter processing method based on LTE (Long Term Evolution) signaling
CN109409426A (en) A kind of extreme value gradient promotion logistic regression classification prediction technique
Wang et al. Application of fuzzy cluster analysis for medical image data mining
CN113674862A (en) Acute renal function injury onset prediction method based on machine learning
CN113283590A (en) Defense method for backdoor attack
WO2023201772A1 (en) Cross-domain remote sensing image semantic segmentation method based on adaptation and self-training in iteration domain
CN114743037A (en) Deep medical image clustering method based on multi-scale structure learning
Manna et al. Bird image classification using convolutional neural network transfer learning architectures
CN117079017A (en) Credible small sample image identification and classification method
CN111768803B (en) General audio steganalysis method based on convolutional neural network and multitask learning
CN116977725A (en) Abnormal behavior identification method and device based on improved convolutional neural network
CN115035966B (en) Superconductor screening method, device and equipment based on active learning and symbolic regression
CN117152528A (en) Insulator state recognition method, insulator state recognition device, insulator state recognition apparatus, insulator state recognition program, and insulator state recognition program
Marconi et al. Hyperbolic manifold regression
CN116152194A (en) Object defect detection method, system, equipment and medium
Xu et al. Hybrid label noise correction algorithm for medical auxiliary diagnosis
Wu et al. Deep learning in automatic fingerprint identification

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18912438

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 21/01/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18912438

Country of ref document: EP

Kind code of ref document: A1