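WO2019184118A1 - Risk model training method, risk identification method, apparatus, device and medium - Google Patents

Risk model training method, risk identification method, apparatus, device and medium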

Info

Publication number
WO2019184118A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
training data
risk
risk model
data
Prior art date
Application number
PCT/CN2018/094178
Other languages
English (en)
French (fr)
Inventor
金戈
徐亮
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019184118A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0248 Avoiding fraud
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance

Definitions

  • The present application relates to the field of data prediction, and in particular to a risk model training method, a risk identification method, an apparatus, a device, and a medium.
  • The embodiments of the present application provide a risk model training method, a risk identification method, an apparatus, a device, and a medium, to solve the problem that the industry currently has no risk model for identifying publicity information uploaded to public social platforms.
  • the embodiment of the present application provides a risk model training method, including:
  • the target training data is trained by using a conditional random field algorithm to obtain a target risk model.
  • the embodiment of the present application provides a risk model training apparatus, including:
  • the original training data acquiring module is configured to acquire original training data of at least two organizations, and each original training data is associated with the organization identifier;
  • a positive and negative sample acquisition module, configured to divide the original training data in equal proportion according to the organization identifier, and obtain positive and negative samples;
  • a target training data obtaining module, configured to perform text vectorization processing on the positive and negative samples, and obtain target training data represented by vectors;
  • the target risk model acquisition module is configured to train the target training data by using a conditional random field algorithm to obtain a target risk model.
  • An embodiment of the present application provides a computer device including a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer readable instructions:
  • the target training data is trained by using a conditional random field algorithm to obtain a target risk model.
  • Embodiments of the present application provide one or more non-volatile readable storage media storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • the target training data is trained by using a conditional random field algorithm to obtain a target risk model.
  • the embodiment of the present application provides a risk identification method, including:
  • if the risk identification probability is greater than the preset probability, determining that the to-be-identified data is high-risk data.
  • the embodiment of the present application provides a risk identification apparatus, including:
  • a to-be-identified data acquisition module, configured to acquire data to be identified corresponding to the organization identifier;
  • a risk identification probability obtaining module, configured to input the data to be identified into a target risk model corresponding to the organization identifier, and obtain a risk identification probability, wherein the target risk model is a model obtained by training with the risk model training method of the first aspect;
  • the high-risk data determining module is configured to determine that the to-be-identified data is high-risk data if the risk identification probability is greater than a preset probability.
  • An embodiment of the present application provides a computer device including a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer readable instructions:
  • if the risk identification probability is greater than the preset probability, determining that the to-be-identified data is high-risk data.
  • Embodiments of the present application provide one or more non-volatile readable storage media storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • if the risk identification probability is greater than the preset probability, determining that the to-be-identified data is high-risk data.
  • FIG. 1 is a flowchart of a risk model training method provided in Embodiment 1 of the present application.
  • FIG. 2 is a detailed schematic diagram of step S13 in FIG. 1.
  • FIG. 3 is a detailed schematic diagram of step S132 in FIG. 2.
  • FIG. 4 is a detailed schematic diagram of step S14 in FIG. 1.
  • FIG. 5 is a schematic block diagram of a risk model training apparatus provided in Embodiment 2 of the present application.
  • FIG. 6 is a flowchart of a risk identification method provided in Embodiment 3 of the present application.
  • FIG. 7 is a schematic block diagram of a risk identification device provided in Embodiment 4 of the present application.
  • FIG. 8 is a schematic diagram of a computer device provided in Embodiment 6 of the present application.
  • FIG. 1 shows a flow chart of a risk model training method in this embodiment.
  • The risk model training method is applied on a social platform, or in a financial institution such as a bank, a securities firm, or an insurer, or in another institution that needs to perform risk identification. It is used to train a risk model for a specific domain, so that risk identification can be performed on publicity information in that domain published on the social platform or on the institution's internal communication platform, allowing the institution to locate the source of risk itself.
  • the risk model training method includes the following steps:
  • S11 Acquire original training data of at least two institutions, and each original training data is associated with an organization identifier.
  • the original training data includes, but is not limited to, corpus data in a specific domain corpus.
  • the specific field in this embodiment refers specifically to the field of insurance, and the domain-specific corpus specifically refers to a text library with the theme of insurance business.
  • Corpus data refers to language material that has actually appeared in real use of the language.
  • The organization identifier uniquely identifies the organization to which data belongs; organization identifiers include a target organization identifier and non-target organization identifiers.
  • The target organization identifier in this embodiment is the identifier of the organization for which the risk model is to be trained, that is, the target organization.
  • The original training data includes corpus data of the target organization and corpus data of non-target organizations.
  • For example, if the acquired corpus data of Ping An Insurance is the corpus data of the target organization, then the corpus data of life insurance companies or other insurance institutions is corpus data of non-target organizations. The corpus data of non-insurance institutions (such as banks) can also be used as corpus data of non-target organizations.
  • Associating each piece of original training data with an organization identifier allows the original training data to be divided by organization identifier in the next step, which supports model training.
  • the positive sample refers to the original training data carrying the target organization identifier
  • the negative sample refers to the original training data carrying the non-target organization identifier.
  • The original training data is divided by organization identifier in equal proportion (1:1); that is, the original training data corresponding to the target organization identifier and the original training data corresponding to the non-target organization identifiers are taken in equal amounts, yielding the same number of positive and negative samples.
  • Balanced positive and negative samples help prevent the model from over-fitting, so that the risk model trained on them identifies risk more accurately.
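  • The equal-proportion division can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `split_balanced`, the `(org_id, text)` record layout, the sample records, and the use of random down-sampling to reach a 1:1 ratio are all assumptions made for the example.

```python
import random

def split_balanced(records, target_org_id, seed=0):
    """Split (org_id, text) records into equally sized positive/negative samples.

    Positive samples carry the target organization identifier; negative
    samples carry any other identifier. The larger group is randomly
    down-sampled (an assumption) so that the ratio is 1:1.
    """
    positives = [text for org, text in records if org == target_org_id]
    negatives = [text for org, text in records if org != target_org_id]
    n = min(len(positives), len(negatives))
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return rng.sample(positives, n), rng.sample(negatives, n)

# Hypothetical records: (organization identifier, corpus text)
records = [
    ("org_A", "insurance period 1 year"),
    ("org_A", "premium paid annually"),
    ("org_B", "fixed deposit rate"),
    ("org_B", "loan approval notice"),
    ("org_C", "fund subscription fee"),
]
positive, negative = split_balanced(records, target_org_id="org_A")
```

  • Here two positive samples are paired with two of the three available negative samples; which negatives are kept depends on the seed.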
  • S13 Perform text vectorization processing on the positive and negative samples to obtain target training data represented by vectorization.
  • Text vectorization processing refers to representing text as vectors. Because a model cannot compute directly on characters or words, the original training data must be vectorized before training to obtain target training data represented by vectors, on which the risk model is then trained.
  • the target risk model is a model with high accuracy obtained by training the target training data with the conditional random field algorithm.
  • the target risk model is associated with an organization identifier, so that when the target risk model is subsequently used for risk identification, the corresponding target risk model can be obtained based on the organization identifier query.
  • The conditional random field (CRF) algorithm models the conditional probability distribution of a set of output random variables given a set of input random variables, under the assumption that the output random variables form a Markov random field.
  • The conditional random field has the advantages of a discriminative model and, like a generative model, takes into account the transition probabilities between context labels. It performs global parameter optimization and decoding over the whole sequence, and it avoids the label bias problem that other discriminative models find hard to avoid.
  • A discriminative model directly models the conditional probability p(y|x).
  • A generative model models the joint distribution p(x, y) of x and y.
  • In this embodiment, original training data of at least two organizations is first acquired, each piece associated with an organization identifier, so that the original training data can be divided in equal proportion by organization identifier to obtain positive and negative samples. This helps prevent over-fitting during training, so that the resulting risk model identifies risk more accurately. Text vectorization is then performed on the positive and negative samples to obtain target training data represented by vectors, which reduces the amount of computation and improves the efficiency of model training.
  • Finally, the conditional random field algorithm is used to train the target training data to obtain the target risk model. The target risk model thereby gains the advantage of a generative model (it takes into account the transition probabilities between context labels) and avoids the label bias problem that other discriminative models find hard to avoid, which improves recognition accuracy.
  • In step S13, performing text vectorization processing on the positive and negative samples to obtain the target training data represented by vectors specifically includes the following steps:
  • S131 Perform word segmentation and stop-word removal on the positive and negative samples by using the jieba word segmentation tool to obtain at least one word.
  • stop word processing refers to the process of automatically filtering out some stop words before or after processing natural language data in order to save storage space and improve search efficiency in information retrieval.
  • Word segmentation refers to the process of segmenting words in a sentence according to a dictionary.
  • the word is the word element obtained after the word segmentation of the positive and negative samples.
  • the positive sample is the original training data corresponding to the target organization identification
  • the negative sample is the original training data corresponding to the non-target organization identification.
  • The original training data may contain Chinese and/or English text.
  • Chinese and English characters are segmented differently, so the Chinese and English parts of the original training data must be distinguished before word segmentation.
  • Methods for distinguishing Chinese from English in the original training data include, but are not limited to, regular expressions.
  • A regular expression is a logical formula for operating on strings: predefined characters and combinations of them form a "rule string" that expresses a filtering logic for strings.
  • Chinese and English are distinguished with regular expressions as follows: the regular expression matching Chinese characters is [\u4e00-\u9fa5], and the regular expression matching English characters is [a-zA-Z].
  • Applying these two regular expressions to the original training data yields the distinguished text (Chinese characters and English characters), so that the later word segmentation can be performed quickly, improving the efficiency of model training.
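  • As a minimal sketch of this step, the two character classes can be applied with Python's standard `re` module. The function name `split_by_script` and the sample sentence are assumptions made for the example; the `+` quantifier is added so that each match is a run of characters rather than a single character.

```python
import re

# Character classes from the text: CJK range for Chinese, ASCII letters for English.
CHINESE = re.compile(r"[\u4e00-\u9fa5]+")
ENGLISH = re.compile(r"[a-zA-Z]+")

def split_by_script(text):
    """Return (runs of Chinese characters, runs of English letters) found in text."""
    return CHINESE.findall(text), ENGLISH.findall(text)

chinese_runs, english_runs = split_by_script("保险期限 one year 保费")
print(chinese_runs)  # ['保险期限', '保费']
print(english_runs)  # ['one', 'year']
```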
  • Methods for performing word segmentation on the positive and negative samples include, but are not limited to, segmenting their Chinese characters with the jieba word segmentation tool.
  • The jieba ("stuttering") word segmentation tool is a commonly used Chinese text analysis tool that extracts the words in a sentence one by one with high accuracy and efficiency.
  • A stop word dictionary is configured in the jieba word segmentation tool, and stop-word removal can be performed on the positive and negative samples based on it to exclude the interference of stop words (such as "I", "one", "down"), reducing the amount of computation and improving the efficiency of model training.
  • Because the jieba word segmentation tool segments Chinese characters, the English characters can first be mapped to Chinese characters by using a pre-stored Chinese-English comparison table, and the resulting Chinese characters are then segmented with the word segmentation tool.
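  • To illustrate dictionary-based segmentation followed by stop-word removal without depending on a third-party package, the sketch below uses a simple forward maximum matching segmenter as a stand-in for the word segmentation tool. It is not the tool's actual algorithm; the functions `segment` and `remove_stop_words`, the tiny dictionary, and the stop word list are all hypothetical.

```python
def segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position take the longest dictionary word.

    Characters that start no dictionary word are emitted as single-character tokens.
    """
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

def remove_stop_words(words, stop_words):
    """Drop every token that appears in the stop word dictionary."""
    return [w for w in words if w not in stop_words]

DICTIONARY = {"保险", "期限", "一年"}   # hypothetical domain dictionary
STOP_WORDS = {"的", "我"}               # hypothetical stop word dictionary

tokens = segment("我的保险期限一年", DICTIONARY)
words = remove_stop_words(tokens, STOP_WORDS)
print(words)  # ['保险', '期限', '一年']
```

  • In practice a dedicated segmenter would replace `segment`; the stop-word filter is independent of which segmenter produces the tokens.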
  • S132 Perform vectorization processing on at least one word to obtain target training data represented by vectorization.
  • the target training data is text data obtained by vectorizing at least one word.
  • Specifically, the TF-IDF algorithm is used to calculate the weight of each word in the original training data, and each weight is used as one dimension of a vector, giving a vectorized representation of the at least one word, that is, the target training data.
  • This accelerates model training.
  • In this embodiment, the jieba word segmentation tool is used to perform word segmentation and stop-word removal on the positive and negative samples to obtain at least one word, which improves the accuracy and training efficiency of the model.
  • The English characters that were distinguished can be mapped through the Chinese-English comparison table to obtain the corresponding Chinese characters, which are then segmented with the word segmentation tool, improving the generalization ability of the model.
  • Finally, the at least one word is vectorized to obtain the target training data, which serves as the input for subsequent risk model training.
  • In step S132, performing vectorization processing on the at least one word to obtain the target training data represented by vectors specifically includes the following steps:
  • S1321 Operate on the at least one word by using the TF-IDF algorithm to obtain the word frequency corresponding to each word.
  • the TF-IDF (term frequency–inverse document frequency) algorithm is a commonly used weighting algorithm for information retrieval and data mining, which has the advantages of simple calculation and high efficiency.
  • Specifically, each word is processed with the TF-IDF algorithm to obtain how often it occurs in the original training data, that is, its word frequency.
  • The calculation formula is $T = \frac{u}{U}$, where u is the number of occurrences of the word in the original training data, U is the total number of words in the original training data, and T is the word frequency.
  • Computing the word frequency of each word in this way is simple, which helps improve the training efficiency of the risk model.
  • The word frequency corresponding to each word is taken as one dimension of a vector, and the target training data represented by the vector is acquired.
  • For example, if the original training data is "insurance period - 1 year", the words obtained after segmentation are "insurance", "term", and "1 year". Assuming step S1321 gives these words the word frequencies 0.2, 0.3, and 0.4, the target training data obtained by vectorizing them is (0.2, 0.3, 0.4), which is input to the model for training.
  • In this embodiment, the TF-IDF algorithm is first used to obtain the word frequency of each word in the original training data; the calculation is simple, which helps improve the training efficiency of the risk model. The word frequency of each word is then taken as one dimension of a vector to obtain the target training data represented by the vector, which is input to the model for training, further improving training efficiency.
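  • The word-frequency vectorization described above can be sketched as follows. The sketch computes only the word frequency T as the number of occurrences u divided by the total number of words U, as defined in the text; it does not include an inverse-document-frequency factor. The function name `term_frequency_vector`, the alphabetical ordering of dimensions, and the sample words are assumptions made for the example.

```python
from collections import Counter

def term_frequency_vector(words):
    """Return (vocabulary, vector) where vector[i] = u / U for vocabulary[i].

    u is the number of occurrences of the word and U is the total number
    of words; dimensions are ordered alphabetically (an assumption).
    """
    counts = Counter(words)
    total = len(words)
    vocabulary = sorted(counts)
    return vocabulary, [counts[w] / total for w in vocabulary]

vocabulary, vector = term_frequency_vector(["insurance", "term", "insurance", "year"])
print(vocabulary)  # ['insurance', 'term', 'year']
print(vector)      # [0.5, 0.25, 0.25]
```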
  • In step S14, training the target training data with the conditional random field algorithm to obtain the target risk model specifically includes the following steps:
  • S141 Calculate on the target training data by using a maximum likelihood estimation algorithm to obtain an original risk model.
  • Maximum likelihood estimation uses the results of known samples and, given a chosen model, works backward to the model parameter values most likely to have produced those results. Because the algorithm makes use of the form of the distribution function, it can achieve high estimation accuracy.
  • The conditional random field model is $P(y \mid x) = \frac{1}{Z(x)} \exp\left(\sum_k w_k f_k(y, x)\right)$, where $w_k$ represents the weight of the feature function $f_k$ and $Z(x) = \sum_{y'} \exp\left(\sum_k w_k f_k(y', x)\right)$ represents the normalization factor.
  • The above formula gives the conditional probability of the output sequence y (i.e., the organization identifier) for a given input sequence x (i.e., corpus data in the target training data).
  • $f_k$ represents a feature function, which usually takes the value 1 or 0: 1 when the feature condition is satisfied, and 0 otherwise.
  • The maximum likelihood estimation algorithm is used to estimate the model parameters of the conditional random field. The logarithm of the conditional random field model is taken to obtain the likelihood function, in which $f_k$ denotes a feature function and $\lambda_k$ denotes the weight corresponding to that feature function, that is, the parameter $w_k$ in the conditional random field model.
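  • As a rough illustration of how feature functions, weights, and the normalization factor combine, the sketch below evaluates a conditional probability in the standard log-linear form, exp(sum of w_k * f_k(y, x)) divided by Z(x). Because the output here is a single label (the organization identifier) rather than a label sequence, this reduces to the simplest case of the model. The function name `crf_probability`, the two feature functions, the weights, and the labels are all hypothetical.

```python
import math

def crf_probability(x, y, labels, feature_functions, weights):
    """P(y | x) = exp(sum_k w_k * f_k(y, x)) / Z(x) over a finite label set."""
    def score(label):
        return sum(w * f(label, x) for w, f in zip(weights, feature_functions))
    z = sum(math.exp(score(label)) for label in labels)  # normalization factor Z(x)
    return math.exp(score(y)) / z

# Hypothetical binary feature functions: 1 if the condition holds, else 0.
features = [
    lambda y, x: 1 if y == "target" and "insurance" in x else 0,
    lambda y, x: 1 if y == "non_target" and "loan" in x else 0,
]
weights = [1.5, 1.0]                 # hypothetical w_k values
labels = ["target", "non_target"]

x = ["insurance", "term"]
p_target = crf_probability(x, "target", labels, features, weights)
p_other = crf_probability(x, "non_target", labels, features, weights)
```

  • For this input only the first feature fires, so the "target" label receives the higher probability and the two probabilities sum to 1.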
  • S142 The original risk model is optimized by using a gradient descent algorithm to obtain a target risk model.
  • Gradient descent is also known as the steepest descent algorithm.
  • Specifically, the gradient descent algorithm iteratively optimizes the original risk model by repeatedly taking derivatives until the loss function is minimized; the model parameter values at that point are the required model parameters.
  • The target risk model is obtained based on these model parameters.
  • The likelihood function from step S141 is differentiated with respect to the model parameters. The formula includes a regularization term, a penalty function that "punishes" the model vector to avoid over-fitting.
  • The regularization term essentially encodes prior information.
  • Optimizing the model parameters of the original risk model with the gradient descent algorithm yields the target risk model; the gradient descent algorithm is simple and easy to implement.
  • In this embodiment, the logarithm of the conditional random field model is first taken to obtain the likelihood function, and the maximum likelihood estimation algorithm is then used to estimate the model parameters. Because maximum likelihood estimation makes use of the form of the distribution function, it achieves high estimation accuracy, which improves the accuracy of the risk model.
  • The gradient descent algorithm is then used to optimize the model parameters of the original risk model to obtain the target risk model, which simplifies the model calculation and improves the efficiency of model training.
  • In the risk model training method provided by this embodiment, original training data of at least two organizations is first acquired, each piece associated with an organization identifier, so that the original training data can be divided in equal proportion by organization identifier to obtain positive and negative samples. This helps prevent over-fitting, so that the resulting risk model identifies risk more accurately. The jieba word segmentation tool is then used to perform word segmentation and stop-word removal on the positive and negative samples to obtain at least one word, improving the accuracy and training efficiency of the model.
  • The English characters that were distinguished can be mapped through the Chinese-English comparison table to obtain the corresponding Chinese characters, which are then segmented with the jieba word segmentation tool, improving the generalization ability of the model.
  • The TF-IDF algorithm is used to obtain the word frequency of each word in the original training data; the calculation is simple, which helps improve the training efficiency of the risk model.
  • The word frequency of each word is taken as one dimension of a vector to obtain the target training data represented by the vector, which serves as the input for subsequent risk model training and further improves training efficiency.
  • Finally, the conditional random field algorithm is used to train the target training data to obtain the target risk model. The target risk model thereby gains the advantage of a generative model, namely taking into account the transition probabilities between context labels, and avoids the label bias problem that other discriminative models find hard to avoid, improving recognition accuracy.
  • FIG. 5 is a schematic block diagram of the risk model training apparatus corresponding to the risk model training method of Embodiment 1.
  • the risk model training device includes an original training data acquisition module 11, a positive and negative sample acquisition module 12, a target training data acquisition module 13, and a target risk model acquisition module 14.
  • The functions implemented by the original training data acquisition module 11, the positive and negative sample acquisition module 12, the target training data acquisition module 13, and the target risk model acquisition module 14 correspond to the steps of the risk model training method in Embodiment 1; to avoid redundancy, they are not described in detail in this embodiment.
  • the original training data obtaining module 11 is configured to acquire original training data of at least two organizations, and each original training data is associated with an organization identifier.
  • the positive and negative sample acquisition module 12 is configured to divide the original training data in equal proportion according to the organization identifier, and obtain positive and negative samples.
  • the target training data obtaining module 13 is configured to perform text vectorization processing on the positive and negative samples to obtain target training data represented by the vectorization.
  • the target risk model obtaining module 14 is configured to train the target training data by using a conditional random field algorithm to obtain a target risk model.
  • the target training data acquisition module 13 includes a word acquisition unit 131 and a target training data acquisition unit 132.
  • the word acquisition unit 131 is configured to perform word segmentation and stop-word removal on the positive and negative samples by using the jieba word segmentation tool to obtain at least one word.
  • the target training data acquiring unit 132 is configured to perform vectorization processing on at least one word to obtain target training data represented by the vectorization.
  • the target training data acquisition unit 132 includes a word frequency acquisition sub-unit 1321 and a target training data acquisition sub-unit 1322.
  • the word frequency acquisition sub-unit 1321 is configured to operate on the at least one word by using the TF-IDF algorithm to obtain a word frequency corresponding to each word.
  • the target training data acquisition sub-unit 1322 is configured to take the word frequency corresponding to each word as one dimension of a vector and obtain the target training data represented by the vector.
  • the target risk model acquisition module 14 includes an original risk model acquisition unit 141 and a target risk model acquisition unit 142.
  • the original risk model obtaining unit 141 is configured to calculate on the target training data by using a maximum likelihood estimation algorithm to obtain an original risk model.
  • the target risk model obtaining unit 142 is configured to optimize the original risk model by using a gradient descent algorithm to obtain a target risk model.
  • FIG. 6 is a flowchart of the risk identification method in this embodiment.
  • The risk identification method is applied on a social platform, or in a financial institution such as a bank, a securities firm, or an insurer, or in another institution that needs to perform risk identification. It uses the target risk model to perform risk identification on publicity information in a specific domain that users publish on the social platform or on the institution's internal communication platform, so that the source of risk can be located independently.
  • The risk identification method includes the following steps:
  • S21 Acquire data to be identified corresponding to the organization identifier, where the data to be identified is associated with the user ID.
  • The data to be identified is real-time data, collected by a crawler tool from a social platform or the organization's internal communication platform, that is to be checked for risk.
  • The user ID uniquely identifies a user; it may be the account with which the user logs into the social platform or the organization's internal communication platform.
  • In this embodiment, the data to be identified is data related to the insurance field. Specifically, the data to be identified corresponding to the organization identifier is obtained and is associated with the user ID; that is, data the user publishes publicly on the social platform or the organization's internal communication platform is the data to be identified. The target risk model corresponding to the organization identifier is then called to identify the data and determine its risk.
  • Specifically, a crawler tool may be used to crawl the data published on the social platform or the organization's internal communication platform to obtain the data to be identified associated with the organization identifier.
  • the crawler tool includes, but is not limited to, a ForeSpider data acquisition software.
  • ForeSpider data acquisition software is a visual universal crawler software that can be acquired through a simple two-step configuration operation. The software also comes with a free database that can be collected directly into the warehouse.
  • There is a built-in browser in ForeSpider You can log in by entering the account and password at the end of the browser. You can also set up automatic login to log in automatically when you crawl the next time, and get the data to be identified in real time to achieve real-time risk control.
  • S22: Input the data to be identified into the target risk model for identification, and obtain a risk identification probability.
  • In this embodiment, the data to be identified is input into the target risk model corresponding to the organization identifier; the model performs calculation on the input and outputs the risk identification probability.
  • The risk identification probability may be a real number between 0 and 1.
  • S23: If the risk identification probability is greater than a preset probability, determine that the data to be identified is high-risk data.
  • The preset probability is a threshold set in advance for evaluating whether the data to be identified associated with the user is risky.
  • The risk identification probability that the target risk model outputs for the data to be identified is compared with the preset probability. If it is greater than the preset probability, the data to be identified is determined to be high-risk data; if it is less than or equal to the preset probability, the data to be identified is low-risk data.
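The comparison against the preset probability can be sketched in a few lines of Python. This is only an illustration: the function name `classify_risk`, the label strings, and the default threshold of 0.5 are assumptions and do not come from the application.

```python
def classify_risk(risk_probability: float, preset_probability: float = 0.5) -> str:
    """Label data as high-risk only when its probability exceeds the preset one.

    The default threshold of 0.5 is a placeholder; the application does not
    state a concrete value.
    """
    if not 0.0 <= risk_probability <= 1.0:
        raise ValueError("risk_probability must lie in [0, 1]")
    # Strictly greater than the threshold means high risk; equality is low risk.
    return "high-risk" if risk_probability > preset_probability else "low-risk"
```

A probability exactly equal to the preset probability falls on the low-risk side, matching the "less than or equal to" wording of the step.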
  • Further, the data to be identified is associated with the user ID, and the user ID is associated with the organization identifier. If the data to be identified is determined to be high-risk data, the user is a high-risk user, that is, a user at high risk of leaving. For example, suppose the user is an employee of a bank, securities firm, insurer, or other institution corresponding to the organization identifier, and the employee publishes data to be identified on the social platform or the internal communication platform using the user ID as the login account.
  • The target risk model corresponding to the organization identifier then identifies that data to determine whether it is genuine corpus data of the target organization corresponding to the organization identifier.
  • If so, the employee is spreading promotional information (that is, the data to be identified) of the target organization and is not a user at high risk of leaving.
  • If not, the employee is spreading promotional information (that is, the data to be identified) of a non-target organization (that is, another organization). This can be used to judge whether the employee intends to change jobs, and the employee is treated as a user at high risk of leaving, which facilitates internal personnel management.
  • In this embodiment, a crawler tool first crawls public data in real time to obtain the data to be identified associated with the organization identifier, which achieves real-time risk control. The data to be identified is then processed by the target risk model to obtain the risk identification probability. Finally, the risk identification probability is evaluated: if it is greater than the preset probability, the data to be identified is high-risk data. This makes it possible to identify the risk of data that users spread on public social platforms or on the organization's internal communication platform.
  • Fig. 7 is a schematic block diagram of the risk identification device corresponding one-to-one to the risk identification method in Embodiment 3.
  • As shown in Fig. 7, the risk identification device includes a to-be-identified data acquisition module 21, a risk identification probability acquisition module 22, and a high-risk data determination module 23.
  • The functions implemented by the to-be-identified data acquisition module 21, the risk identification probability acquisition module 22, and the high-risk data determination module 23 correspond one-to-one to the steps of the risk identification method in Embodiment 3. To avoid redundancy, they are not described in detail in this embodiment.
  • The to-be-identified data acquisition module 21 is configured to acquire data to be identified corresponding to the organization identifier.
  • The risk identification probability acquisition module 22 is configured to input the data to be identified into the target risk model corresponding to the organization identifier to obtain a risk identification probability, where the target risk model is a model obtained by training with the risk model training method in Embodiment 1.
  • The high-risk data determination module 23 is configured to determine that the data to be identified is high-risk data if the risk identification probability is greater than a preset probability.
  • This embodiment provides one or more non-volatile readable storage media storing computer-readable instructions.
  • When the computer-readable instructions are executed by one or more processors, the one or more processors implement the risk model training method in Embodiment 1. To avoid repetition, details are not described here again.
  • Alternatively, when the computer-readable instructions are executed by one or more processors, the one or more processors implement the functions of the modules/units of the risk model training device in Embodiment 2; or implement the risk identification method in Embodiment 3; or implement the functions of the modules/units of the risk identification device in Embodiment 4. To avoid repetition, details are not described here again.
  • Fig. 8 is a schematic diagram of a computer device according to an embodiment of the present application.
  • As shown in Fig. 8, the computer device 80 of this embodiment includes a processor 81, a memory 82, and computer-readable instructions 83 stored in the memory 82 and executable on the processor 81.
  • When the processor 81 executes the computer-readable instructions 83, the steps of the risk model training method in Embodiment 1 are implemented. To avoid repetition, details are not described here.
  • Alternatively, when the processor 81 executes the computer-readable instructions 83, the functions of the modules/units of the risk model training device in Embodiment 2 are implemented; or the steps of the risk identification method in Embodiment 3 are implemented; or the functions of the modules/units of the risk identification device in Embodiment 4 are implemented. To avoid repetition, details are not described here.

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Marketing (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Development Economics (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Game Theory and Decision Science (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Educational Administration (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Machine Translation (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

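The present application discloses a risk model training method, a risk identification method, an apparatus, a device, and a medium. The risk model training method includes: acquiring original training data of at least two organizations, each piece of original training data being associated with an organization identifier; dividing the original training data in equal proportion based on the organization identifier to obtain positive and negative samples; performing text vectorization on the positive and negative samples to obtain vectorized target training data; and training on the target training data with a conditional random field algorithm to obtain a target risk model. The risk model training method effectively addresses the current inability in the industry to identify the safety of data published by users on public platforms.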

Description

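Risk Model Training Method, Risk Identification Method, Apparatus, Device, and Medium
This patent application is based on, and claims priority to, Chinese invention patent application No. 201810250165.2, filed on March 26, 2018 and entitled "Risk Model Training Method, Risk Identification Method, Apparatus, Device, and Medium".
Technical Field
The present application relates to the field of data prediction, and in particular to a risk model training method, a risk identification method, an apparatus, a device, and a medium.
Background
With the development of Internet technology, more and more users publish or spread promotional information on public social platforms, for example advertisements promoting a business. Because public social platforms cannot review the promotional information that users upload, the risk of promotional information spread through them cannot be estimated; that is, its authenticity cannot be assessed, and other users who trust its description and act on it may suffer property loss. For example, salesperson A of an insurance company may publish an advertisement for an insurance product on a public social platform to attract customers to buy it. If the advertisement uploaded by salesperson A is false and customer B buys the insurance on the basis of it, customer B may suffer property loss. At present there is no risk model in the industry for identifying risk in a specific field (such as insurance), so the risk of promotional information on public social platforms cannot be identified, and promotional information spread there may cause other users property loss.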
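Summary
Embodiments of the present application provide a risk model training method, a risk identification method, an apparatus, a device, and a medium, to solve the problem that there is currently no risk model in the industry for identifying promotional information uploaded to public social platforms.
An embodiment of the present application provides a risk model training method, including:
acquiring original training data of at least two organizations, each piece of original training data being associated with an organization identifier;
dividing the original training data in equal proportion based on the organization identifier to obtain positive and negative samples;
performing text vectorization on the positive and negative samples to obtain vectorized target training data;
training on the target training data with a conditional random field algorithm to obtain a target risk model.
An embodiment of the present application provides a risk model training device, including:
an original training data acquisition module, configured to acquire original training data of at least two organizations, each piece of original training data being associated with an organization identifier;
a positive and negative sample acquisition module, configured to divide the original training data in equal proportion based on the organization identifier to obtain positive and negative samples;
a target training data acquisition module, configured to perform text vectorization on the positive and negative samples to obtain vectorized target training data;
a target risk model acquisition module, configured to train on the target training data with a conditional random field algorithm to obtain a target risk model.
An embodiment of the present application provides a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
acquiring original training data of at least two organizations, each piece of original training data being associated with an organization identifier;
dividing the original training data in equal proportion based on the organization identifier to obtain positive and negative samples;
performing text vectorization on the positive and negative samples to obtain vectorized target training data;
training on the target training data with a conditional random field algorithm to obtain a target risk model.
An embodiment of the present application provides one or more non-volatile readable storage media storing computer-readable instructions, where the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
acquiring original training data of at least two organizations, each piece of original training data being associated with an organization identifier;
dividing the original training data in equal proportion based on the organization identifier to obtain positive and negative samples;
performing text vectorization on the positive and negative samples to obtain vectorized target training data;
training on the target training data with a conditional random field algorithm to obtain a target risk model.
An embodiment of the present application provides a risk identification method, including:
acquiring data to be identified corresponding to an organization identifier;
inputting the data to be identified into the target risk model corresponding to the organization identifier for identification to obtain a risk identification probability, the target risk model being a model obtained by training with the risk model training method of the first aspect;
if the risk identification probability is greater than a preset probability, determining that the data to be identified is high-risk data.
An embodiment of the present application provides a risk identification device, including:
a to-be-identified data acquisition module, configured to acquire data to be identified corresponding to an organization identifier;
a risk identification probability acquisition module, configured to input the data to be identified into the target risk model corresponding to the organization identifier for identification to obtain a risk identification probability, the target risk model being a model obtained by training with the risk model training method of the first aspect;
a high-risk data determination module, configured to determine that the data to be identified is high-risk data if the risk identification probability is greater than a preset probability.
An embodiment of the present application provides a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
acquiring data to be identified corresponding to an organization identifier;
inputting the data to be identified into the target risk model corresponding to the organization identifier for identification to obtain a risk identification probability, the target risk model being a model obtained by training with the risk model training method described above;
if the risk identification probability is greater than a preset probability, determining that the data to be identified is high-risk data.
An embodiment of the present application provides one or more non-volatile readable storage media storing computer-readable instructions, where the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
acquiring data to be identified corresponding to an organization identifier;
inputting the data to be identified into the target risk model corresponding to the organization identifier for identification to obtain a risk identification probability, the target risk model being a model obtained by training with the risk model training method described above;
if the risk identification probability is greater than a preset probability, determining that the data to be identified is high-risk data.
The details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the description, the drawings, and the claims.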
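Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the risk model training method provided in Embodiment 1 of the present application;
Fig. 2 is a detailed schematic diagram of step S13 in Fig. 1;
Fig. 3 is a detailed schematic diagram of step S132 in Fig. 2;
Fig. 4 is a detailed schematic diagram of step S14 in Fig. 1;
Fig. 5 is a schematic block diagram of the risk model training device provided in Embodiment 2 of the present application;
Fig. 6 is a flowchart of the risk identification method provided in Embodiment 3 of the present application;
Fig. 7 is a schematic block diagram of the risk identification device provided in Embodiment 4 of the present application;
Fig. 8 is a schematic diagram of the computer device provided in Embodiment 6 of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.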
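Embodiment 1
Fig. 1 shows a flowchart of the risk model training method in this embodiment. The risk model training method is applied on a social platform, or at a financial institution such as a bank, securities firm, or insurer, or at any other institution that needs to perform risk identification. It is used to train a risk model for a specific field, so that the model can perform risk identification on domain-specific promotional information that users publish through the social platform or the institution's internal communication platform, and risk sources can be located automatically. As shown in Fig. 1, the risk model training method includes the following steps:
S11: Acquire original training data of at least two organizations, where each piece of original training data is associated with an organization identifier.
The original training data includes, but is not limited to, corpus data from a domain-specific corpus. In this embodiment the specific field is insurance, and the domain-specific corpus is a text collection on the subject of insurance business. Corpus data is language material that has actually occurred in real language use. The organization identifier is a unique identifier for an organization's data; organization identifiers include a target organization identifier and non-target organization identifiers. The target organization identifier in this embodiment is the identifier of the organization for which a risk model needs to be trained, that is, the target organization. Specifically, the original training data includes corpus data of the target organization and corpus data of non-target organizations. For example, when a risk model is to be trained for Ping An Insurance (平安保险), the acquired Ping An Insurance corpus data is the target organization's corpus data, while corpus data of 人寿保险 (Life Insurance) or other insurers is non-target corpus data. Understandably, corpus data of non-insurance institutions (such as banks) can also serve as non-target corpus data. Associating each piece of original training data with an organization identifier allows the data to be divided by organization identifier later, which supports model training.
S12: Divide the original training data in equal proportion based on the organization identifier to obtain positive and negative samples.
A positive sample is original training data carrying the target organization identifier; a negative sample is original training data carrying a non-target organization identifier. In this embodiment, the original training data is divided in equal proportion (1:1) according to the organization identifier, that is, the data for the target organization identifier and the data for non-target organization identifiers are taken in equal proportion to obtain the positive and negative samples. This helps prevent overfitting during training, so that the risk model trained on these samples identifies risk more accurately.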
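The 1:1 division of S12 can be sketched as follows. This is a minimal illustration under stated assumptions: records are `(organization_id, text)` pairs, the larger class is down-sampled at random to the size of the smaller one, and the name `split_samples` and the `seed` parameter are hypothetical. The application does not say how the equal proportion is enforced.

```python
import random
from typing import List, Sequence, Tuple


def split_samples(
    records: Sequence[Tuple[str, str]],
    target_org_id: str,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Return equally sized lists of positive and negative sample texts."""
    # Positive samples carry the target organization identifier.
    positives = [text for org_id, text in records if org_id == target_org_id]
    # Negative samples carry any other organization identifier.
    negatives = [text for org_id, text in records if org_id != target_org_id]
    # Down-sample both classes to the size of the smaller one (1:1 ratio).
    n = min(len(positives), len(negatives))
    rng = random.Random(seed)
    return rng.sample(positives, n), rng.sample(negatives, n)
```

Down-sampling is only one way to reach a 1:1 ratio; over-sampling the smaller class would satisfy the same wording.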
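S13: Perform text vectorization on the positive and negative samples to obtain vectorized target training data.
Text vectorization is the process of representing text as vectors. Because a model cannot compute directly on words or characters, the original training data must be vectorized before training to obtain vectorized target training data for risk model training.
S14: Train on the target training data with a conditional random field algorithm to obtain a target risk model.
The target risk model is the relatively high-accuracy model obtained by training on the target training data with the conditional random field algorithm. The target risk model is associated with an organization identifier, so that when it is later used for risk identification, the corresponding target risk model can be looked up by that organization identifier.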
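A conditional random field (CRF) models the conditional probability distribution of one set of output random variables given another set of input random variables, under the assumption that the output random variables form a Markov random field. A CRF has the advantages of a discriminative model and also shares with generative models the ability to take transition probabilities between context labels into account and to perform global parameter optimization and decoding over the whole sequence, which resolves the label bias problem that other discriminative models can hardly avoid. A discriminative model models the conditional probability p(y|x;θ) directly, whereas a generative model models the joint distribution p(x,y) of x and y.
In this embodiment, original training data of at least two organizations is acquired first, each piece associated with an organization identifier, so that the data can be divided in equal proportion by organization identifier to obtain positive and negative samples. This helps prevent overfitting and makes the trained risk model more accurate. Next, text vectorization is performed on the positive and negative samples to obtain vectorized target training data, which reduces the computation needed for model training and improves training efficiency. Finally, the target training data is trained with the conditional random field algorithm to obtain the target risk model, so that the model has the advantage of generative models (taking transition probabilities between context labels into account) and resolves the label bias problem that other discriminative models can hardly avoid, improving identification accuracy.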
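In a specific implementation, as shown in Fig. 2, step S13, that is, performing text vectorization on the positive and negative samples to obtain vectorized target training data, specifically includes the following steps:
S131: Use the Jieba word segmentation tool to segment the positive and negative samples and remove stop words, obtaining at least one token.
Stop-word removal refers to automatically filtering out certain stop words before or after processing natural-language data, in order to save storage space and improve search efficiency in information retrieval. Word segmentation is the process of splitting the words of a sentence according to a dictionary. A token (词次) is a word element obtained by segmenting the positive and negative samples. A positive sample is original training data corresponding to the target organization identifier, and a negative sample is original training data corresponding to a non-target organization identifier. The original training data may contain Chinese and/or English. Chinese and English characters are segmented differently, so the two must be separated before segmentation.
In this embodiment, methods for separating Chinese from English in the original training data include, but are not limited to, regular expressions. A regular expression is a logical formula for operating on strings: predefined special characters and combinations of them form a "rule string" that expresses filtering logic over strings. Specifically, the regular expression that matches Chinese characters is [\u4e00-\u9fa5], and the one that matches English characters is [a-zA-Z]. The original training data is separated into Chinese and English with these two expressions to obtain the corresponding separated text (Chinese characters and English characters), so that the subsequent segmentation can run quickly and model training is more efficient.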
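The regular-expression separation can be sketched with Python's standard `re` module. The character class for Chinese is written with `\u` escapes, as `re` expects. The function name `split_by_script` and the choice to collect contiguous runs of characters rather than single characters are assumptions made for the illustration.

```python
import re
from typing import List, Tuple

# Contiguous runs of Chinese characters in the range named in the text.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fa5]+")
# Contiguous runs of English letters.
ENGLISH_RUN = re.compile(r"[a-zA-Z]+")


def split_by_script(text: str) -> Tuple[List[str], List[str]]:
    """Return (Chinese runs, English runs) found in the text, in order."""
    return CHINESE_RUN.findall(text), ENGLISH_RUN.findall(text)
```

Digits and punctuation match neither pattern, so they are dropped by this sketch; whether the original method keeps them is not stated.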
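In this embodiment, methods for segmenting the positive and negative samples include, but are not limited to, using the Jieba word segmentation tool on the Chinese characters of the samples. Jieba is a commonly used Chinese text analysis tool that extracts the words of a sentence one by one with high accuracy and efficiency. Specifically, the Jieba word segmentation tool is configured with a stop-word dictionary, which can also be used to remove stop words from the positive and negative samples. This excludes interference from stop words (such as "我", "个", and "下"), reduces the computation needed for model training, and improves training efficiency.
In this embodiment, because the Jieba word segmentation tool segments Chinese characters, English characters can be mapped to Chinese characters with a pre-stored Chinese-English lookup table and then segmented with Jieba, which improves the generalization ability of the model.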
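Stop-word removal and the English-to-Chinese mapping can be sketched on tokens that have already been segmented. To stay free of third-party dependencies the sketch does not call Jieba itself; in practice the token list might come from a call such as `jieba.lcut(text)`. The lookup table contents, the stop-word set, and the name `normalize_tokens` are hypothetical, and applying the mapping per token (rather than before segmentation, as the text describes) is a simplification.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical stop-word set, using the examples named in the text.
STOP_WORDS: Set[str] = {"我", "个", "下"}
# Hypothetical English-to-Chinese lookup table.
EN_TO_ZH: Dict[str, str] = {"insurance": "保险", "term": "期限"}


def normalize_tokens(tokens: Iterable[str]) -> List[str]:
    """Map English tokens to Chinese where possible, then drop stop words."""
    result = []
    for token in tokens:
        # Tokens not present in the table are kept unchanged.
        mapped = EN_TO_ZH.get(token.lower(), token)
        if mapped not in STOP_WORDS:
            result.append(mapped)
    return result
```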
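S132: Vectorize the at least one token to obtain vectorized target training data.
The target training data is the text data obtained by vectorizing the at least one token. Specifically, the TF-IDF algorithm is used to compute the weight of each token in the original training data, and that weight is used as one dimension of a vector. This gives a vector representation of the tokens, that is, the target training data, which simplifies and speeds up model training.
In this embodiment, the Jieba word segmentation tool is used to segment the positive and negative samples and remove stop words to obtain at least one token, which improves model accuracy and training efficiency. Before segmentation, the separated English characters can also be mapped with the Chinese-English lookup table to obtain converted Chinese characters, which are then segmented with Jieba to improve the generalization ability of the model. Finally, the tokens are vectorized to obtain the target training data, which serves as convenient input for subsequent risk model training.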
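In a specific implementation, as shown in Fig. 3, step S132, that is, vectorizing the at least one token to obtain vectorized target training data, specifically includes the following steps:
S1321: Apply the TF-IDF algorithm to the at least one token to obtain the term frequency of each token.
TF-IDF (term frequency-inverse document frequency) is a weighting algorithm commonly used in information retrieval and data mining; it is simple and fast to compute. Specifically, the TF-IDF algorithm is applied to each token to obtain how often the token occurs in the original training data, which is its term frequency. The formula is
Figure PCTCN2018094178-appb-000001
where u is the number of occurrences of the token in the original training data, U is the total number of tokens in the original training data, and T is the term frequency. In this embodiment, applying the TF-IDF algorithm to the tokens to obtain the term frequency of each token is a simple calculation, which helps improve the training efficiency of the risk model.
S1322: Use the term frequency of each token as a dimension of a vector to obtain the target training data in vector form.
Specifically, the term frequency of each token is used as one dimension of a vector to obtain the target training data. For example, if the original training data is "保险期限-1年" (insurance term: 1 year), segmentation yields the tokens "保险", "期限", and "1年". Assuming the term frequencies computed in step S1321 for these tokens are 0.2, 0.3, and 0.4 in that order, vectorization gives the target training data (0.2, 0.3, 0.4), which is then input to the model for training, improving training efficiency.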
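The term-frequency vectorization of S1321 and S1322 can be sketched as follows. The formula itself survives only as a figure placeholder; the sketch assumes it is T = u / U, which is what the definitions of u, U, and T suggest. The function name and the choice to order dimensions by first occurrence are illustrative assumptions.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def term_frequency_vector(tokens: Sequence[str]) -> Tuple[List[str], List[float]]:
    """Return (vocabulary, vector) where vector[i] = count(vocabulary[i]) / len(tokens)."""
    total = len(tokens)  # U: total number of tokens
    if total == 0:
        return [], []
    counts = Counter(tokens)  # u for every distinct token
    # Keep one dimension per distinct token, ordered by first occurrence.
    vocabulary = list(dict.fromkeys(tokens))
    return vocabulary, [counts[token] / total for token in vocabulary]
```

Note that this yields one vector per text whose length depends on that text's vocabulary; a shared vocabulary across all samples would be needed for fixed-length model input, which the application does not discuss.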
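In this embodiment, the TF-IDF algorithm is first applied to each token to obtain its term frequency in the original training data; this is easy to compute and helps improve the training efficiency of the risk model. Then the term frequency of each token is used as one dimension of a vector to obtain the target training data in vector form for input to the model, which further improves training efficiency.
In a specific implementation, as shown in Fig. 4, step S14, that is, training on the target training data with a conditional random field algorithm to obtain a target risk model, specifically includes the following steps:
S141: Perform calculation on the target training data with a maximum likelihood estimation algorithm to obtain an original risk model.
Maximum likelihood estimation uses the observed sample results and an assumed model to infer the parameter values most likely to have produced those results. Because it uses the form of the distribution function, the resulting estimates are relatively accurate. Specifically, the conditional random field model is
Figure PCTCN2018094178-appb-000002
where $w_k$ is the weight of a feature function and $Z(x)$ is the normalization factor. The formula gives the conditional probability of the output sequence $y$ (the organization identifier) given the input sequence $x$ (the corpus data in the target training data). $f_k$ is a feature function, which usually takes the value 1 or 0: 1 when the feature condition is satisfied and 0 otherwise. Specifically, maximum likelihood estimation is used to estimate the parameters of the conditional random field model. Taking the logarithm of the formula above (the conditional random field model formula) gives
Figure PCTCN2018094178-appb-000003
which is the original risk model. Here $f_k$ is a feature function; $\lambda_k$ is the weight of the feature function, that is, the parameter $w_k$ in the conditional random field model; $(x_i, y_i)$ is the target training data; and $\theta = \{\lambda_k\}$.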
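The two formulas of this step survive only as figure placeholders. A reconstruction consistent with the symbol definitions in the text, assuming the standard log-linear form of a conditional random field and an unpenalized log-likelihood (an assumption, not a transcription of the original figures), would be:

```latex
\[
P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_k w_k f_k(x, y) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_k w_k f_k(x, y') \Big)
\]
\[
L(\theta) = \sum_i \log P(y_i \mid x_i; \theta)
          = \sum_i \Big[ \sum_k \lambda_k f_k(x_i, y_i) - \log Z(x_i) \Big],
\qquad
\theta = \{\lambda_k\}
\]
```

Here $\lambda_k$ plays the role of $w_k$, as the text states. If the original likelihood already includes a regularization term, it would appear as an extra penalty subtracted from $L(\theta)$.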
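S142: Optimize the original risk model with a gradient descent algorithm to obtain the target risk model.
Gradient descent, also called steepest descent, is one of the most commonly used methods for solving for the model parameters of a machine learning algorithm, that is, for unconstrained optimization problems. Specifically, gradient descent iteratively differentiates and optimizes the original risk model to obtain the minimized loss function and the model parameter values; the required model parameters θ are obtained when the iterations reach a derivative of 0, and the target risk model is obtained from these parameters. In this embodiment, differentiating the maximum likelihood function of step S141 gives
Figure PCTCN2018094178-appb-000004
where
Figure PCTCN2018094178-appb-000005
is the regularization term, that is, a penalty function that "penalizes" the model vector to avoid overfitting. A regularization term is essentially prior information. In this embodiment, gradient descent is used to optimize the model parameters of the original risk model to obtain the target risk model; the algorithm is simple to compute and easy to implement.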
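The optimization in S142 can be illustrated on a deliberately small case. The sketch below is not a linear-chain CRF: it assumes a single binary label per text (1 for the target organization), feature functions f_k(x, y) equal to x_k when y = 1 and 0 otherwise, and an L2 penalty as the regularization term, with `l2` standing in for its strength. Under these assumptions the gradient of the penalized log-likelihood with respect to weight k is the sum over samples of (y_i - p_i) * x_ik, minus l2 * w_k, and the loop climbs that gradient, which is the same as descending the negative log-likelihood. All names and hyperparameters are illustrative.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights: Sequence[float], x: Sequence[float]) -> float:
    """P(y = 1 | x) for the two-label log-linear model."""
    return sigmoid(sum(w * xk for w, xk in zip(weights, x)))


def train(
    xs: Sequence[Sequence[float]],
    ys: Sequence[int],
    lr: float = 0.5,
    l2: float = 0.01,
    epochs: int = 500,
) -> List[float]:
    """Fit the weights by gradient steps on the L2-penalized log-likelihood."""
    weights = [0.0] * len(xs[0])
    for _ in range(epochs):
        # Start from the gradient of the penalty term.
        grad = [-l2 * w for w in weights]
        for x, y in zip(xs, ys):
            p = predict(weights, x)
            # Observed feature value minus its expectation under the model.
            for k, xk in enumerate(x):
                grad[k] += (y - p) * xk
        # Step along the gradient, scaled by the number of samples.
        weights = [w + lr * g / len(xs) for w, g in zip(weights, grad)]
    return weights
```

The fixed number of epochs replaces the "iterate until the derivative is 0" stopping rule of the text, which is rarely met exactly in floating-point arithmetic.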
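In this embodiment, the logarithm of the conditional random field model is taken first to obtain the likelihood function, and maximum likelihood estimation is then used to estimate the model parameters of the conditional random field model. Because maximum likelihood estimation uses the form of the distribution function, the estimates are relatively accurate, which improves the accuracy of the risk model. Finally, gradient descent is used to optimize the model parameters of the original risk model to obtain the target risk model, which simplifies the computation and improves training efficiency.
In this embodiment, original training data of at least two organizations is acquired first, each piece associated with an organization identifier, so that the data can be divided in equal proportion by organization identifier to obtain positive and negative samples. This helps prevent overfitting and makes the trained risk model more accurate. Then the Jieba word segmentation tool is used to segment the positive and negative samples and remove stop words to obtain at least one token, which improves model accuracy and training efficiency. Before segmentation, the separated English characters can also be mapped with the Chinese-English lookup table to obtain converted Chinese characters, which are then segmented with Jieba to improve the generalization ability of the model. Next, the TF-IDF algorithm is applied to each token to obtain its term frequency in the original training data, which is easy to compute and helps training efficiency. The term frequency of each token is used as one dimension of a vector to obtain the target training data in vector form, which is convenient input for subsequent risk model training and further improves training efficiency. Finally, the target training data is trained with the conditional random field algorithm to obtain the target risk model, so that the model has the advantage of generative models, namely taking transition probabilities between context labels into account, and resolves the label bias problem that other discriminative models can hardly avoid, improving identification accuracy.
It should be understood that the sequence numbers of the steps in the above embodiment do not imply an order of execution. The order of execution of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.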
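Embodiment 2
Fig. 5 shows a schematic block diagram of the risk model training device corresponding one-to-one to the risk model training method in Embodiment 1. As shown in Fig. 5, the risk model training device includes an original training data acquisition module 11, a positive and negative sample acquisition module 12, a target training data acquisition module 13, and a target risk model acquisition module 14. The functions implemented by these four modules correspond one-to-one to the steps of the risk model training method in Embodiment 1. To avoid redundancy, they are not described in detail in this embodiment.
The original training data acquisition module 11 is configured to acquire original training data of at least two organizations, each piece of original training data being associated with an organization identifier.
The positive and negative sample acquisition module 12 is configured to divide the original training data in equal proportion based on the organization identifier to obtain positive and negative samples.
The target training data acquisition module 13 is configured to perform text vectorization on the positive and negative samples to obtain vectorized target training data.
The target risk model acquisition module 14 is configured to train on the target training data with a conditional random field algorithm to obtain a target risk model.
Preferably, the target training data acquisition module 13 includes a token acquisition unit 131 and a target training data acquisition unit 132.
The token acquisition unit 131 is configured to segment the positive and negative samples and remove stop words with the Jieba word segmentation tool to obtain at least one token.
The target training data acquisition unit 132 is configured to vectorize the at least one token to obtain vectorized target training data.
Preferably, the target training data acquisition unit 132 includes a term frequency acquisition subunit 1321 and a target training data acquisition subunit 1322.
The term frequency acquisition subunit 1321 is configured to apply the TF-IDF algorithm to the at least one token to obtain the term frequency of each token.
The target training data acquisition subunit 1322 is configured to use the term frequency of each token as a dimension of a vector to obtain the target training data in vector form.
Preferably, the target risk model acquisition module 14 includes an original risk model acquisition unit 141 and a target risk model acquisition unit 142.
The original risk model acquisition unit 141 is configured to perform calculation on the target training data with a maximum likelihood estimation algorithm to obtain an original risk model.
The target risk model acquisition unit 142 is configured to optimize the original risk model with a gradient descent algorithm to obtain the target risk model.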
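Embodiment 3
Fig. 6 shows a flowchart of the risk identification method in this embodiment. The risk identification method is applied on a social platform, or at a financial institution such as a bank, securities firm, or insurer, or at any other institution that needs to perform risk identification. It uses the target risk model to perform risk identification on domain-specific promotional information that users publish on the social platform or the institution's internal communication platform, so that risk sources can be located automatically. As shown in Fig. 6, the risk identification method includes the following steps:
S21: Acquire data to be identified corresponding to the organization identifier, where the data to be identified is associated with a user ID.
The data to be identified is data that is published on a social platform or on the organization's internal communication platform, is collected in real time by a crawler tool, and needs to be checked for risk. The user ID is a unique identifier of a user; it may be the account with which the user logs in to the social platform or the internal communication platform. In this embodiment, the data to be identified is specifically data related to the insurance field. Specifically, the data to be identified corresponding to the organization identifier is acquired, and this data is associated with the user ID; that is, data that the user has publicly published on the social platform or the internal communication platform is the data to be identified. The target risk model corresponding to the organization identifier is called to identify the data to be identified and determine its risk.
Specifically, the data to be identified may be crawled with a crawler tool from the data published on the social platform or the internal communication platform, to obtain the data to be identified associated with the organization identifier. In this embodiment, the crawler tool includes, but is not limited to, the ForeSpider data acquisition software. ForeSpider is a visual, general-purpose crawler that can start collecting after a simple two-step configuration; it also ships with a free database, so collected data can be stored directly. ForeSpider has a built-in browser; entering the account and password in that browser logs the crawler in. Automatic login can also be configured so that the next crawl logs in automatically and obtains the data to be identified in real time, which enables real-time risk control.
S22: Input the data to be identified into the target risk model for identification, and obtain a risk identification probability.
In this embodiment, the data to be identified is input into the target risk model corresponding to the organization identifier for identification; the model performs calculation on the input and outputs the risk identification probability. Specifically, after the user's data to be identified is acquired, it is processed by the target risk model corresponding to the organization identifier to obtain the risk identification probability. In this embodiment, the risk identification probability may be a real number between 0 and 1.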
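Looking up the target risk model by organization identifier and scoring the input can be sketched as a small registry. This is a hypothetical structure: the class name `RiskModelRegistry`, its methods, and the representation of a model as a callable from a feature vector to a probability are all assumptions, since the application does not describe how models are stored.

```python
from typing import Callable, Dict, Sequence

# A model is assumed to map a feature vector to a probability.
RiskModel = Callable[[Sequence[float]], float]


class RiskModelRegistry:
    """Keeps one trained target risk model per organization identifier."""

    def __init__(self) -> None:
        self._models: Dict[str, RiskModel] = {}

    def register(self, org_id: str, model: RiskModel) -> None:
        self._models[org_id] = model

    def score(self, org_id: str, features: Sequence[float]) -> float:
        # Raises KeyError when no model was trained for this organization.
        model = self._models[org_id]
        # Clamp so the returned value always lies in [0, 1].
        return min(1.0, max(0.0, model(features)))
```

A caller would register each trained model once under its organization identifier and then call `score` with the vectorized data to be identified.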
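S23: If the risk identification probability is greater than a preset probability, determine that the data to be identified is high-risk data.
The preset probability is a threshold set in advance for evaluating whether the data to be identified associated with the user is risky. In this embodiment, the risk identification probability that the target risk model outputs for the data to be identified is compared with the preset probability. If it is greater than the preset probability, the data to be identified is determined to be high-risk data. If it is less than or equal to the preset probability, the data to be identified is low-risk data.
Further, the data to be identified is associated with the user ID, and the user ID is associated with the organization identifier. If the data to be identified is determined to be high-risk data, the user is a high-risk user, that is, a user at high risk of leaving. For example, suppose the user is an employee of a bank, securities firm, insurer, or other institution corresponding to the organization identifier, and the employee publishes data to be identified on the social platform or the internal communication platform using the user ID as the login account. The target risk model corresponding to the organization identifier identifies that data to determine whether it is genuine corpus data of the target organization corresponding to the organization identifier. If so, the employee is spreading promotional information (that is, the data to be identified) of the target organization and is not a user at high risk of leaving. If not, the employee is spreading promotional information (that is, the data to be identified) of a non-target organization (that is, another organization); this can be used to judge whether the employee intends to change jobs, and the employee is treated as a user at high risk of leaving, which facilitates internal personnel management.
In this embodiment, a crawler tool first crawls public data in real time to obtain the data to be identified associated with the organization identifier, which achieves real-time risk control. The data to be identified is then processed by the target risk model to obtain the risk identification probability. Finally, the risk identification probability is evaluated: if it is greater than the preset probability, the data to be identified is high-risk data. This makes it possible to identify the risk of data that users spread on public social platforms or on the organization's internal communication platform.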
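Embodiment 4
Fig. 7 shows a schematic block diagram of the risk identification device corresponding one-to-one to the risk identification method in Embodiment 3. As shown in Fig. 7, the risk identification device includes a to-be-identified data acquisition module 21, a risk identification probability acquisition module 22, and a high-risk data determination module 23. The functions implemented by these three modules correspond one-to-one to the steps of the risk identification method in Embodiment 3. To avoid redundancy, they are not described in detail in this embodiment.
The to-be-identified data acquisition module 21 is configured to acquire data to be identified corresponding to the organization identifier.
The risk identification probability acquisition module 22 is configured to input the data to be identified into the target risk model corresponding to the organization identifier for identification to obtain a risk identification probability, where the target risk model is a model obtained by training with the risk model training method in Embodiment 1.
The high-risk data determination module 23 is configured to determine that the data to be identified is high-risk data if the risk identification probability is greater than a preset probability.
Embodiment 5
This embodiment provides one or more non-volatile readable storage media storing computer-readable instructions. When the computer-readable instructions are executed by one or more processors, the one or more processors implement the risk model training method in Embodiment 1; or implement the functions of the modules/units of the risk model training device in Embodiment 2; or implement the risk identification method in Embodiment 3; or implement the functions of the modules/units of the risk identification device in Embodiment 4. To avoid repetition, details are not described here again.
Embodiment 6
Fig. 8 is a schematic diagram of a computer device according to an embodiment of the present application. As shown in Fig. 8, the computer device 80 of this embodiment includes a processor 81, a memory 82, and computer-readable instructions 83 stored in the memory 82 and executable on the processor 81. When the processor 81 executes the computer-readable instructions 83, the steps of the risk model training method in Embodiment 1 are implemented; or the functions of the modules/units of the risk model training device in Embodiment 2 are implemented; or the steps of the risk identification method in Embodiment 3 are implemented; or the functions of the modules/units of the risk identification device in Embodiment 4 are implemented. To avoid repetition, details are not described here.
A person skilled in the art can clearly understand that, for convenience and brevity of description, the division into the functional units and modules above is only an example. In practice, the functions may be assigned to different functional units or modules as needed; that is, the internal structure of the device may be divided into different functional units or modules to complete all or part of the functions described above.
The embodiments described above are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.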

Claims (20)

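  1. A risk model training method, characterized by comprising:
    acquiring original training data of at least two organizations, each piece of original training data being associated with an organization identifier;
    dividing the original training data in equal proportion based on the organization identifier to obtain positive and negative samples;
    performing text vectorization on the positive and negative samples to obtain vectorized target training data;
    training on the target training data with a conditional random field algorithm to obtain a target risk model.
  2. The risk model training method according to claim 1, characterized in that performing text vectorization on the positive and negative samples to obtain vectorized target training data comprises:
    segmenting the positive and negative samples and removing stop words with the Jieba word segmentation tool to obtain at least one token;
    vectorizing the at least one token to obtain vectorized target training data.
  3. The risk model training method according to claim 2, characterized in that vectorizing the at least one token to obtain vectorized target training data comprises:
    applying the TF-IDF algorithm to the at least one token to obtain the term frequency of each token;
    using the term frequency of each token as a dimension of a vector to obtain the target training data in vector form.
  4. The risk model training method according to claim 1, characterized in that training on the target training data with a conditional random field algorithm to obtain a target risk model comprises:
    performing calculation on the target training data with a maximum likelihood estimation algorithm to obtain an original risk model;
    optimizing the original risk model with a gradient descent algorithm to obtain the target risk model.
  5. The risk model training method according to claim 4, characterized in that the calculation formula of the maximum likelihood estimation algorithm is
    Figure PCTCN2018094178-appb-100001
    where $f_k$ is a feature function, $\lambda_k$ is the weight of the feature function, $(x_i, y_i)$ is the target training data, and $Z(x_i)$ is the normalization term;
    the calculation formula of the gradient descent algorithm is
    Figure PCTCN2018094178-appb-100002
    where L denotes the original risk model.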
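  6. A risk identification method, characterized by comprising:
    acquiring data to be identified corresponding to an organization identifier;
    inputting the data to be identified into the target risk model corresponding to the organization identifier for identification to obtain a risk identification probability, the target risk model being a model obtained by training with the risk model training method of any one of claims 1-5;
    if the risk identification probability is greater than a preset probability, determining that the data to be identified is high-risk data.
  7. A risk model training device, characterized by comprising:
    an original training data acquisition module, configured to acquire original training data of at least two organizations, each piece of original training data being associated with an organization identifier;
    a positive and negative sample acquisition module, configured to divide the original training data in equal proportion based on the organization identifier to obtain positive and negative samples;
    a target training data acquisition module, configured to perform text vectorization on the positive and negative samples to obtain vectorized target training data;
    a target risk model acquisition module, configured to train on the target training data with a conditional random field algorithm to obtain a target risk model.
  8. A risk identification device, characterized by comprising:
    a to-be-identified data acquisition module, configured to acquire data to be identified corresponding to an organization identifier;
    a risk identification probability acquisition module, configured to input the data to be identified into the target risk model corresponding to the organization identifier for identification to obtain a risk identification probability, the target risk model being a model obtained by training with the risk model training method of any one of claims 1-5;
    a high-risk data determination module, configured to determine that the data to be identified is high-risk data if the risk identification probability is greater than a preset probability.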
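  9. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor implements the following steps when executing the computer-readable instructions:
    acquiring original training data of at least two organizations, each piece of original training data being associated with an organization identifier;
    dividing the original training data in equal proportion based on the organization identifier to obtain positive and negative samples;
    performing text vectorization on the positive and negative samples to obtain vectorized target training data;
    training on the target training data with a conditional random field algorithm to obtain a target risk model.
  10. The computer device according to claim 9, characterized in that performing text vectorization on the positive and negative samples to obtain vectorized target training data comprises:
    segmenting the positive and negative samples and removing stop words with the Jieba word segmentation tool to obtain at least one token;
    vectorizing the at least one token to obtain vectorized target training data.
  11. The computer device according to claim 10, characterized in that vectorizing the at least one token to obtain vectorized target training data comprises:
    applying the TF-IDF algorithm to the at least one token to obtain the term frequency of each token;
    using the term frequency of each token as a dimension of a vector to obtain the target training data in vector form.
  12. The computer device according to claim 9, characterized in that training on the target training data with a conditional random field algorithm to obtain a target risk model comprises:
    performing calculation on the target training data with a maximum likelihood estimation algorithm to obtain an original risk model;
    optimizing the original risk model with a gradient descent algorithm to obtain the target risk model.
  13. The computer device according to claim 12, characterized in that the calculation formula of the maximum likelihood estimation algorithm is
    Figure PCTCN2018094178-appb-100003
    where $f_k$ is a feature function, $\lambda_k$ is the weight of the feature function, $(x_i, y_i)$ is the target training data, and $Z(x_i)$ is the normalization term;
    the calculation formula of the gradient descent algorithm is
    Figure PCTCN2018094178-appb-100004
    where L denotes the original risk model.
  14. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor implements the following steps when executing the computer-readable instructions:
    acquiring data to be identified corresponding to an organization identifier;
    inputting the data to be identified into the target risk model corresponding to the organization identifier for identification to obtain a risk identification probability, the target risk model being a model obtained by training with the risk model training method of any one of claims 1-5;
    if the risk identification probability is greater than a preset probability, determining that the data to be identified is high-risk data.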
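  15. One or more non-volatile readable storage media storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring original training data of at least two organizations, each piece of original training data being associated with an organization identifier;
    dividing the original training data in equal proportion based on the organization identifier to obtain positive and negative samples;
    performing text vectorization on the positive and negative samples to obtain vectorized target training data;
    training on the target training data with a conditional random field algorithm to obtain a target risk model.
  16. The non-volatile readable storage medium according to claim 15, characterized in that performing text vectorization on the positive and negative samples to obtain vectorized target training data comprises:
    segmenting the positive and negative samples and removing stop words with the Jieba word segmentation tool to obtain at least one token;
    vectorizing the at least one token to obtain vectorized target training data.
  17. The non-volatile readable storage medium according to claim 16, characterized in that vectorizing the at least one token to obtain vectorized target training data comprises:
    applying the TF-IDF algorithm to the at least one token to obtain the term frequency of each token;
    using the term frequency of each token as a dimension of a vector to obtain the target training data in vector form.
  18. The non-volatile readable storage medium according to claim 15, characterized in that training on the target training data with a conditional random field algorithm to obtain a target risk model comprises:
    performing calculation on the target training data with a maximum likelihood estimation algorithm to obtain an original risk model;
    optimizing the original risk model with a gradient descent algorithm to obtain the target risk model.
  19. The non-volatile readable storage medium according to claim 18, characterized in that the calculation formula of the maximum likelihood estimation algorithm is
    Figure PCTCN2018094178-appb-100005
    where $f_k$ is a feature function, $\lambda_k$ is the weight of the feature function, $(x_i, y_i)$ is the target training data, and $Z(x_i)$ is the normalization term;
    the calculation formula of the gradient descent algorithm is
    Figure PCTCN2018094178-appb-100006
    where L denotes the original risk model.
  20. One or more non-volatile readable storage media storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring data to be identified corresponding to an organization identifier;
    inputting the data to be identified into the target risk model corresponding to the organization identifier for identification to obtain a risk identification probability, the target risk model being a model obtained by training with the risk model training method of any one of claims 1-5;
    if the risk identification probability is greater than a preset probability, determining that the data to be identified is high-risk data.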
PCT/CN2018/094178 2018-03-26 2018-07-03 风险模型训练方法、风险识别方法、装置、设备及介质 WO2019184118A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810250165.2 2018-03-26
CN201810250165.2A CN108520343B (zh) 2018-03-26 2018-03-26 风险模型训练方法、风险识别方法、装置、设备及介质

Publications (1)

Publication Number Publication Date
WO2019184118A1 true WO2019184118A1 (zh) 2019-10-03

Family

ID=63434278

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/094178 WO2019184118A1 (zh) 2018-03-26 2018-07-03 风险模型训练方法、风险识别方法、装置、设备及介质

Country Status (2)

Country Link
CN (1) CN108520343B (zh)
WO (1) WO2019184118A1 (zh)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866394A (zh) * 2019-10-12 2020-03-06 上海数禾信息科技有限公司 公司名称识别方法及装置、计算机设备及可读存储介质
CN110909775A (zh) * 2019-11-08 2020-03-24 支付宝(杭州)信息技术有限公司 一种数据处理方法、装置及电子设备
CN111046655A (zh) * 2019-11-14 2020-04-21 腾讯科技(深圳)有限公司 一种数据处理方法、装置及计算机可读存储介质
CN112687266A (zh) * 2020-12-22 2021-04-20 深圳追一科技有限公司 语音识别方法、装置、计算机设备和存储介质
CN112749565A (zh) * 2019-10-31 2021-05-04 华为终端有限公司 基于人工智能的语义识别方法、装置和语义识别设备
CN113239697A (zh) * 2021-06-01 2021-08-10 平安科技(深圳)有限公司 实体识别模型训练方法、装置、计算机设备及存储介质
CN113297998A (zh) * 2021-05-31 2021-08-24 中煤航测遥感集团有限公司 国土空间规划问题的识别方法、装置、设备及存储介质
CN113837764A (zh) * 2021-09-22 2021-12-24 平安科技(深圳)有限公司 风险预警方法、装置、电子设备和存储介质
CN116029808A (zh) * 2023-03-23 2023-04-28 北京芯盾时代科技有限公司 一种风险识别模型训练方法、装置及电子设备
CN116578877A (zh) * 2023-07-14 2023-08-11 之江实验室 一种模型训练及二次优化打标的风险识别的方法及装置

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109919608B (zh) * 2018-11-28 2024-01-16 创新先进技术有限公司 一种高危交易主体的识别方法、装置及服务器
CN110032727A (zh) * 2019-01-16 2019-07-19 阿里巴巴集团控股有限公司 风险识别方法及装置
CN109886699A (zh) * 2019-02-18 2019-06-14 北京三快在线科技有限公司 行为识别方法及装置、电子设备、存储介质
CN110135681B (zh) * 2019-04-03 2023-08-22 平安科技(深圳)有限公司 风险用户识别方法、装置、可读存储介质及终端设备
CN110322252B (zh) * 2019-05-30 2023-07-04 创新先进技术有限公司 风险主体识别方法以及装置
CN110321423B (zh) * 2019-05-31 2023-03-31 创新先进技术有限公司 一种文本数据的风险识别方法及服务器
CN112711643B (zh) * 2019-10-25 2023-10-10 北京达佳互联信息技术有限公司 训练样本集获取方法及装置、电子设备、存储介质
CN110956275B (zh) * 2019-11-27 2021-04-02 支付宝(杭州)信息技术有限公司 风险预测和风险预测模型的训练方法、装置及电子设备
CN110942259B (zh) * 2019-12-10 2020-09-29 北方工业大学 社区燃气设备风险评估方法及装置
CN111400764B (zh) * 2020-03-25 2021-05-07 支付宝(杭州)信息技术有限公司 个人信息保护的风控模型训练方法、风险识别方法及硬件
CN115171910A (zh) * 2020-04-22 2022-10-11 第四范式(北京)技术有限公司 生成筛查模型、筛查传染病高风险感染人群的方法及***
CN112118551B (zh) * 2020-10-16 2022-09-09 同盾控股有限公司 设备风险识别方法及相关设备
CN114708109B (zh) * 2022-03-01 2022-11-11 上海钐昆网络科技有限公司 风险识别模型的训练方法、装置、设备及存储介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9190055B1 (en) * 2013-03-14 2015-11-17 Amazon Technologies, Inc. Named entity recognition with personalized models
CN107038178A (zh) * 2016-08-03 2017-08-11 平安科技(深圳)有限公司 舆情分析方法和装置
CN107798390A (zh) * 2017-11-22 2018-03-13 阿里巴巴集团控股有限公司 一种机器学习模型的训练方法、装置以及电子设备

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8423488B2 (en) * 2009-12-02 2013-04-16 Fair Isaac Corporation System and method for building a predictive score without model training
CN104636449A (zh) * 2015-01-27 2015-05-20 厦门大学 基于lsa-gcc的分布式大数据***风险识别方法
CN106992994B (zh) * 2017-05-24 2020-07-03 腾讯科技(深圳)有限公司 一种云服务的自动化监控方法和***

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIN CHUNYU ET AL.: "A model of pre-warning based on the Big data technology for P2P Lenfding paltform", BIG DATA RESEARCH, no. 4, 20 November 2015 (2015-11-20), pages 6-7 - 9 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866394A (zh) * 2019-10-12 2020-03-06 上海数禾信息科技有限公司 公司名称识别方法及装置、计算机设备及可读存储介质
CN112749565A (zh) * 2019-10-31 2021-05-04 华为终端有限公司 基于人工智能的语义识别方法、装置和语义识别设备
CN110909775A (zh) * 2019-11-08 2020-03-24 支付宝(杭州)信息技术有限公司 一种数据处理方法、装置及电子设备
CN111046655B (zh) * 2019-11-14 2023-04-07 腾讯科技(深圳)有限公司 一种数据处理方法、装置及计算机可读存储介质
CN111046655A (zh) * 2019-11-14 2020-04-21 腾讯科技(深圳)有限公司 一种数据处理方法、装置及计算机可读存储介质
CN112687266B (zh) * 2020-12-22 2023-07-21 深圳追一科技有限公司 语音识别方法、装置、计算机设备和存储介质
CN112687266A (zh) * 2020-12-22 2021-04-20 深圳追一科技有限公司 语音识别方法、装置、计算机设备和存储介质
CN113297998A (zh) * 2021-05-31 2021-08-24 中煤航测遥感集团有限公司 国土空间规划问题的识别方法、装置、设备及存储介质
CN113297998B (zh) * 2021-05-31 2024-04-26 中煤航测遥感集团有限公司 国土空间规划问题的识别方法、装置、设备及存储介质
CN113239697B (zh) * 2021-06-01 2023-03-24 平安科技(深圳)有限公司 实体识别模型训练方法、装置、计算机设备及存储介质
CN113239697A (zh) * 2021-06-01 2021-08-10 平安科技(深圳)有限公司 实体识别模型训练方法、装置、计算机设备及存储介质
CN113837764A (zh) * 2021-09-22 2021-12-24 平安科技(深圳)有限公司 风险预警方法、装置、电子设备和存储介质
CN113837764B (zh) * 2021-09-22 2023-07-25 平安科技(深圳)有限公司 风险预警方法、装置、电子设备和存储介质
CN116029808A (zh) * 2023-03-23 2023-04-28 北京芯盾时代科技有限公司 一种风险识别模型训练方法、装置及电子设备
CN116029808B (zh) * 2023-03-23 2023-06-30 北京芯盾时代科技有限公司 一种风险识别模型训练方法、装置及电子设备
CN116578877A (zh) * 2023-07-14 2023-08-11 之江实验室 一种模型训练及二次优化打标的风险识别的方法及装置
CN116578877B (zh) * 2023-07-14 2023-12-26 之江实验室 一种模型训练及二次优化打标的风险识别的方法及装置

Also Published As

Publication number Publication date
CN108520343B (zh) 2022-07-19
CN108520343A (zh) 2018-09-11

Similar Documents

Publication Publication Date Title
WO2019184118A1 (zh) 风险模型训练方法、风险识别方法、装置、设备及介质
US11574122B2 (en) Method and system for joint named entity recognition and relation extraction using convolutional neural network
US11748416B2 (en) Machine-learning system for servicing queries for digital content
Pham et al. Semantic labeling: a domain-independent approach
US10692019B2 (en) Failure feedback system for enhancing machine learning accuracy by synthetic data generation
US10579940B2 (en) Joint embedding of corpus pairs for domain mapping
Dashtipour et al. Exploiting deep learning for Persian sentiment analysis
WO2020133960A1 (zh) 文本质检方法、电子装置、计算机设备及存储介质
WO2018056423A1 (ja) シナリオパッセージ分類器、シナリオ分類器、及びそのためのコンピュータプログラム
CN113128203A (zh) 基于注意力机制的关系抽取方法、***、设备及存储介质
US20230367821A1 (en) Machine-learning system for servicing queries for digital content
CN113971210B (zh) 一种数据字典生成方法、装置、电子设备及存储介质
CN113779358A (zh) 一种事件检测方法和***
CN111709225B (zh) 一种事件因果关系判别方法、装置和计算机可读存储介质
CN111753496B (zh) 行业类别识别方法、装置、计算机设备及可读存储介质
Wang et al. Cyber threat intelligence entity extraction based on deep learning and field knowledge engineering
CN117349437A (zh) 基于智能ai的政府信息管理***及其方法
Jagdish et al. Identification of End‐User Economical Relationship Graph Using Lightweight Blockchain‐Based BERT Model
Wu et al. Tedm-pu: A tax evasion detection method based on positive and unlabeled learning
JP2024518458A (ja) テキスト内の自動トピック検出のシステム及び方法
CN111723583B (zh) 基于意图角色的语句处理方法、装置、设备及存储介质
US11829406B1 (en) Image-based document search using machine learning
CN117009516A (zh) 换流站故障策略模型训练方法、推送方法及装置
Ye et al. Deep truth discovery for pattern-based fact extraction
CN114049165B (zh) 一种采购***的商品比价方法、装置、设备和介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18912182

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 21/01/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18912182

Country of ref document: EP

Kind code of ref document: A1