CN112308453A - Risk identification model training method, user risk identification method and related device - Google Patents

Risk identification model training method, user risk identification method and related device

Info

Publication number
CN112308453A
Authority
CN
China
Prior art keywords
training
risk
word
current
risk identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011301542.4A
Other languages
Chinese (zh)
Other versions
CN112308453B (en)
Inventor
刘宏剑
杨青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Du Xiaoman Technology Beijing Co Ltd
Original Assignee
Shanghai Youyang New Media Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Youyang New Media Information Technology Co ltd
Priority to CN202011301542.4A
Publication of CN112308453A
Application granted
Publication of CN112308453B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Data Mining & Analysis (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Development Economics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Educational Administration (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Game Theory and Decision Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Character Discrimination (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a risk identification model training method, a user risk identification method and a related device. The training method comprises: de-duplicating the search logs in an initial sample to obtain the words, and sorting the words by using a keyword dictionary whose order is set according to the importance of the words; truncating the sorting result into at least one input text according to a preset length; and training a risk recognition model with the at least one input text as a training sample to obtain a target risk recognition model. Because the training samples are obtained by de-duplicating the search logs, sorting the resulting words according to the keyword dictionary and truncating them to the preset length, the training samples are shorter than those produced by splicing, which improves training efficiency. Because the words are sorted by the keyword dictionary before truncation, the more important words are retained even after truncation, which preserves training accuracy.

Description

Risk identification model training method, user risk identification method and related device
Technical Field
The invention relates to the technical field of data processing, in particular to a risk identification model training method, a user risk identification method and a related device.
Background
When a user searches for information over a network, a large number of search logs are generated. These logs usually exist as text, and the text can be used to identify user risk. In the existing risk identification process, a model is first built and trained based on methods such as TextCNN, LSTM, or a pre-trained neural network to obtain a risk identification model, and risk identification is then performed with that model. The training process for the TextCNN model includes: splicing the user search logs into a long text, and training on the user's risk label with a TextCNN neural network. The training process for the LSTM model includes: splicing the user search logs into a long text, and training on the user's risk label with a long short-term memory (LSTM) recurrent neural network. The training process for the pre-trained neural network model includes: splicing the user search logs into a long text, obtaining a pre-trained neural network by training on a large-scale corpus, and fine-tuning it on the user's risk label.
However, existing neural networks take a long time to process long text input and recognize it poorly, so the accuracy and efficiency of the trained neural network model cannot meet high requirements, which directly affects the user risk identification process.
Disclosure of Invention
In view of the above, the present invention provides a risk recognition model training method, a user risk recognition method, and a related device, to solve the problem that existing neural networks take a long time to process long text input and recognize it poorly, so that the accuracy and efficiency of the trained neural network model cannot meet high requirements and the user risk identification process is directly affected. The specific scheme is as follows:
A risk recognition model training method, comprising:
obtaining an initial sample;
performing de-duplication on the search logs in the initial sample to obtain each word;
sorting the words by using a keyword dictionary to obtain a sorting result, wherein the keyword dictionary is pre-established and comprises a plurality of words whose order is set according to the importance of the words;
truncating the sorting result into at least one input text according to a preset length;
and training a risk recognition model by taking the at least one input text as a training sample to obtain a target risk recognition model, wherein the risk recognition model is constructed based on an Embedding layer and a Transformer structure.
Optionally, in the above method, the process of establishing the keyword dictionary includes:
splicing all the search logs of each user to obtain a spliced text;
segmenting the spliced text to obtain each word;
calculating the high-low risk discrimination and the occurrence frequency corresponding to each word, and taking the product of the high-low risk discrimination and the occurrence frequency as the importance value of the word;
and sorting the words based on the importance value to obtain the keyword dictionary.
Optionally, in the above method, calculating the high-low risk discrimination corresponding to each word includes:
counting the proportion H of high-risk users and the proportion L of low-risk users among all the users who have searched the word;
obtaining the proportion H' of high-risk users and the proportion L' of low-risk users among all users;
and calculating the high-low risk discrimination R from H, L, H' and L' based on a preset formula, which the source gives only as an image:
Figure BDA0002787055530000021
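The four proportions that feed the preset formula can be counted directly from labelled users. The sketch below is only an illustration: the function name `risk_proportions`, the mapping layout, and the assumption that every user is labelled either high risk or low risk (so that L = 1 - H) are not taken from the patent, and the formula for R itself is not reproduced because the source gives it only as an image.

```python
from typing import Dict, Set, Tuple


def risk_proportions(
    word: str,
    user_words: Dict[str, Set[str]],
    user_is_high_risk: Dict[str, bool],
) -> Tuple[float, float, float, float]:
    """Return (H, L, H', L') for one word.

    user_words maps a user id to the set of words that user searched.
    user_is_high_risk maps a user id to True (high risk) or False (low risk).
    Assumption: labels are binary, so the low-risk share is 1 minus the high-risk share.
    """
    searchers = [u for u, words in user_words.items() if word in words]
    if not searchers:
        raise ValueError(f"no user searched the word {word!r}")

    # H and L: shares among the users who searched this word.
    h = sum(user_is_high_risk[u] for u in searchers) / len(searchers)
    l = 1.0 - h

    # H' and L': shares among all users.
    h_all = sum(user_is_high_risk[u] for u in user_words) / len(user_words)
    l_all = 1.0 - h_all
    return h, l, h_all, l_all


# Toy data, invented for illustration only.
example_words = {
    "u1": {"loan", "weather"},
    "u2": {"loan"},
    "u3": {"weather"},
    "u4": {"loan", "news"},
}
example_labels = {"u1": True, "u2": True, "u3": False, "u4": False}
print(risk_proportions("loan", example_words, example_labels))
```

A word whose H differs strongly from H' is searched disproportionately by one risk group, which is the property the discrimination R is meant to capture.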
Optionally, in the above method, constructing the risk identification model based on the Embedding layer and the Transformer structure includes:
training a text prediction model based on a preset training corpus to obtain a target text prediction model, wherein the target text prediction model comprises the Embedding layer and the Transformer structure;
when training is completed, acquiring the Embedding layer and the Transformer structure;
adding a risk identification layer, and constructing the risk identification model in the order of the Embedding layer, the Transformer structure and the risk identification layer.
Optionally, the above method further includes:
acquiring the length of the sorting result;
and when the length is smaller than the preset length, padding the sorting result with blanks up to the preset length.
A user risk identification method, comprising:
when a risk identification request for a current user is received, calling a target risk identification model, wherein the target risk identification model is obtained by training based on the above training method;
acquiring the current search log of the current user, and performing de-duplication on the current search log to obtain each current word;
sorting the current words according to the keyword dictionary to obtain a current sorting result;
truncating the current sorting result into a current input text according to a preset length;
and transmitting the current input text to the target risk identification model for risk identification.
A risk recognition model training apparatus, comprising:
an initial sample acquisition module, configured to obtain an initial sample;
a first de-duplication module, configured to perform de-duplication on the search logs in the initial sample to obtain each word;
a first sorting module, configured to sort the words by using a keyword dictionary to obtain a sorting result, wherein the keyword dictionary is pre-established and comprises a plurality of words whose order is set according to the importance of the words;
a first truncation module, configured to truncate the sorting result into at least one input text according to a preset length;
and a training module, configured to train a risk recognition model by taking the at least one input text as a training sample to obtain a target risk recognition model, wherein the risk recognition model is constructed based on an Embedding layer and a Transformer structure.
Optionally, in the above apparatus, the first sorting module establishes the keyword dictionary by means of:
a splicing unit, configured to splice all the search logs of each user to obtain a spliced text;
a word segmentation unit, configured to segment the spliced text to obtain each word;
a calculation unit, configured to calculate the high-low risk discrimination and the occurrence frequency corresponding to each word, and to take the product of the high-low risk discrimination and the occurrence frequency as the importance value of the word;
and a sorting unit, configured to sort the words based on the importance value to obtain the keyword dictionary.
Optionally, in the above apparatus, the training module constructs the risk identification model based on the Embedding layer and the Transformer structure by means of:
a training unit, configured to train a text prediction model based on a preset training corpus to obtain a target text prediction model, wherein the target text prediction model comprises the Embedding layer and the Transformer structure;
an acquisition unit, configured to acquire the Embedding layer and the Transformer structure when training is finished;
and a construction unit, configured to add a risk identification layer and construct the risk identification model in the order of the Embedding layer, the Transformer structure and the risk identification layer.
A user risk identification device, comprising:
a calling module, configured to call a target risk recognition model when a risk recognition request for a current user is received, wherein the target risk recognition model is obtained by training based on the above training method;
a second de-duplication module, configured to acquire the current search log of the current user and perform de-duplication on the current search log to obtain each current word;
a second sorting module, configured to sort the current words according to the keyword dictionary to obtain a current sorting result;
a second truncation module, configured to truncate the current sorting result into a current input text according to a preset length;
and an identification module, configured to transmit the current input text to the target risk identification model for risk identification.
Compared with the prior art, the invention has the following advantages:
The invention discloses a risk identification model training method, a user risk identification method and a related device. The training method comprises: obtaining an initial sample; de-duplicating the search logs in the initial sample to obtain each word; sorting the words by using a keyword dictionary to obtain a sorting result, wherein the keyword dictionary is pre-established and comprises a plurality of words whose order is set according to the importance of the words; truncating the sorting result into at least one input text according to a preset length; and training the risk recognition model with the at least one input text as a training sample to obtain a target risk recognition model. In the training process, the training samples are obtained by de-duplicating the search logs, sorting the words according to the keyword dictionary and truncating the sorting result to the preset length. Compared with the prior art, in which the search logs are spliced directly into training samples, this improves training efficiency. Because the words are sorted by the keyword dictionary before truncation, the more important words are retained even after truncation, which preserves training accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart of a risk identification model training method disclosed in an embodiment of the present application;
FIG. 2 is a schematic diagram of a text prediction model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a risk identification model disclosed in an embodiment of the present application;
FIG. 4 is a flowchart of a user risk identification method disclosed in an embodiment of the present application;
FIG. 5 is a block diagram of a risk identification model training apparatus according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of a user risk identification apparatus according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The invention discloses a risk recognition model training method, a user risk recognition method and a related device, applied in the process of identifying user risk, where a risk recognition model is trained in advance before user risk identification is performed. Existing risk recognition models are constructed based on methods such as TextCNN, LSTM, and pre-trained neural networks, which take a long time to process long text input and recognize it poorly. Therefore, in order to solve the problems of low accuracy and low efficiency, the invention provides a risk identification model training method, the execution flow of which is shown in FIG. 1, and the method comprises the following steps:
S101, obtaining an initial sample;
In the embodiment of the invention, the initial sample is obtained from a preset sample library, wherein the initial sample comprises the risk level of at least one user and the search logs corresponding to that user.
S102, performing de-duplication on the search logs in the initial sample to obtain each word;
In the embodiment of the present invention, the search logs in the initial sample are obtained and each search log is de-duplicated as follows: the search logs are segmented into words (the embodiment of the invention does not limit the specific segmentation method) to obtain the initial words corresponding to the search logs; repeated words are then removed by hashing so that only one copy of each word remains, yielding the words.
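As a minimal sketch of hash-based de-duplication, a Python `set` (which is hash-backed) can track the words already seen. The function name `dedupe_words` is hypothetical, and splitting on whitespace is only a stand-in for a real word segmenter, since the embodiment leaves the segmentation method open.

```python
from typing import Iterable, List


def dedupe_words(search_logs: Iterable[str]) -> List[str]:
    """Segment each log entry and keep one copy of every word, in first-seen order."""
    seen = set()          # hash set of words already emitted
    unique: List[str] = []
    for entry in search_logs:
        # Assumption: whitespace splitting stands in for a proper word segmenter.
        for word in entry.split():
            if word not in seen:
                seen.add(word)
                unique.append(word)
    return unique


print(dedupe_words(["cash loan", "loan rates", "cash advance"]))
```

Keeping first-seen order is a convenience for readability only; the later sorting step reorders the words by the keyword dictionary anyway.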
S103, sorting the words by using a keyword dictionary to obtain a sorting result, wherein the keyword dictionary is pre-established and comprises a plurality of words whose order is set according to the importance of the words;
In the embodiment of the invention, a keyword dictionary is established in advance as follows. The search log texts of each user are spliced together and segmented into words. The search logs of all users are then traversed, and the number of users corresponding to each word (namely, the number of users who have searched the word) is counted and recorded as the word frequency T. At the same time, the proportion H of high-risk users and the proportion L of low-risk users among the users who have searched the word are counted, and the proportion H' of high-risk users and the proportion L' of low-risk users among all users is recorded. The high-low risk discrimination R is calculated from these proportions based on a preset formula, which the source gives only as an image:
Figure BDA0002787055530000071
The word importance is calculated as T × R, and the words are sorted in descending order of importance to construct the keyword dictionary.
Further, other formulas that combine the word frequency and the risk discrimination may be used to calculate the word importance, but the main idea is to use the product of the frequency and the risk discrimination.
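The dictionary construction can be sketched as follows, under the assumption that the discrimination R is supplied from outside as a callable, because the source's formula for R is available only as an image. The names `build_keyword_dictionary` and `discrimination`, and the alphabetical tie-break, are illustrative choices rather than part of the patent.

```python
from collections import Counter
from typing import Callable, Dict, List, Set


def build_keyword_dictionary(
    user_words: Dict[str, Set[str]],
    discrimination: Callable[[str], float],
) -> List[str]:
    """Order words by importance = T * R, largest first.

    T is the number of users who searched the word.
    R comes from the caller-supplied `discrimination` function (an assumption,
    since the patent's own formula is not reproduced in the text).
    """
    # Each user's words are a set, so a word counts once per user.
    frequency = Counter(word for words in user_words.values() for word in words)
    importance = {word: t * discrimination(word) for word, t in frequency.items()}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(importance, key=lambda w: (-importance[w], w))


# Toy data and toy R values, invented for illustration only.
toy_users = {"u1": {"loan", "weather"}, "u2": {"loan", "casino"}, "u3": {"weather"}}
toy_r = {"loan": 2.0, "weather": 0.5, "casino": 3.0}
print(build_keyword_dictionary(toy_users, toy_r.__getitem__))
```

In the toy data, "casino" has the highest R but is searched by only one user, so "loan" (T = 2, R = 2.0) still ranks first. This is the trade-off between coverage and discrimination that the product T × R encodes.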
In the embodiment of the invention, after the keyword dictionary is constructed, each word is looked up in the keyword dictionary to determine its position, and the words are sorted according to those positions to obtain the sorting result.
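Sorting by dictionary position amounts to ranking each word by its index in the keyword dictionary. In the sketch below, the function name is hypothetical, and placing words that are absent from the dictionary at the end is an assumption, since the text does not say how such words are handled.

```python
from typing import List, Sequence


def sort_by_keyword_dictionary(words: Sequence[str], keyword_dictionary: Sequence[str]) -> List[str]:
    """Sort words by their position in the keyword dictionary (most important first)."""
    rank = {word: index for index, word in enumerate(keyword_dictionary)}
    # Assumption: words missing from the dictionary rank after every known word.
    unknown_rank = len(rank)
    # sorted() is stable, so unknown words keep their original relative order.
    return sorted(words, key=lambda w: rank.get(w, unknown_rank))


print(sort_by_keyword_dictionary(
    ["weather", "foo", "loan", "casino"],
    ["loan", "casino", "weather"],
))
```

Building the `rank` mapping once avoids scanning the whole dictionary for every word.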
S104, truncating the sorting result into at least one input text according to a preset length;
In the embodiment of the present invention, the preset length is set based on experience or the specific application scenario. Preferably, the length of each sorting result in the initial sample is obtained and compared with the preset length. When the length is greater than the preset length, the sorting result is truncated to the preset length and used as an input text. When the length is less than the preset length, blanks are appended to the sorting result to pad it to the preset length, and the padded result is used as the input text. There is at least one input text.
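A minimal sketch of the truncate-or-pad step follows. The function name `fit_to_length` and the `"[PAD]"` placeholder used for the blank are assumptions, and the sketch keeps only the first `preset_length` words rather than producing several input texts from one sorting result.

```python
from typing import List, Sequence

PAD = "[PAD]"  # hypothetical blank token; the patent only says "blank"


def fit_to_length(sorted_words: Sequence[str], preset_length: int, pad: str = PAD) -> List[str]:
    """Return exactly preset_length items: truncate long inputs, pad short ones."""
    if preset_length <= 0:
        raise ValueError("preset_length must be positive")
    if len(sorted_words) >= preset_length:
        # The words are already sorted by importance, so the tail is the least important part.
        return list(sorted_words[:preset_length])
    return list(sorted_words) + [pad] * (preset_length - len(sorted_words))


print(fit_to_length(["loan", "casino", "weather"], 2))
print(fit_to_length(["loan"], 3))
```

Because sorting happens before truncation, the words dropped here are the ones the keyword dictionary ranks lowest.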
S105, training a risk recognition model by taking the at least one input text as a training sample to obtain a target risk recognition model, wherein the risk recognition model is constructed based on an Embedding layer and a Transformer structure.
In the embodiment of the invention, a risk identification model is constructed in advance based on an Embedding layer and a Transformer structure, both of which are taken from a target text prediction model. The target text prediction model is obtained by training a text prediction model on a preset training corpus. A schematic diagram of the text prediction model is shown in FIG. 2. In the text prediction model, the Embedding layer maps characters into vectors of fixed dimension; the Masked Multi Self Attention layer represents the self-attention mechanism; the Layer Norm layer represents normalization; the Feed Forward layer represents two fully connected layers; and the text prediction layer represents a fully connected layer and a loss function. The whole part inside the dashed box is called a Transformer structure, and this structure is repeated 4 times. Repeating the structure increases the number of adjustable parameters, and a multi-layer structure can capture information at higher levels of abstraction: lower layers capture information such as sentence length and word meaning, higher layers capture grammatical structure, and still higher layers capture semantic information. Repeating too many times increases the amount of computation without a significant improvement in effect, so four repetitions are chosen as a compromise between computation and effect. An extra character is added at the front of the text input to represent the information of the whole sentence. For a text input of a given length, the output immediately before the text prediction layer has the same length; each position is a vector representing the text information at the corresponding position, and the first position represents the information of the entire sentence.
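The text names the layers but does not write out their equations. For orientation only, the standard scaled dot-product form of masked self-attention and of a two-layer feed-forward block is given here. The symbols Q, K, V, d_k, M, W_1, W_2, b_1, b_2 and the activation σ are assumptions borrowed from the general Transformer literature rather than from this document, and the text does not say which positions the mask M hides.

```latex
\[
\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + M\right) V,
\qquad
M_{ij} =
\begin{cases}
0 & \text{if position } i \text{ may attend to position } j,\\
-\infty & \text{otherwise.}
\end{cases}
\]
\[
\mathrm{FFN}(x) = W_2\,\sigma(W_1 x + b_1) + b_2
\]
```

Under this reading, one Transformer structure applies attention, normalization and the feed-forward block in sequence, and the output keeps one vector per input position, consistent with the statement that the output has the same length as the input.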
Training is performed on the preset training corpus using a general neural network training method; the specific training process depends on the choice of loss function and is not described in detail here. After training is finished, the target text prediction model is obtained, which has the ability to produce feature representations of text and to make predictions. For example, suppose the preset training corpus is "true good weather today". Some words or characters in it are randomly replaced with a mask, so that the text contains the mask token [M] in place of the replaced character. The replaced text is then input into the text prediction model and, after word embedding (Embedding) and the 4 Transformer structures, the replaced character is predicted. In this example, the character "day" is predicted at the position corresponding to [M], and a classification loss is used.
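The random masking used to build text prediction training pairs can be sketched as below. The function name `mask_tokens`, the `mask_prob` parameter and the returned target mapping are assumptions for illustration; the text only says that some words or characters are randomly replaced by a mask.

```python
import random
from typing import Dict, List, Tuple

MASK = "[M]"


def mask_tokens(
    tokens: List[str],
    mask_prob: float,
    rng: random.Random,
) -> Tuple[List[str], Dict[int, str]]:
    """Replace each token with [M] with probability mask_prob.

    Returns the masked sequence and a mapping from masked positions to the
    original tokens, which serve as prediction targets for a classification loss.
    """
    masked = list(tokens)
    targets: Dict[int, str] = {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[position] = MASK
            targets[position] = token
    return masked, targets


# mask_prob = 0.15 is an arbitrary illustrative value, not taken from the patent.
masked, targets = mask_tokens(list("abcdef"), 0.15, random.Random(0))
print(masked, targets)
```

Only the masked positions contribute targets, so the model is trained to recover the original character from its surrounding context.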
The Embedding layer and the Transformer structure in the target text prediction model are acquired, a risk identification layer is added, and the risk identification model is constructed in the order of the Embedding layer, the Transformer structure and the risk identification layer. A structural block diagram of the risk identification model is shown in FIG. 3. As in the text prediction model, the Embedding layer in the risk identification model maps characters into vectors of fixed dimension; the Masked Multi Self Attention layer represents the self-attention mechanism; the Layer Norm layer represents normalization; the Feed Forward layer represents two fully connected layers; and the whole part inside the dashed box is called a Transformer structure, which is repeated 4 times. Preferably, the risk label of the risk identification layer is high risk or low risk, so a binary classification label is output. With the at least one input text as the training sample, the training sample is passed to the risk identification model for training to obtain the target risk identification model.
Further, the target risk identification model may also perform risk identification based on LSTM or TextCNN, and the risk identification layer in the target risk identification model may also output labels in other forms; the embodiment of the present invention does not limit the specific form of the label.
The invention discloses a risk identification model training method, which comprises the following steps: obtaining an initial sample; de-duplicating the search logs in the initial sample to obtain each word; sorting the words by using a keyword dictionary to obtain a sorting result, wherein the keyword dictionary is pre-established and comprises a plurality of words whose order is set according to the importance of the words; truncating the sorting result into at least one input text according to a preset length; and training the risk recognition model with the at least one input text as a training sample to obtain a target risk recognition model. In the training process, the training samples are obtained by de-duplicating the search logs, sorting the words according to the keyword dictionary and truncating the sorting result to the preset length. Compared with the prior art, in which the search logs are spliced directly into training samples, this improves training efficiency. Because the words are sorted by the keyword dictionary before truncation, the more important words are retained even after truncation, which preserves training accuracy.
Based on the target risk identification model, an embodiment of the present invention further provides a user risk identification method, the execution flow of which is shown in FIG. 4. The method includes the steps of:
S201, when a risk identification request for a current user is received, calling a target risk identification model, wherein the target risk identification model is obtained by training based on the above training method;
In the embodiment of the invention, when a risk identification request for the current user is received, the target risk identification model obtained by the above training method is called, and risk identification is performed on the current user based on the target risk identification model.
S202, obtaining the current search log of the current user, and performing de-duplication on the current search log to obtain each current word;
In the embodiment of the present invention, the current search log corresponding to the current user is obtained based on the current user's name, number, or other preferred identifier, where the current search log is the log of the current user's searches. The current search log is de-duplicated to obtain the current words; the de-duplication process is the same as that described in S102 and is not repeated here.
S203, sorting the current words according to the keyword dictionary to obtain a current sorting result;
In the embodiment of the present invention, the sorting process is the same as that described in S103 and is not repeated here.
S204, truncating the current sorting result into a current input text according to a preset length;
In the embodiment of the present invention, the truncation process is the same as that described in S104 and is not repeated here.
S205, transmitting the current input text to the target risk identification model for risk identification.
In the embodiment of the invention, the current input text is transmitted to the risk identification model for identification, and whether the current user is a low-risk user or a high-risk user is determined.
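Steps S202 to S205 can be strung together as in the sketch below. Everything named here is hypothetical: `identify_user_risk`, the `"[PAD]"` blank, whitespace splitting as the segmenter, and especially `toy_model`, which is a stand-in callable and not the trained Embedding-plus-Transformer model.

```python
from typing import Callable, List, Sequence


def identify_user_risk(
    current_logs: Sequence[str],
    keyword_dictionary: Sequence[str],
    preset_length: int,
    model: Callable[[List[str]], str],
    pad: str = "[PAD]",
) -> str:
    """Run S202-S205: de-duplicate, sort, truncate or pad, then call the model."""
    # S202: segment (whitespace split is an assumption) and keep one copy of each word.
    words = list(dict.fromkeys(w for entry in current_logs for w in entry.split()))
    # S203: sort by dictionary position; unknown words go last (an assumption).
    rank = {w: i for i, w in enumerate(keyword_dictionary)}
    words.sort(key=lambda w: rank.get(w, len(rank)))
    # S204: truncate to the preset length, or pad with blanks up to it.
    text = words[:preset_length]
    text += [pad] * (preset_length - len(text))
    # S205: pass the input text to the target risk identification model.
    return model(text)


def toy_model(text: List[str]) -> str:
    """Stand-in classifier for illustration only; not the patent's trained model."""
    return "high" if "casino" in text else "low"


print(identify_user_risk(
    ["weather today", "casino bonus weather"],
    ["casino", "loan", "weather"],
    2,
    toy_model,
))
```

The same keyword dictionary and preset length that were used for training have to be reused here, so that the model sees inputs in the form it was trained on.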
The invention discloses a user risk identification method, which comprises the following steps: when a risk identification request for a current user is received, calling a target risk identification model, wherein the target risk identification model is obtained by training based on the above training method; acquiring the current search log of the current user, and de-duplicating the current search log to obtain each current word; sorting the current words according to the keyword dictionary to obtain a current sorting result; truncating the current sorting result into a current input text according to a preset length; and transmitting the current input text to the target risk identification model for risk identification. In the recognition process, the current input text is obtained by de-duplicating the current search log to obtain the individual words, sorting those words according to the keyword dictionary, and truncating the sorting result to the preset length.
Furthermore, the risk identification method of the invention constructs the keyword dictionary by jointly using the frequency of each word and its high-low risk discrimination degree. Because both the coverage of a word and its discriminative power are taken into account, the selected words distinguish high-risk users from low-risk users well; these words form the keyword dictionary, which is used to extract key information from long text. Important words are arranged at the front of the text so that as many of them as possible are retained when the text is truncated. The risk identification model is fine-tuned based on the user search logs and the risk labels to obtain the target risk identification model, and risk identification is performed based on the target risk identification model, so that the identification method can better capture the co-occurrence relationship between the search logs and the risk labels.
Based on the above training method for the risk recognition model, in an embodiment of the present invention, a training device for the risk recognition model is further provided, and a structural block diagram of the training device is shown in fig. 5, where the training device includes:
an initial sample acquisition module 301, a first deduplication module 302, a first ordering module 303, a first truncation module 304, and a training module 305.
Wherein,
the initial sample obtaining module 301 is configured to obtain an initial sample;
the first duplicate removal module 302 is configured to perform duplicate removal processing on the search log in the initial sample to obtain each word;
the first sorting module 303 is configured to sort the words by using a keyword dictionary to obtain a sorting result, where the keyword dictionary is pre-established and includes a plurality of words, and the order of the words is set according to the importance degree of the words;
the first truncating module 304 is configured to truncate the sorting result into at least one input text according to a preset length;
the training module 305 is configured to train a risk recognition model by using the at least one input text as a training sample to obtain a target risk recognition model, where the risk recognition model is constructed based on an Embedding layer and a Transformer structure.
The invention discloses a risk identification model training device, which is configured to: obtain an initial sample; deduplicate the search logs in the initial sample to obtain the words; sort the words by using a keyword dictionary to obtain a sorting result, wherein the keyword dictionary is pre-established and comprises a plurality of words, and the order of the words is set according to the importance degree of the words; truncate the sorting result into at least one input text according to a preset length; and train the risk recognition model by taking the at least one input text as a training sample to obtain a target risk recognition model. In the training process, the training samples are obtained by deduplicating the search logs, sorting the words according to the keyword dictionary, and truncating the sorting result to the preset length. Compared with the prior art, in which training samples are obtained by direct splicing, training efficiency is improved; and because the words are sorted based on the keyword dictionary before truncation, the words with higher importance are retained even after truncation, so training accuracy is maintained.
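One possible reading of "truncating the sorting result into at least one input text according to a preset length", combined with the blank padding used when the result is shorter than the preset length, is sketched below in Python. The function name `split_into_input_texts`, the `<pad>` token, and the decision to pad the last chunk of a long result are assumptions for illustration; the patent text does not fix these details.

```python
from typing import List

PAD = "<pad>"  # hypothetical blank token used for padding


def split_into_input_texts(sorted_words: List[str],
                           preset_length: int,
                           pad: str = PAD) -> List[List[str]]:
    """Cut a sorted word list into fixed-length input texts."""
    if preset_length <= 0:
        raise ValueError("preset_length must be positive")
    # Consecutive chunks of at most preset_length words, most important first.
    chunks = [sorted_words[i:i + preset_length]
              for i in range(0, len(sorted_words), preset_length)]
    if not chunks:
        chunks = [[]]  # an empty log still yields one (fully padded) input text
    # Pad the final chunk with blanks so every input text has the same length.
    missing = preset_length - len(chunks[-1])
    chunks[-1] = chunks[-1] + [pad] * missing
    return chunks


texts = split_into_input_texts(["loan", "casino", "weather"], preset_length=2)
print(texts)  # [['loan', 'casino'], ['weather', '<pad>']]
```

Fixed-length inputs are convenient for batching; whether the later chunks are used as additional training samples or discarded is a design choice left open here.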
In this embodiment of the present invention, the process of establishing the keyword dictionary in the first sorting module 303 includes:
a splicing unit 306, a word segmentation unit 307, a calculating unit 308 and a sorting unit 309.
Wherein,
the splicing unit 306 is configured to splice all the search logs of each user to obtain a spliced text;
the word segmentation unit 307 is configured to perform word segmentation on the spliced text to obtain each word;
the calculating unit 308 is configured to calculate the high-low risk discrimination degree and the occurrence frequency corresponding to each word, and take the product of the high-low risk discrimination degree and the occurrence frequency as the importance value of the word;
the sorting unit 309 is configured to sort the words based on the importance values to obtain the keyword dictionary.
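The dictionary-building units above can be sketched in Python as follows. This is an illustrative sketch only: it assumes each user's search logs have already been spliced and segmented into a word list, and that every user is labeled either high-risk or low-risk. The patent's discrimination formula is given only as a figure, so the sketch takes it as a caller-supplied function; the `placeholder_discrimination` stand-in used in the example is an assumption and is not the patent's formula.

```python
from collections import Counter
from typing import Callable, Dict, List, Set


def build_keyword_dictionary(
        user_words: Dict[str, List[str]],
        high_risk_users: Set[str],
        discrimination: Callable[[float, float, float, float], float],
) -> List[str]:
    """Rank words by importance = discrimination degree * occurrence frequency."""
    n_users = len(user_words)
    # H' and L': proportions of high-risk and low-risk users among all users.
    h_all = sum(u in high_risk_users for u in user_words) / n_users
    l_all = 1.0 - h_all

    frequency: Counter = Counter()        # total occurrences of each word
    searchers: Dict[str, Set[str]] = {}   # users who searched each word
    for user, words in user_words.items():
        frequency.update(words)
        for word in set(words):
            searchers.setdefault(word, set()).add(user)

    importance = {}
    for word, users in searchers.items():
        # H and L: proportions of high-risk and low-risk users among searchers.
        h = len(users & high_risk_users) / len(users)
        l = 1.0 - h
        importance[word] = discrimination(h, l, h_all, l_all) * frequency[word]

    # Most important words first; ties broken alphabetically for determinism.
    return sorted(importance, key=lambda w: (-importance[w], w))


def placeholder_discrimination(h, l, h_all, l_all):
    # Stand-in only (NOT the patent's formula): deviation from the base rate.
    return abs(h - h_all)


user_words = {
    "u1": ["loan", "loan", "casino"],
    "u2": ["weather", "loan"],
    "u3": ["weather", "casino"],
    "u4": ["weather"],
}
keyword_dictionary = build_keyword_dictionary(
    user_words, high_risk_users={"u1", "u3"},
    discrimination=placeholder_discrimination)
print(keyword_dictionary)  # ['casino', 'weather', 'loan']
```

With this stand-in, "loan" is frequent but searched equally by both groups, so it ranks last, which illustrates why frequency alone is not used as the importance value.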
In the embodiment of the present invention, constructing the risk identification model based on the Embedding layer and the Transformer structure in the training module 305 involves:
a training unit 310, an acquisition unit 311 and a construction unit 312.
Wherein,
the training unit 310 is configured to train a text prediction model based on a preset training corpus to obtain a target text prediction model, where the target text prediction model includes: the Embedding layer and the Transformer structure;
the obtaining unit 311 is configured to obtain the Embedding layer and the Transformer structure when training is completed;
the constructing unit 312 is configured to add a risk identification layer, and construct the risk identification model based on the order of the Embedding layer, the Transformer structure, and the risk identification layer.
Based on the above user risk identification method, in an embodiment of the present invention, a user risk identification device is further provided, and a structural block diagram of the identification device is shown in fig. 6, where the structural block diagram includes:
a calling module 401, a second deduplication module 402, a second sorting module 403, a second interception module 404, and an identification module 405.
Wherein,
the invoking module 401 is configured to invoke a target risk recognition model when a risk recognition request for a current user is received, where the target risk recognition model is obtained by training based on the training method of any one of claims 1 to 5;
the second duplicate removal module 402 is configured to obtain a current search log of the current user, and perform duplicate removal processing on the current search log to obtain each current word;
the second sorting module 403 is configured to sort the current words according to the keyword dictionary to obtain a current sorting result;
the second intercepting module 404 is configured to intercept the sorting result as a current input text according to a preset length;
the recognition module 405 is configured to transmit the current input text to the target risk recognition model for risk recognition.
The invention discloses a user risk identification device, which is configured to: when a risk identification request for a current user is received, call a target risk identification model, wherein the target risk identification model is obtained by training based on the training method of any one of claims 1-5; acquire the current search log of the current user, and deduplicate the current search log to obtain the current words; sort the current words according to the keyword dictionary to obtain a current sorting result; truncate the current sorting result into a current input text according to a preset length; and transmit the current input text to the target risk identification model for risk identification. In the identification process, the current input text is obtained by deduplicating the current search log to obtain the current words, sorting the current words according to the keyword dictionary, and truncating the sorting result to the preset length.
It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the device embodiments are basically similar to the method embodiments, they are described briefly; for relevant details, refer to the corresponding description of the method embodiments.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the units may be implemented in the same software and/or hardware or in a plurality of software and/or hardware when implementing the invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The risk identification model training method, the user risk identification method and the related devices provided by the invention are described in detail, specific examples are applied in the text to explain the principle and the implementation mode of the invention, and the description of the above embodiments is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A risk recognition model training method is characterized by comprising the following steps:
obtaining an initial sample;
carrying out duplicate removal processing on the search logs in the initial sample to obtain each word;
sequencing each word by utilizing a keyword dictionary to obtain a sequencing result, wherein the keyword dictionary is pre-established and comprises a plurality of words, and the sequence of the words is set according to the importance degree of the words;
intercepting the sequencing result into at least one input text according to a preset length;
and training a risk recognition model by taking the at least one input text as a training sample to obtain a target risk recognition model, wherein the risk recognition model is constructed based on an Embedding layer and a Transformer structure.
2. The method of claim 1, wherein the keyword dictionary establishing process comprises:
all the search logs of each user are spliced to obtain a spliced text;
segmenting the spliced text to obtain each word;
calculating the high-low risk discrimination degree and the occurrence frequency corresponding to each word, and taking the product of the high-low risk discrimination degree and the occurrence frequency as the importance value of the word;
and sequencing the words based on the importance value to obtain the keyword dictionary.
3. The method of claim 2, wherein calculating the high-low risk discrimination degree corresponding to each word comprises:
counting the proportion H of high-risk users and the proportion L of low-risk users among all users who have searched the word;
obtaining the proportion H' of high-risk users and the proportion L' of low-risk users among all users;
calculating the high-low risk discrimination degree based on a preset formula
Figure FDA0002787055520000011
wherein R represents the high-low risk discrimination degree.
4. The method of claim 1, wherein constructing a risk identification model based on the Embedding layer and the Transformer structure comprises:
training a text prediction model based on a preset training corpus to obtain a target text prediction model, wherein the target text prediction model comprises: the Embedding layer and the Transformer structure;
when training is completed, acquiring the Embedding layer and the Transformer structure;
adding a risk identification layer, and constructing the risk identification model based on the order of the Embedding layer, the Transformer structure and the risk identification layer.
5. The method of claim 1, further comprising:
acquiring the length of the sequencing result;
and under the condition that the length is smaller than the preset length, padding the sequencing result with blanks up to the preset length.
6. A method for identifying user risk, comprising:
under the condition that a risk identification request for a current user is received, calling a target risk identification model, wherein the target risk identification model is obtained by training based on the training method of any one of claims 1-5;
acquiring the current search log of the current user, and performing duplicate removal processing on the current search log to obtain each current word;
sequencing each current word according to the keyword dictionary to obtain a current sequencing result;
intercepting the sequencing result into a current input text according to a preset length;
and transmitting the current input text to the target risk identification model for risk identification.
7. A risk recognition model training device, comprising:
the initial sample acquisition module is used for acquiring an initial sample;
the first duplicate removal module is used for carrying out duplicate removal processing on the search logs in the initial sample to obtain each word;
the first sequencing module is used for sequencing each word by utilizing a keyword dictionary to obtain a sequencing result, wherein the keyword dictionary is pre-established and comprises a plurality of words, and the sequence of the words is set according to the importance degree of the words;
the first intercepting module is used for intercepting the sequencing result into at least one input text according to a preset length;
and the training module is used for training a risk recognition model by taking the at least one input text as a training sample to obtain a target risk recognition model, wherein the risk recognition model is constructed based on an Embedding layer and a Transformer structure.
8. The apparatus of claim 7, wherein the process of building the keyword dictionary in the first ranking module comprises:
the splicing unit is used for splicing all the search logs of each user to obtain a spliced text;
the word segmentation unit is used for segmenting the spliced text to obtain each word;
the calculation unit is used for calculating the high-low risk discrimination degree and the occurrence frequency corresponding to each word, and taking the product of the high-low risk discrimination degree and the occurrence frequency as the importance value of the word;
and the sequencing unit is used for sequencing the words based on the importance value to obtain the keyword dictionary.
9. The apparatus according to claim 7, wherein the building of the risk recognition model based on the Embedding layer and the Transformer structure in the training module comprises:
the training unit is used for training the text prediction model based on a preset training corpus to obtain a target text prediction model, wherein the target text prediction model comprises: the Embedding layer and the Transformer structure;
the acquisition unit is used for acquiring the Embedding layer and the Transformer structure when training is finished;
and the construction unit is used for adding a risk identification layer and constructing the risk identification model based on the order of the Embedding layer, the Transformer structure and the risk identification layer.
10. A user risk identification device, comprising:
the calling module is used for calling a target risk recognition model under the condition that a risk recognition request for a current user is received, wherein the target risk recognition model is obtained by training based on the training method of any one of claims 1-5;
the second duplicate removal module is used for acquiring the current search log of the current user and performing duplicate removal processing on the current search log to obtain each current word;
the second sequencing module is used for sequencing all the current words according to the keyword dictionary to obtain a current sequencing result;
the second intercepting module is used for intercepting the sequencing result into a current input text according to a preset length;
and the identification module is used for transmitting the current input text to the target risk identification model for risk identification.
CN202011301542.4A 2020-11-19 2020-11-19 Risk identification model training method, user risk identification method and related devices Active CN112308453B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011301542.4A CN112308453B (en) 2020-11-19 2020-11-19 Risk identification model training method, user risk identification method and related devices

Publications (2)

Publication Number Publication Date
CN112308453A true CN112308453A (en) 2021-02-02
CN112308453B CN112308453B (en) 2023-04-28

Family

ID=74335351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011301542.4A Active CN112308453B (en) 2020-11-19 2020-11-19 Risk identification model training method, user risk identification method and related devices

Country Status (1)

Country Link
CN (1) CN112308453B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263158A (en) * 2019-05-24 2019-09-20 阿里巴巴集团控股有限公司 A kind of processing method of data, device and equipment
CN110955750A (en) * 2019-11-11 2020-04-03 北京三快在线科技有限公司 Combined identification method and device for comment area and emotion polarity, and electronic equipment
CN110968689A (en) * 2018-09-30 2020-04-07 北京国双科技有限公司 Training method of criminal name and law bar prediction model and criminal name and law bar prediction method
CN111144131A (en) * 2019-12-25 2020-05-12 北京中科研究院 Network rumor detection method based on pre-training language model
CN111401062A (en) * 2020-03-25 2020-07-10 支付宝(杭州)信息技术有限公司 Text risk identification method, device and equipment

Also Published As

Publication number Publication date
CN112308453B (en) 2023-04-28

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: B7-7-2, Yuxing Plaza, No.5, Huangyang Road, Yubei District, Chongqing

Applicant after: Chongqing duxiaoman Youyang Technology Co.,Ltd.

Address before: 201800 room 307, 3 / F, building 8, 55 Huiyuan Road, Jiading District, Shanghai

Applicant before: SHANGHAI YOUYANG NEW MEDIA INFORMATION TECHNOLOGY Co.,Ltd.

CB02 Change of applicant information
TA01 Transfer of patent application right

Effective date of registration: 20211220

Address after: 100193 Room 606, 6 / F, building 4, West District, courtyard 10, northwest Wangdong Road, Haidian District, Beijing

Applicant after: Du Xiaoman Technology (Beijing) Co.,Ltd.

Address before: B7-7-2, Yuxing Plaza, No.5, Huangyang Road, Yubei District, Chongqing

Applicant before: Chongqing duxiaoman Youyang Technology Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant