CN113254649A - Sensitive content recognition model training method, text recognition method and related device - Google Patents

Sensitive content recognition model training method, text recognition method and related device

Info

Publication number
CN113254649A
Authority
CN
China
Prior art keywords
text
user
sensitive content
vector
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110691212.9A
Other languages
Chinese (zh)
Other versions
CN113254649B (en)
Inventor
成杰峰 (Cheng Jiefeng)
彭奕 (Peng Yi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202110691212.9A
Publication of CN113254649A
Application granted
Publication of CN113254649B
Legal status: Active
Anticipated expiration

Classifications

    • G06F 16/35: Information retrieval of unstructured textual data; clustering; classification
    • G06F 16/367: Creation of semantic tools (e.g. ontology or thesauri); ontology
    • G06F 40/126: Handling natural language data; use of codes for handling textual entities; character encoding
    • G06F 40/279: Natural language analysis; recognition of textual entities
    • G06F 40/30: Natural language analysis; semantic analysis
    • G06N 3/045: Neural network architectures; combinations of networks
    • G06N 3/08: Neural networks; learning methods
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for training a sensitive content recognition model, comprising the following steps: acquiring a plurality of user texts and a plurality of user accounts; constructing a knowledge graph based on the plurality of user accounts and the association relationships among them, the knowledge graph comprising a plurality of nodes; obtaining a plurality of account feature vectors according to the plurality of nodes; extracting a plurality of text feature vectors of the plurality of user texts; splicing each account feature vector with its one or more corresponding text feature vectors to obtain a plurality of fusion feature vectors; and taking the plurality of fusion feature vectors as a plurality of groups of training samples, inputting each group of training samples into a classification model to be trained, and training that model to obtain the sensitive content recognition model. By fusing the features of the user text and the user account, and using the fused features both for training and for recognizing sensitive content, the invention improves recognition accuracy and training efficiency.

Description

Sensitive content recognition model training method, text recognition method and related device
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a training method for a sensitive content recognition model, a text recognition method, and a related device.
Background
Existing sensitive content recognition models typically employ supervised machine learning schemes, such as text classifiers based on CNN models. The recognition capability of such a text classifier for sensitive content depends on the amount of information in the labeled samples; when that information is insufficient to train a good text classifier, the existing text classifier cannot correctly identify categories of sensitive-information text that do not appear in the labeled samples.
Existing pre-trained models such as ELMo (Embeddings from Language Models) are huge, and their size means that methods which apply a pre-trained model to process text require a great deal of time and are difficult to apply in practical scenarios. It is therefore very important to find a text training method that is accurate, efficient, and convenient to apply.
Disclosure of Invention
The invention aims to provide a training method for a sensitive content recognition model, a text recognition method, a computer device, and a computer-readable storage medium, to solve the following problem: the prior art cannot correctly identify categories of sensitive-information text that do not appear in the labeled samples.
The first aspect of the embodiments of the present invention provides a method for training a sensitive content recognition model, including:
acquiring a plurality of user texts and a plurality of user accounts, wherein each user text is a sensitive content text or a non-sensitive content text and each user text is associated with one user account; constructing a knowledge graph based on the plurality of user accounts and the association relationships among the user accounts, wherein the knowledge graph comprises a plurality of nodes and each node corresponds to one of the plurality of user accounts; obtaining a plurality of account feature vectors according to the plurality of nodes, wherein each account feature vector corresponds to one node among the plurality of nodes; extracting a plurality of text feature vectors of the plurality of user texts, wherein each text feature vector corresponds to one user text; splicing each user account feature vector with the corresponding one or more text feature vectors to obtain a plurality of fusion feature vectors; and taking the plurality of fusion feature vectors as a plurality of groups of training samples and inputting each group of training samples into a classification model to be trained, so as to train the classification model and obtain the sensitive content recognition model.
Optionally, the step of constructing a knowledge graph based on the plurality of user accounts and the association relationships among the user accounts includes: acquiring account information of each of the plurality of user accounts to obtain a plurality of pieces of user account information, wherein the user account information comprises registration and login information of the corresponding user account; acquiring a plurality of groups of associated accounts based on the plurality of pieces of user account information, wherein each group of associated accounts comprises two user accounts that share at least one identical piece of user account information; and constructing the knowledge graph according to the plurality of groups of associated accounts, wherein each user account corresponds to a node in the knowledge graph, and the identical user account information shared by the two user accounts of each group of associated accounts is used to construct an edge between the two corresponding nodes.
Optionally, the step of obtaining a plurality of account feature vectors according to the plurality of nodes includes: embedding a plurality of nodes and a plurality of edges corresponding to a plurality of groups of associated accounts on the knowledge graph into an objective function, and calculating a plurality of account feature vectors corresponding to the plurality of nodes through the objective function, wherein the objective function is as follows:
[The objective function is given as an image in the original publication.]

where $E_{ij}$ represents the weight of the edge; $\phi(u_i)$ and $\phi(u_j)$ represent the $i$-th node $v_i$ and the $j$-th node $v_j$, respectively; $\phi'(u_j)$ represents a node adjacent to node $v_j$; and $u_i$ and $u_j$ represent the account feature vector representations of nodes $v_i$ and $v_j$, respectively.
Optionally, the step of extracting a plurality of text feature vectors of the plurality of user texts, wherein each text feature vector corresponds to one user text, includes: preprocessing the user texts to obtain a plurality of vector matrices; and inputting the plurality of vector matrices into a convolutional neural network to obtain a plurality of text feature vectors corresponding to the plurality of vector matrices, each vector matrix corresponding to one text feature vector among the plurality of text feature vectors.
Optionally, the step of preprocessing the user texts to obtain a plurality of vector matrices includes: performing word segmentation on each sentence in the plurality of user texts to obtain a segmented-word set for each user text; encoding each word in each segmented-word set so as to convert each word into a corresponding word vector; and acquiring a vector matrix for each sentence, taking the sentences of each user text as units, wherein the vector matrix is constructed from the word vectors corresponding to the words of the respective sentence, each row of the vector matrix corresponding to one word vector.
One aspect of the embodiments of the present invention further provides a text recognition method, including: determining a target user text to be processed; extracting a text feature vector of the target user text; searching for the target user account associated with the target user text; acquiring an account feature vector corresponding to the target user account; splicing the account feature vector and the text feature vector to obtain a fusion feature vector; and inputting the fusion feature vector into a trained sensitive content recognition model so as to output the text type of the target user text through the sensitive content recognition model, wherein the text type is sensitive content text or non-sensitive content text, and the sensitive content recognition model is obtained by the training method of the sensitive content recognition model described above.
Optionally, the sensitive content recognition model includes a plurality of classifiers, and the step of inputting the fused feature vector into the trained sensitive content recognition model to output the target text type of the target user text through the sensitive content recognition model includes:
inputting the fusion feature vector into each classifier of the plurality of classifiers to obtain a plurality of text types, the plurality of text types corresponding one-to-one to the output results of the plurality of classifiers; and determining, from the plurality of text types, the text type whose proportion exceeds a preset threshold as the target text type.
One aspect of the embodiments of the present invention further provides a system for training a sensitive content recognition model, including:
the acquisition module is used for acquiring a plurality of user texts and a plurality of user accounts; each user text is a sensitive content text or a non-sensitive content text, and each user text is respectively associated with one user account;
the map building module is used for building a knowledge map based on the plurality of user accounts and the incidence relation among the user accounts; wherein the knowledge-graph comprises a plurality of nodes, each node corresponding to one of the plurality of user accounts;
the conversion module is used for acquiring a plurality of account characteristic vectors according to the plurality of nodes, wherein each account characteristic vector corresponds to one node in the plurality of nodes;
the extraction module is used for extracting a plurality of text feature vectors of the user texts, wherein each text feature vector corresponds to one user text;
the vector splicing module is used for splicing each user account feature vector with one or more corresponding text feature vectors to obtain a plurality of fusion feature vectors;
and the training module is used for taking the plurality of fusion feature vectors as a plurality of groups of training samples, respectively inputting each group of training samples into a classification model to be trained, and training the classification model to be trained to obtain the sensitive content recognition model.
An aspect of the embodiments of the present invention further provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method for training the sensitive content recognition model or the method for text recognition as described above.
An aspect of the embodiments of the present invention further provides a computer-readable storage medium, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method for training the sensitive content recognition model or the method for text recognition as described above.
The embodiments of the invention provide a training method and a text recognition method for a sensitive content recognition model. The training samples of this embodiment are determined by analyzing the association between accounts and texts: a user who publishes sensitive content text usually registers multiple accounts, and when the sensitive information published through one account is restricted, the sensitive content text is usually sent again through other accounts on the same device, and/or the same gateway, and/or the same IP address, and/or within the same time period. Exploiting this characteristic, the classifier is trained after the two different types of features, the user text and the associated account, are fused as vectors. Compared with training the classifier on text information alone, the classifier of this embodiment incorporates an association label (namely the user account corresponding to the user text) during training, which improves the recognition accuracy and the training efficiency of the sensitive content recognition model.
Drawings
FIG. 1 is a schematic diagram illustrating an environmental application of a method for training a sensitive content recognition model according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for training a sensitive content recognition model according to a first embodiment of the invention;
FIG. 3 is a diagram illustrating sub-steps of step S201 in FIG. 2;
FIG. 4 is a diagram illustrating sub-steps of step S203 in FIG. 2;
FIG. 5 schematically shows a flow chart of a text recognition method according to a second embodiment of the invention;
FIG. 6 schematically shows a block diagram of a training system according to a third embodiment of the invention; and
FIG. 7 schematically shows a hardware architecture diagram of a computer device suitable for implementing the training method of the sensitive content recognition model according to a fourth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the descriptions involving "first", "second", etc. in the embodiments of the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated; thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, provided such combinations can be realized by a person skilled in the art; when technical solutions contradict each other or cannot be realized, the combination should be deemed not to exist and falls outside the protection scope of the present invention.
In the description of the present invention, it should be understood that the numerical labels before the steps do not indicate the order in which the steps are performed; they merely facilitate description and distinguish the steps, and should not be construed as limiting the present invention.
The following are explanations of terms involved in the present invention:
A Knowledge Graph, in library and information science, is a series of diagrams showing the development and structural relationships of knowledge: it uses visualization techniques to describe knowledge resources and their carriers, and to mine, analyze, construct, draw, and display knowledge and the interrelations among knowledge resources and knowledge carriers. In the embodiments of the present invention, the knowledge graph refers to a multi-relational graph, comprising multiple types of nodes and multiple types of edges; nodes in the graph are usually expressed as "entities" and edges as "relationships". Entities refer to things in the real world (in this embodiment, user accounts), and relationships express connections between different entities.
A Convolutional Neural Network (CNN) is a class of feedforward neural networks that involve convolution computation and have a deep structure, and is one of the representative algorithms of deep learning. A CNN has representation-learning capability and can perform shift-invariant classification of input information according to its hierarchical structure; in this embodiment, the convolutional neural network is used for feature extraction.
A classifier's conventional task is to learn classification rules from given classes and known training data, and then to classify (or predict) unknown data. As a preferred scheme of this embodiment, three classifiers are used: an SVM classifier, an XGBoost classifier, and a Logistic classifier.
The SVM (support vector machine) is a two-class classification model. Its classification idea is: given a sample set containing positive and negative examples, find a hyperplane that separates the samples into positive and negative such that the points closest to the hyperplane have as large a margin as possible. In this embodiment, the positive examples are the fusion feature vectors of non-sensitive content texts, and the negative examples are the fusion feature vectors of sensitive content texts.
XGBoost uses CART regression trees as base classifiers and continually generates new CART regression trees. Generating a tree amounts to learning a new function that maps each sample to a uniquely determined leaf node, with all samples in the same leaf node sharing the same predicted value (the prediction for each sample is the sum of the predicted scores of all trees); the objective of the function is to fit the historical residuals of the samples in all leaf nodes, so that the optimal tree model is found and added to the overall model.
The Logistic classifier models data with the Bernoulli distribution; like the SVM classifier, it is a two-class method and divides data into classes 0 and 1. Logistic classification comprises linear summation, sigmoid activation, error calculation, and parameter correction.
FIG. 1 is a schematic diagram illustrating an environmental application of a method for training a sensitive content recognition model according to an embodiment of the present invention. In an exemplary embodiment, as shown in fig. 1, cloud server 2 may be connected to computer device 6 through network 4.
The cloud server 2 may provide the computer device 6, through the network 4, with query and download services for user texts, user accounts, and the like.
The cloud server 2 may be a device such as: rack-mounted servers, blade servers, tower servers, or rack servers (including independent servers or a server cluster composed of a plurality of servers), and the like.
Network 4 may include various network devices such as routers, switches, multiplexers, hubs, modems, bridges, repeaters, firewalls, proxy devices, and/or the like. The network 4 may include physical links, such as coaxial cable links, twisted pair cable links, fiber optic links, combinations thereof, and/or the like. The network 4 may include wireless links such as cellular links, satellite links, Wi-Fi links, and/or the like.
A computer device 6 may be configured to access the cloud server 2. The computer device 6 may comprise any type of computer device, such as: personal computer equipment, or a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of a plurality of servers), and the like.
To achieve the training effect, the computer device 6 may install a knowledge-graph construction tool. The knowledge-graph construction tool may construct a knowledge-graph from material stored on the computer device 6.
To make data processing more convenient, the computer device 6 is also preconfigured with a plurality of mathematical models, through which data processing and computation are realized.
The following describes an exemplary training scheme of the sensitive content recognition model provided by the present invention with the computer device 6 as the executing subject.
Example one
Fig. 2 schematically shows a flowchart of a method for training a sensitive content recognition model according to a first embodiment of the present invention.
As shown in FIG. 2, the method for training the sensitive content recognition model may include steps S200 to S205, wherein:
step S200, acquiring a plurality of user texts and a plurality of user accounts; each user text is sensitive content text or non-sensitive content text, and each user text is respectively associated with one user account.
The computer device 6 may define a user group as required, and query and download from the cloud server 2 the texts published by that user group (i.e., the user texts) and the accounts associated with those texts (i.e., the user accounts).
Step S201, constructing a knowledge graph based on the plurality of user accounts and the association relationships among the user accounts; the knowledge graph comprises a plurality of nodes, and each node corresponds to one of the user accounts.
For example, the construction of a knowledge graph generally comprises the steps of knowledge extraction, knowledge fusion, data model construction, and quality evaluation.
In this embodiment, the plurality of user accounts and the association relationships among them are unstructured data, and knowledge extraction from unstructured data can be divided into three steps. The first is entity extraction, also referred to as named entity recognition; an entity in this embodiment may be a user account. The second is relationship extraction, i.e., extracting the relationships between entities; the relationships in this embodiment include the association relationships between user accounts. The third is attribute extraction: an entity's attribute information is similar to a relationship, but whereas a relationship reflects the entity's external connections, attributes reflect its internal characteristics, represented in this embodiment by the account information.
Knowledge fusion is the process of integrating the knowledge of multiple knowledge bases into one knowledge base; the main problem to be solved in this process is entity alignment. The knowledge base in this embodiment emphasizes describing the relationships between an entity and other entities, and the purpose of knowledge fusion is to integrate the descriptions of an entity from different knowledge bases so as to obtain a complete description of that entity.
the data model construction is a data organization framework of the knowledge graph, and the knowledge graph can be constructed in a top-down mode, namely the data model of the knowledge graph is determined firstly.
The computer device 6 may utilize a knowledge-graph construction tool to construct a knowledge-graph from which relationships between individual accounts are determined.
As a preferred scheme, the embodiment describes an optimal construction scheme of the knowledge graph mainly through processes of knowledge extraction, knowledge fusion and data model construction;
as an example, as shown in FIG. 3, constructing a knowledge graph may include the following steps S201-1 to S201-3.
Step S201-1, obtaining account information of each user account of the plurality of user accounts to obtain a plurality of user account information.
Step S201-2, acquiring a plurality of groups of associated accounts based on the plurality of pieces of user account information; the user account information comprises registration and login information of the corresponding user accounts, and each group of associated accounts comprises two user accounts that share at least one identical piece of user account information.
Step S201-3, constructing a knowledge graph according to a plurality of groups of associated accounts; each user account corresponds to a node in the knowledge graph, and the same user account information between two user accounts of each group of associated accounts is used for constructing an edge between the two corresponding nodes.
For example, the computer device 6 queries the cloud server 2 for the registration and login information of each user account, which may include, but is not limited to, the registration time of the user account, the user's login time, the IP address from which the user logs in, the user's region, the user device name, the device serial number, and the MAC address. The computer device 6 automatically identifies any two user accounts that share at least one identical piece of registration or login information and treats them as a group of associated accounts. For example, if the distance range between users configured through the user interface is 200 meters, the computer device 6 determines two user accounts within 200 meters of each other as a group of associated accounts; if the configured registration interval is 2 hours, the computer device 6 determines two user accounts registered within 2 hours of each other as a group of associated accounts. The computer device 6 inputs the groups of associated accounts and the corresponding user account information into the knowledge-graph construction tool to generate the knowledge graph.
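As a minimal illustration of steps S201-1 to S201-3, the following sketch builds the associated-account graph with networkx standing in for the knowledge-graph construction tool; the field names (ip, device_mac, register_hour) and the sample values are illustrative assumptions, not taken from the patent.

    import itertools
    import networkx as nx

    # Registration/login information per user account (illustrative fields).
    accounts = {
        "user_a": {"ip": "10.0.0.1", "device_mac": "AA:BB:CC:01", "register_hour": 9},
        "user_b": {"ip": "10.0.0.1", "device_mac": "AA:BB:CC:02", "register_hour": 10},
        "user_c": {"ip": "10.0.0.9", "device_mac": "AA:BB:CC:02", "register_hour": 20},
    }

    graph = nx.Graph()
    graph.add_nodes_from(accounts)  # one node per user account

    for a, b in itertools.combinations(accounts, 2):
        # Two accounts form a group of associated accounts when at least one
        # piece of registration/login information is identical (same IP, same
        # device, or registration within the configured 2-hour window).
        shared = int(accounts[a]["ip"] == accounts[b]["ip"])
        shared += int(accounts[a]["device_mac"] == accounts[b]["device_mac"])
        shared += int(abs(accounts[a]["register_hour"] - accounts[b]["register_hour"]) <= 2)
        if shared:
            # The count of identical fields is the "association frequency"
            # used below to weight the edge between the two nodes.
            graph.add_edge(a, b, association_count=shared)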
Step S202, a plurality of account characteristic vectors are obtained according to the plurality of nodes, and each account characteristic vector corresponds to one node in the plurality of nodes.
As an example, the adjacency matrix describes the connections between nodes in the knowledge graph. Assume the adjacency matrix is a $|V| \times |V|$ matrix, where $|V|$ is the number of nodes in the knowledge graph; each row and each column of the matrix represents a node, and a non-zero value indicates that two nodes are connected. Operating on the adjacency matrix of a large graph as a feature space involves an enormous amount of data, so to make the computation simpler and faster, graph embedding is usually adopted to pack the node attributes into a vector of smaller dimension. Through the example below, this embodiment introduces a preferred graph-embedding way to convert the nodes into low-dimensional account feature vectors.
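As a concrete follow-up, the adjacency matrix of the three-account graph sketched earlier can be materialized directly; even here it is a $|V| \times |V|$ array, which illustrates why graph embedding is preferred for large graphs.

    import networkx as nx

    # "graph" is the associated-account graph built in the earlier sketch.
    adj = nx.to_numpy_array(graph)   # shape (3, 3); non-zero entries mark edges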
As an example, account feature vectors may be generated for the respective nodes by: step S202-1: embedding a plurality of nodes and a plurality of edges corresponding to a plurality of groups of associated accounts on the knowledge graph into a target function, and calculating a plurality of account characteristic vectors corresponding to the plurality of nodes through the target function; the objective function is:
[The objective function is given as an image in the original publication.]

where $E_{ij}$ represents the weight of the edge; $\phi(u_i)$ and $\phi(u_j)$ represent the $i$-th node $v_i$ and the $j$-th node $v_j$, respectively; $\phi'(u_j)$ represents a node adjacent to node $v_j$; and $u_i$ and $u_j$ represent the account feature vector representations of nodes $v_i$ and $v_j$, respectively.
For example, the weight $E_{ij}$ of an edge is calculated from association frequencies as follows: assume account $i$ is simultaneously associated with account $j$ and account $h$, the association frequency between $i$ and $j$ is 2, and the association frequency between $i$ and $h$ is 3; the weight is then calculated from these frequencies. [The weight formula is given as an image in the original publication.]
The association frequency refers to the number of identical pieces of user account information between the two accounts of a group of associated accounts. For example, with a configured distance range of 200 meters, a distance of 100 meters between account $i$ and account $j$ counts as one identical piece of user account information; similarly, with a configured registration interval of 2 hours, a registration interval of 1 hour between account $i$ and account $j$ counts as another; that is, there are two identical pieces of user account information between accounts $i$ and $j$.
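Continuing the sketch above, a possible reading of the weight computation normalizes the association count of edge $(i, j)$ by the total association count over all edges incident to node $i$; this normalization is an assumption inferred from the example frequencies (2 and 3), since the formula itself is an image in the original.

    def edge_weight(graph, i, j):
        # Sum of association counts over all edges incident to node i.
        total = sum(data["association_count"]
                    for _, _, data in graph.edges(i, data=True))
        return graph[i][j]["association_count"] / total

    # With the example counts (i-j: 2, i-h: 3) this evaluates to 2 / (2 + 3) = 0.4.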
Step S203: extracting a plurality of text feature vectors of the plurality of user texts, wherein each text feature vector corresponds to one user text.
Text representation is fundamental work in natural language processing, and its quality directly affects the performance of the whole natural-language-processing system. Text vectorization represents a text as a series of vectors expressing its semantics so that a computer can recognize it, and is an important way of representing text. A commonly used text-vectorization method is the bag-of-words model, which has the following problems: word-order information is not preserved, and semantic deviation arises. This embodiment therefore provides a preferred text-vectorization scheme that addresses these problems of the bag-of-words model;
the step of extracting a plurality of text feature vectors of the plurality of user texts comprises:
s203-1, preprocessing the user texts to obtain a plurality of vector matrixes;
and S203-2, inputting the vector matrixes into a convolutional neural network to obtain a plurality of text feature vectors corresponding to the vector matrixes, wherein each vector matrix corresponds to one text feature vector in the text feature vectors.
In order to eliminate irrelevant information in the user text and enable the computer device 6 to better understand the semantics of the text, this embodiment performs natural-language preprocessing on the text before text vectorization; a preferred scheme for preprocessing the user text is given below.
preferably, as shown in fig. 4, the step of preprocessing the user texts to obtain a plurality of vector matrices includes:
step S2031-1, performing word segmentation processing on each sentence in a plurality of user texts to obtain a word segmentation set of each user text;
the word segmentation processing is to divide each sentence in the user texts into a plurality of words according to a grammar rule;
by way of example, the word segmentation processing computer device 6 may perform operations by means of the following word segmentation tools: HanLP, FudanNLP, LTP, THULAC, NLPIR, BosonNLP, Baidu NLP, Tencent Wenzhi or Ariyun NLP, and the rules of word segmentation by the word segmentation tools belong to the prior art and are not specifically explained herein.
As a preferred scheme, after a word segmentation set of each user text is obtained, a latest stop word list can be downloaded on an open source platform such as Github through words in the word segmentation set, and stop words are removed from the text after word segmentation according to the stop word list; the stop word refers to that in the information retrieval, in order to save storage space and improve search efficiency, some characters or words are automatically filtered before or after processing natural language data (or text), and the characters or words are called stop words.
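In the sketch below, jieba is used as an illustrative stand-in for the segmentation tools listed above, and stopwords.txt is assumed to be a stop-word list downloaded from an open-source platform such as GitHub.

    import jieba

    with open("stopwords.txt", encoding="utf-8") as f:
        stopwords = {line.strip() for line in f}

    def tokenize(sentence):
        # Split the sentence into words, then drop the stop words.
        return [w for w in jieba.lcut(sentence) if w not in stopwords]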
Step S2031-2, encoding each word in each segmented-word set so as to convert each word into a corresponding word vector;
Step S2031-3, taking the sentences of each user text as units and acquiring a vector matrix for each sentence; the vector matrix is constructed from the word vectors corresponding to the words of the respective sentence, each row of the vector matrix corresponding to one word vector.
As a preferred scheme, since the objective function in the embodiment of the present invention is a vector-space-based algorithm while the user's original content consists of discrete features, each word in each sentence is encoded with one-hot encoding so that distance calculations between the discrete features are more reasonable and each feature can be treated as a continuous feature. The encoding principle is illustrated by the following example; suppose the user texts have the following three feature attributes:
Pronouns: ["I", "you", "he", "she"], where "I" is encoded as 1000, "you" as 0100, "he" as 0010, and "she" as 0001;
Time words: ["yesterday", "today", "tomorrow", "the day after tomorrow"], where "yesterday" is encoded as 1000, "today" as 0100, "tomorrow" as 0010, and "the day after tomorrow" as 0001;
Emotion words: ["happy", "angry", "sad", "afraid"], where "happy" is encoded as 1000, "angry" as 0100, "sad" as 0010, and "afraid" as 0001;
If an existing user text comprises ["I", "today", "happy"], the user's original content becomes: [1,0,0,0,0,1,0,0,1,0,0,0].
The operation of merging word vectors into a vector matrix is illustrated by the following example. Suppose a user text contains the 3 word vectors above, for "I", "today", and "happy"; stacking one word vector per row gives the vector matrix

$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}$$
as an example, to implement vector conversion on user text, the training method of the sensitive content recognition model may further include the following steps:
as a preferable scheme, the convolutional neural network model adopted in the present embodiment is preferably TextCNN; the workflow generally includes: firstly, embedding a vector matrix in a text into a Word2vector model for training, wherein the purpose of training through the model is as follows: converting the vector matrix into sentence vectors for machine recognition, and then defining convolution kernels with different sizes to extract features based on the size of the vector matrix; then, carrying out pooling through a pooling layer, wherein the pooling process comprises the following steps: screening out at least one maximum feature; and then the features are spliced into a text feature vector.
As an example, in order to implement the splicing operation between vectors, the training method of the sensitive content recognition model may further include the following steps:
and S204, splicing each user account feature vector with one or more corresponding text feature vectors to obtain a plurality of fusion feature vectors.
The features of the two dimensions (namely the account feature vector and the text feature vector) are fused through the vector splicing operation

$$\gamma_i = \phi_{i,1} \oplus \phi_{i,2}$$

where $\gamma_i$ represents the fusion feature vector, $\phi_{i,1}$ represents the account feature vector, $\phi_{i,2}$ represents the text feature vector, and $\oplus$ represents the vector splicing (concatenation) operation.
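In code, the splicing operation is plain vector concatenation; the dimensions below are illustrative.

    import numpy as np

    account_vec = np.random.rand(32)                     # account feature vector
    text_vec = np.random.rand(192)                       # text feature vector
    fused_vec = np.concatenate([account_vec, text_vec])  # fusion feature vector, shape (224,)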
Step S205, taking the plurality of fusion feature vectors as a plurality of groups of training samples and inputting each group of training samples into the classification model to be trained, so as to train the classification model and obtain the sensitive content recognition model.
As a further preference of this embodiment, to ensure that the predicted output is as consistent as possible with the expected output, the user texts of this embodiment carry two kinds of labels, sensitive content text and non-sensitive content text, during training. On one hand, the number of sensitive content texts is the same in each group of training samples used to train the different classifiers (each classifier corresponding to one sensitive content recognition model), and the number of non-sensitive content texts is likewise the same in each group. On the other hand, the cross entropy between the predicted output and the expected output is calculated by the following formula, and the parameters of the classifier are adjusted so that the calculated cross entropy approaches 0:

$$\mathrm{Loss} = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i \log f(x_i) + (1 - y_i)\log\big(1 - f(x_i)\big)\Big]$$

where $\mathrm{Loss}$ is the value of the cross entropy, $N$ is the number of training samples, $y_i$ is the expected output of sample $x_i$, and $f(x_i)$ is the predicted output for sample $x_i$.
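A direct NumPy rendering of the cross entropy above; the clipping constant is a numerical-stability detail added here, not part of the patent.

    import numpy as np

    def cross_entropy(y_true, y_pred, eps=1e-12):
        # y_true: 0/1 expected outputs; y_pred: predicted probabilities f(x_i).
        y_pred = np.clip(y_pred, eps, 1 - eps)
        return -np.mean(y_true * np.log(y_pred)
                        + (1 - y_true) * np.log(1 - y_pred))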
Example two
Fig. 5 schematically shows a flow chart of a text recognition method according to a second embodiment of the invention.
As shown in FIG. 5, the text recognition method may include steps S300 to S304, wherein:
step S300, determining a target user text to be processed; and extracting the text feature vector of the target user text.
The method for extracting the text feature vector of the target user text is the same as the corresponding method in the first embodiment and is not repeated here.
Step S301, searching a target user account related to the target user text according to the target user text;
for example, the computer device 6 in this embodiment searches for a target user account publishing the target user text through the cloud server 2.
Step S302, acquiring an account feature vector corresponding to the target user account according to the target user account;
Preferably, the computer device 6 may query, in the knowledge graph constructed according to the first embodiment, the target node corresponding to the target user account, the other nodes associated with the target node, and the edges between the target node and those other nodes; by embedding the target node, the other nodes, and the edges between them into the objective function provided in the first embodiment, the account feature vector corresponding to the target node is calculated through the objective function.
Step S303, splicing the account number feature vector and the text feature vector to obtain a fusion feature vector;
and S304, inputting the fusion feature vector into a trained sensitive content recognition model so as to output a target text type of the target user text through the sensitive content recognition model, wherein the target text type is a sensitive content text or a non-sensitive content text.
The step of inputting the fusion feature vector into a trained sensitive content recognition model to output a target text type of the target user text through the sensitive content recognition model includes:
step S304-1, inputting the fusion feature vector into each classifier of the plurality of classifiers to obtain a plurality of text types; the plurality of text types correspond to a plurality of output results of the plurality of classifiers one by one; and determining the text type with the number ratio larger than a preset threshold value as the target text type according to the plurality of text types. Wherein the sensitive content identification model comprises a plurality of classifiers, and the preset threshold is preferably 1/2.
Illustratively, after the different classifiers are trained on the groups of training samples, their recall and precision differ, so having a single classifier decide the target type of the target user text carries a large prediction error. To further reduce the prediction error of the target type, as a preferred implementation of this embodiment, the three classifiers adopted are the SVM, XGBoost, and Logistic classifiers, which vote simultaneously on the target type of the target text. Each voting result is 0 or 1, where 0 indicates that the target text type is non-sensitive content text and 1 indicates sensitive content text. For example, if two of the three classifiers recognize the target user text as type 0 and the other classifier recognizes it as type 1, the recognition result for the target user text is non-sensitive content text. A sketch of this vote follows.
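In the sketch below, scikit-learn and the xgboost package stand in for the SVM, XGBoost, and Logistic classifiers; the training data are random placeholders for the groups of fused training samples.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from xgboost import XGBClassifier

    rng = np.random.default_rng(0)
    X_train = rng.random((100, 224))        # placeholder fusion feature vectors
    y_train = rng.integers(0, 2, 100)       # 1 = sensitive, 0 = non-sensitive

    classifiers = [SVC(), XGBClassifier(), LogisticRegression(max_iter=1000)]
    for clf in classifiers:
        clf.fit(X_train, y_train)

    def predict_text_type(fused_vec):
        votes = [int(clf.predict(fused_vec.reshape(1, -1))[0]) for clf in classifiers]
        # Majority (proportion above the 1/2 threshold) decides the target type.
        return int(sum(votes) > len(votes) / 2)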
EXAMPLE III
FIG. 6 schematically shows a block diagram of a training system according to a third embodiment of the present invention. The training system may be partitioned into one or more program modules, which are stored in a storage medium and executed by one or more processors to implement the embodiment of the present invention. The program modules referred to in the embodiments of the present invention are a series of computer program instruction segments capable of performing specific functions; the following description explains the function of each program module of this embodiment.
As shown in fig. 6, the training system 130 may include an acquisition module 131, a graph construction module 132, a transformation module 133, an extraction module 134, a vector stitching module 135, and a training module 136. Wherein:
an obtaining module 131, configured to obtain a plurality of user texts and a plurality of user accounts; each user text is a sensitive content text or a non-sensitive content text, and each user text is respectively associated with one user account;
the map construction module 132 is configured to construct a knowledge map based on the plurality of user accounts and the association relationship between the user accounts; wherein the knowledge-graph comprises a plurality of nodes, each node corresponding to one of the plurality of user accounts;
a conversion module 133, configured to obtain a plurality of account feature vectors according to the plurality of nodes, where each account feature vector corresponds to one node in the plurality of nodes;
an extracting module 134, configured to extract a plurality of text feature vectors of the plurality of user texts, where each text feature vector corresponds to one user text;
a vector stitching module 135, configured to stitch each user account feature vector with one or more corresponding text feature vectors to obtain multiple fused feature vectors;
the training module 136 is configured to use the multiple fusion feature vectors as multiple groups of training samples, and input each group of training samples into a classification model to be trained, so as to train the classification model to be trained, so as to obtain the sensitive content recognition model.
The map construction module 132 is further configured to acquire account information of each of the plurality of user accounts to obtain a plurality of pieces of user account information; to acquire a plurality of groups of associated accounts based on the plurality of pieces of user account information, wherein the user account information comprises registration and login information of the corresponding user account and each group of associated accounts comprises two user accounts that share at least one identical piece of user account information; and to construct the knowledge graph according to the plurality of groups of associated accounts, wherein each user account corresponds to a node in the knowledge graph and the identical user account information shared by the two user accounts of each group is used to construct an edge between the two corresponding nodes.
The conversion module 133 is further configured to embed a plurality of nodes and a plurality of edges corresponding to a plurality of groups of associated accounts on the knowledge graph into an objective function, and calculate a plurality of account feature vectors corresponding to the plurality of nodes through the objective function, where the objective function is:
[The objective function is given as an image in the original publication.]

where $E_{ij}$ represents the weight of the edge; $\phi(u_i)$ and $\phi(u_j)$ represent the $i$-th node $v_i$ and the $j$-th node $v_j$, respectively; $\phi'(u_j)$ represents a node adjacent to node $v_j$; and $u_i$ and $u_j$ represent the account feature vector representations of nodes $v_i$ and $v_j$, respectively.
The extracting module 134 is further configured to pre-process the user texts to obtain a plurality of vector matrices; inputting the plurality of vector matrices into a convolutional neural network to obtain a plurality of text feature vectors corresponding to the plurality of vector matrices, each vector matrix corresponding to one text feature vector of the plurality of text feature vectors.
The extraction module 134 is further configured to perform word segmentation on each sentence in the plurality of user texts to obtain a segmented-word set for each user text; to encode each word in each segmented-word set so as to convert it into a corresponding word vector; and to acquire a vector matrix for each sentence, taking the sentences of each user text as units, wherein the vector matrix is constructed from the word vectors corresponding to the words of the respective sentence, each row of the vector matrix corresponding to one word vector.
Example four
Fig. 7 schematically shows a hardware architecture diagram of a computer device 6 suitable for implementing the training method of the sensitive content recognition model and the text recognition method according to the fourth embodiment of the present invention. In this embodiment, the computer device 6 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. For example, it may be a personal computer, a rack server, a blade server, a tower server or a cabinet server (an independent server or a server cluster composed of a plurality of servers), a gateway, and the like. As shown in fig. 7, the computer device 6 includes at least, but is not limited to: a memory 141, a processor 142, and a network interface 143, which may be communicatively linked to each other via a system bus; wherein:
the memory 141 includes at least one type of computer-readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the storage 141 may be an internal storage module of the computer device 6, such as a hard disk or a memory of the computer device 6. In other embodiments, the memory 141 may also be an external storage device of the computer device 6, such as a plug-in hard disk provided on the computer device 6, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Of course, memory 141 may also include both internal and external memory modules of computer device 6. In this embodiment, the memory 141 is generally used for storing an operating system installed in the computer device 6 and various application software, such as program codes of virus scanning and displaying methods. Further, the memory 141 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 142 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 142 is generally configured to control the overall operation of the computer device 6, such as performing control and processing related to data interaction or communication with the computer device 6. In this embodiment, the processor 142 is used to execute program codes stored in the memory 141 or process data.
Network interface 143 may comprise a wireless network interface or a wired network interface, and is typically used to establish communication links between the computer device 6 and other computer devices. For example, the network interface 143 connects the computer device 6 with an external terminal via a network, establishing a data transmission channel and a communication link between them. The network may be a wireless or wired network such as an Intranet, the Internet, a Global System for Mobile Communications (GSM) network, Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, or Wi-Fi.
It is noted that fig. 7 only shows a computer device with components 141-143, but it is to be understood that not all of the shown components are required to be implemented; more or fewer components may be implemented instead.
In this embodiment, the training method of the sensitive content recognition model and the text recognition method stored in the memory 141 may also be divided into one or more program modules and executed by one or more processors (in this embodiment, the processor 142) to implement the embodiment of the present invention.
EXAMPLE five
The present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method for training a sensitive content recognition model and text recognition in embodiments.
In this embodiment, the computer-readable storage medium includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the computer readable storage medium may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the computer readable storage medium may be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device. Of course, the computer-readable storage medium may also include both internal and external storage devices of the computer device. In this embodiment, the computer-readable storage medium is generally used to store an operating system and various types of application software installed in a computer device, for example, program codes of the embodiment that implement the method for training the sensitive content recognition model and recognizing the text. Further, the computer-readable storage medium may also be used to temporarily store various types of data that have been output or are to be output.
It will be apparent to those skilled in the art that the modules or steps of the embodiments of the invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention; all modifications of equivalent structures and equivalent processes made using the contents of the present specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A training method for a sensitive content recognition model is characterized by comprising the following steps:
acquiring a plurality of user texts and a plurality of user accounts; each user text is a sensitive content text or a non-sensitive content text, and each user text is respectively associated with one user account;
constructing a knowledge graph based on the plurality of user accounts and the association relationships among the user accounts; wherein the knowledge graph comprises a plurality of nodes, each node corresponding to one of the plurality of user accounts;
obtaining a plurality of account feature vectors according to the plurality of nodes, wherein each account feature vector corresponds to one node in the plurality of nodes;
extracting a plurality of text feature vectors of the plurality of user texts, wherein each text feature vector corresponds to one user text;
splicing each user account feature vector with one or more corresponding text feature vectors to obtain a plurality of fusion feature vectors; and
taking the plurality of fusion feature vectors as a plurality of groups of training samples, respectively inputting each group of training samples into a classification model to be trained, and training the classification model to be trained to obtain the sensitive content recognition model.
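For readers tracing the claimed flow end to end, a compact outline may help. The Python below is illustrative only: build_knowledge_graph, embed_nodes, and extract_text_features are hypothetical helpers standing in for the steps of claims 2-5 (sketched after those claims), and the scikit-learn classifier is a stand-in for the unspecified classification model to be trained.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # stand-in; the patent does not fix the classifier

def train_sensitive_content_model(user_texts, user_accounts, account_info, labels):
    """Illustrative outline of the claimed training flow (not the filed implementation).

    user_texts    : list[str], each associated with one user account
    user_accounts : account ids aligned with user_texts
    account_info  : {account_id: registration/login information}
    labels        : list[int], 1 = sensitive content text, 0 = non-sensitive
    """
    # Claims 2-3: build the account knowledge graph and embed each node
    # into an account feature vector.
    graph = build_knowledge_graph(account_info)   # hypothetical helper, sketched under claim 2
    account_vecs = embed_nodes(graph)             # hypothetical helper, sketched under claim 3

    # Claims 4-5: extract one text feature vector per user text.
    text_vecs = [extract_text_features(t) for t in user_texts]  # hypothetical helper

    # Splice each account feature vector with its corresponding text feature vector.
    fused = np.stack([np.concatenate([account_vecs[acc], vec])
                      for acc, vec in zip(user_accounts, text_vecs)])

    # Use the fusion feature vectors as training samples for the classifier.
    return LogisticRegression(max_iter=1000).fit(fused, labels)
```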
2. The method for training the sensitive content recognition model according to claim 1, wherein the step of constructing the knowledge graph based on the plurality of user accounts and the association relationship between the user accounts comprises:
acquiring account information of each user account in the plurality of user accounts to obtain a plurality of user account information;
acquiring a plurality of groups of associated accounts based on the plurality of user account information; wherein the user account information comprises registration and login information of the corresponding user account, and each group of associated accounts comprises two user accounts that share at least one identical item of user account information; and
constructing a knowledge graph according to the multiple groups of associated accounts; each user account corresponds to a node in the knowledge graph, and the same user account information between two user accounts of each group of associated accounts is used for constructing an edge between the two corresponding nodes.
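One plausible realization of this claim links every pair of accounts sharing at least one identical item of registration or login information; the field names below (phone, device, ip) are illustrative assumptions, not fields named in the patent. A minimal sketch with networkx:

```python
import itertools
import networkx as nx

def build_knowledge_graph(account_info):
    """account_info: {account_id: {"phone": ..., "device": ..., "ip": ...}}
    (field names are hypothetical). Returns a graph with one node per
    account and one edge per pair of accounts sharing information."""
    g = nx.Graph()
    g.add_nodes_from(account_info)
    for a, b in itertools.combinations(account_info, 2):
        shared = [field for field, value in account_info[a].items()
                  if value and value == account_info[b].get(field)]
        if shared:
            # The shared item(s) construct the edge; the count doubles as
            # an edge weight for the embedding step of claim 3.
            g.add_edge(a, b, weight=len(shared), shared=shared)
    return g
```

The pairwise loop is O(n²) in the number of accounts; at production scale one would instead invert account_info into an index from each information item to the accounts carrying it, and only link accounts that co-occur under some item.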
3. The method for training the sensitive content recognition model according to claim 1, wherein the step of obtaining a plurality of account feature vectors according to the plurality of nodes comprises:
substituting, into an objective function, the plurality of nodes on the knowledge graph and a plurality of edges corresponding to the plurality of groups of associated accounts, and calculating, through the objective function, a plurality of account feature vectors corresponding to the plurality of nodes, wherein the objective function is as follows:
[The objective function is rendered only as an image (FDA0003126820770000021) in the source filing; the formula itself is not recoverable from the text.]
wherein E_ij represents the weight of the edge; φ(u_i) and φ(u_j) represent the i-th node v_i and the j-th node v_j, respectively; φ'(u_j) represents a neighboring node of v_j; and u_i, u_j represent the account feature vectors of node v_i and node v_j, respectively.
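Because the filed formula survives only as an image, any reconstruction is an assumption. The variable glossary (edge weight E_ij, node vectors, primed neighbour vectors φ'(u_j)) is at least consistent with a LINE-style edge-weighted embedding trained by negative sampling, so the sketch below adopts that interpretation; it is explicitly not the patented objective.

```python
import numpy as np

def embed_nodes(g, dim=64, epochs=50, lr=0.025, seed=0):
    """LINE-style node embedding sketch for claim 3 (assumed objective)."""
    rng = np.random.default_rng(seed)
    nodes = list(g.nodes)
    idx = {n: i for i, n in enumerate(nodes)}
    emb = rng.normal(scale=0.1, size=(len(nodes), dim))  # node vectors, u_i
    ctx = rng.normal(scale=0.1, size=(len(nodes), dim))  # neighbour/context vectors, phi'(u_j)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    for _ in range(epochs):
        for a, b, w in g.edges(data="weight", default=1.0):
            i, j = idx[a], idx[b]
            k = int(rng.integers(len(nodes)))                 # one random negative node
            grad_pos = w * (1.0 - sigmoid(emb[i] @ ctx[j]))   # pull edge endpoints together, scaled by E_ij
            grad_neg = -sigmoid(emb[i] @ ctx[k])              # push the negative sample away
            emb_i = emb[i].copy()
            emb[i] += lr * (grad_pos * ctx[j] + grad_neg * ctx[k])
            ctx[j] += lr * grad_pos * emb_i
            ctx[k] += lr * grad_neg * emb_i
    return {n: emb[idx[n]] for n in nodes}
```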
5. The method for training a sensitive content recognition model according to claim 1, wherein the step of extracting a plurality of text feature vectors of the plurality of user texts, each text feature vector corresponding to one user text, comprises:
preprocessing the plurality of user texts to obtain a plurality of vector matrices;
inputting the plurality of vector matrices into a convolutional neural network to obtain a plurality of text feature vectors corresponding to the plurality of vector matrices, each vector matrix corresponding to one text feature vector of the plurality of text feature vectors.
5. The method for training the sensitive content recognition model according to claim 4, wherein the step of preprocessing the user texts to obtain vector matrices comprises:
performing word segmentation processing on each sentence in the plurality of user texts to obtain a word segmentation set of each user text;
encoding each word in each word segmentation set so as to convert each word in each word segmentation set into a corresponding word vector; and
taking each sentence of each user text as a unit, acquiring a vector matrix of each sentence; wherein the vector matrix is constructed from a plurality of word vectors corresponding to a plurality of words of the respective sentence, each row of the vector matrix corresponding to one word vector.
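Claims 4 and 5 together describe a TextCNN-style extractor: segment each sentence, turn each word into a word vector so a sentence becomes a matrix with one word vector per row, then feed the matrix to a convolutional network. A hedged PyTorch sketch, with all hyperparameters chosen for illustration only (the word segmenter and vocabulary are assumed to exist upstream):

```python
import torch
import torch.nn as nn

class TextFeatureExtractor(nn.Module):
    """TextCNN-style sketch of claims 4-5; all sizes are illustrative."""

    def __init__(self, vocab_size, embed_dim=128, n_filters=100, kernel_sizes=(2, 3, 4)):
        super().__init__()
        # Claim 5: each word id becomes a word vector; a sentence becomes
        # a (seq_len, embed_dim) matrix, one row per word vector.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Claim 4: convolve over the word axis with several window sizes.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_filters, k) for k in kernel_sizes
        )

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # Max-pool each feature map over time and concatenate the results
        # into one text feature vector per sentence.
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)                # (batch, n_filters * len(kernel_sizes))

# Example: 4 tokenised sentences of 50 word ids each -> feature shape (4, 300).
# feats = TextFeatureExtractor(vocab_size=30000)(torch.randint(1, 30000, (4, 50)))
```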
6. A text recognition method, comprising:
determining a target user text to be processed, and extracting a text feature vector of the target user text;
searching a target user account related to the target user text according to the target user text;
acquiring an account feature vector corresponding to the target user account according to the target user account;
splicing the account feature vector and the text feature vector to obtain a fusion feature vector; and
inputting the fusion feature vector into a trained sensitive content recognition model so as to output a target text type of the target user text through the sensitive content recognition model, wherein the target text type is a sensitive content text or a non-sensitive content text, and the sensitive content recognition model is obtained by the method for training a sensitive content recognition model according to any one of claims 1 to 5.
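Inference reuses both feature paths for a single text. A sketch under the same assumptions as the training outline (the helper and model here are the hypothetical ones introduced there, not APIs from the patent):

```python
import numpy as np

def recognize_text(target_text, target_account, account_vecs, model):
    """Sketch of the claimed recognition flow (claim 6)."""
    text_vec = extract_text_features(target_text)       # hypothetical helper (claims 4-5)
    fused = np.concatenate([account_vecs[target_account], text_vec])  # splice the two vectors
    label = model.predict(fused.reshape(1, -1))[0]      # classifier from the training sketch
    return "sensitive content text" if label == 1 else "non-sensitive content text"
```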
7. The text recognition method according to claim 6, wherein the sensitive content recognition model comprises a plurality of classifiers, and the step of inputting the fusion feature vector into the trained sensitive content recognition model so as to output the target text type of the target user text through the sensitive content recognition model comprises:
inputting the fusion feature vector into each classifier of the plurality of classifiers to obtain a plurality of text types; wherein the plurality of text types correspond one-to-one to a plurality of output results of the plurality of classifiers; and
determining, from the plurality of text types, a text type whose proportion is greater than a preset threshold as the target text type.
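The ensemble rule of this claim reduces to thresholded voting over the classifiers' outputs. A compact sketch (the behaviour when no type clears the threshold is an open detail the patent does not specify):

```python
from collections import Counter

def vote_target_type(text_types, threshold=0.5):
    """text_types: one predicted text type per classifier in the ensemble.
    Returns the type whose share of the votes exceeds the preset
    threshold, or None if no type clears it."""
    counts = Counter(text_types)
    text_type, n = counts.most_common(1)[0]
    return text_type if n / len(text_types) > threshold else None
```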
8. A system for training a sensitive content recognition model, comprising:
the acquisition module is used for acquiring a plurality of user texts and a plurality of user accounts; each user text is a sensitive content text or a non-sensitive content text, and each user text is respectively associated with one user account;
the map building module is used for constructing a knowledge graph based on the plurality of user accounts and the association relationships among the user accounts; wherein the knowledge graph comprises a plurality of nodes, each node corresponding to one of the plurality of user accounts;
the conversion module is used for acquiring a plurality of account characteristic vectors according to the plurality of nodes, wherein each account characteristic vector corresponds to one node in the plurality of nodes;
the extraction module is used for extracting a plurality of text feature vectors of the user texts, wherein each text feature vector corresponds to one user text;
the vector splicing module is used for splicing each user account feature vector with one or more corresponding text feature vectors to obtain a plurality of fusion feature vectors;
and the training module is used for taking the plurality of fusion feature vectors as a plurality of groups of training samples, respectively inputting each group of training samples into a classification model to be trained, and training the classification model to be trained to obtain the sensitive content recognition model.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor is adapted to carry out the steps of the method for training a sensitive content recognition model according to any of claims 1-5 or the method for text recognition according to any of claims 6-7 when executing the computer program.
10. A computer-readable storage medium, having stored therein a computer program, which is executable by at least one processor to cause the at least one processor to perform the method for training a sensitive content recognition model according to any one of claims 1 to 5, or the steps of the method for text recognition according to any one of claims 6 to 7.
CN202110691212.9A 2021-06-22 2021-06-22 Training method of sensitive content recognition model, text recognition method and related device Active CN113254649B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110691212.9A CN113254649B (en) 2021-06-22 2021-06-22 Training method of sensitive content recognition model, text recognition method and related device

Publications (2)

Publication Number Publication Date
CN113254649A true CN113254649A (en) 2021-08-13
CN113254649B CN113254649B (en) 2023-07-18

Family

ID=77189068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110691212.9A Active CN113254649B (en) 2021-06-22 2021-06-22 Training method of sensitive content recognition model, text recognition method and related device

Country Status (1)

Country Link
CN (1) CN113254649B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309297A (en) * 2018-03-16 2019-10-08 腾讯科技(深圳)有限公司 Rubbish text detection method, readable storage medium storing program for executing and computer equipment
US20200104367A1 (en) * 2018-09-30 2020-04-02 International Business Machines Corporation Vector Representation Based on Context
CN111914885A (en) * 2020-06-19 2020-11-10 合肥工业大学 Multitask personality prediction method and system based on deep learning
CN112349410A (en) * 2020-11-13 2021-02-09 北京京东尚科信息技术有限公司 Training method, triage method and system for triage model of department triage
CN112992317A (en) * 2021-05-10 2021-06-18 明品云(北京)数据科技有限公司 Medical data processing method, system, equipment and medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023065632A1 (en) * 2021-10-21 2023-04-27 平安科技(深圳)有限公司 Data desensitization method, data desensitization apparatus, device, and storage medium
CN113963357A (en) * 2021-12-16 2022-01-21 北京大学 Knowledge graph-based sensitive text detection method and system
CN113963357B (en) * 2021-12-16 2022-03-11 北京大学 Knowledge graph-based sensitive text detection method and system
TWI847804B (en) 2023-07-20 2024-07-01 玉山商業銀行股份有限公司 Method and system for establishing knowledge database with a linguistic model

Also Published As

Publication number Publication date
CN113254649B (en) 2023-07-18

Similar Documents

Publication Publication Date Title
CN111898364B (en) Neural network relation extraction method, computer equipment and readable storage medium
CN110968684B (en) Information processing method, device, equipment and storage medium
CN110866093A (en) Machine question-answering method and device
CN113010683B (en) Entity relationship identification method and system based on improved graph attention network
CN111984792A (en) Website classification method and device, computer equipment and storage medium
CN113254649B (en) Training method of sensitive content recognition model, text recognition method and related device
CN113706151A (en) Data processing method and device, computer equipment and storage medium
CN112749556B (en) Multi-language model training method and device, storage medium and electronic equipment
CN113705196A (en) Chinese open information extraction method and device based on graph neural network
CN112231416A (en) Knowledge graph ontology updating method and device, computer equipment and storage medium
CN114691525A (en) Test case selection method and device
CN113486173A (en) Text labeling neural network model and labeling method thereof
CN115730597A (en) Multi-level semantic intention recognition method and related equipment thereof
CN110765276A (en) Entity alignment method and device in knowledge graph
CN113869609A (en) Method and system for predicting confidence of frequent subgraph of root cause analysis
CN117725220A (en) Method, server and storage medium for document characterization and document retrieval
CN117453917A (en) Model training method and device, storage medium and electronic equipment
CN110705258A (en) Text entity identification method and device
CN116226747A (en) Training method of data classification model, data classification method and electronic equipment
CN112749530B (en) Text encoding method, apparatus, device and computer readable storage medium
CN115099344A (en) Model training method and device, user portrait generation method and device, and equipment
CN116414938A (en) Knowledge point labeling method, device, equipment and storage medium
US11593569B2 (en) Enhanced input for text analytics
CN114328894A (en) Document processing method, document processing device, electronic equipment and medium
CN113535946A (en) Text identification method, device and equipment based on deep learning and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant