CN113254649B - Training method of sensitive content recognition model, text recognition method and related device

Info

Publication number: CN113254649B
Application number: CN202110691212.9A
Authority: CN (China)
Prior art keywords: text, user, vector, account, sensitive content
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Other versions: CN113254649A
Inventors: 成杰峰, 彭奕
Current assignee: Ping An Life Insurance Company of China Ltd
Original assignee: Ping An Life Insurance Company of China Ltd
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202110691212.9A
Publication of application: CN113254649A
Publication of grant: CN113254649B

Classifications

    • G06F16/35: Information retrieval of unstructured textual data; clustering; classification
    • G06F16/367: Creation of semantic tools, e.g. ontology or thesauri; ontology
    • G06F40/126: Handling natural language data; text processing; use of codes for handling textual entities; character encoding
    • G06F40/279: Handling natural language data; natural language analysis; recognition of textual entities
    • G06F40/30: Handling natural language data; semantic analysis
    • G06N3/045: Computing arrangements based on biological models; neural network architectures; combinations of networks
    • G06N3/08: Computing arrangements based on biological models; neural networks; learning methods
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a training method for a sensitive content recognition model, which comprises the following steps: acquiring a plurality of user texts and a plurality of user accounts; constructing a knowledge graph based on the plurality of user accounts and the association relations among the user accounts, wherein the knowledge graph comprises a plurality of nodes; acquiring a plurality of account feature vectors according to the plurality of nodes; extracting a plurality of text feature vectors of the plurality of user texts; splicing each user account feature vector with one or more corresponding text feature vectors to obtain a plurality of fusion feature vectors; and taking the fusion feature vectors as a plurality of groups of training samples, inputting each group of training samples into a classification model to be trained, and training that model to obtain the sensitive content recognition model. According to the invention, the features of the user text and the user account are fused, and training on the fused features and using them for recognition improves the recognition accuracy and training efficiency for sensitive content.

Description

Training method of sensitive content recognition model, text recognition method and related device
Technical Field
The invention relates to the field of artificial intelligence, in particular to a training method of a sensitive content recognition model, a text recognition method and a related device.
Background
Existing sensitive content recognition models typically employ supervised machine learning schemes, such as text classifiers based on CNN models. A text classifier's ability to identify sensitive content depends on the amount of information in the labeled samples, and the information carried by the text content alone is not sufficient to train a good text classifier. As a result, existing text classifiers cannot correctly identify categories of sensitive-information text that are absent from the labeled samples.
Existing pre-trained models such as ELMo (Embeddings from Language Models) are huge in size, and this makes text-processing methods built on pre-trained models very time-consuming and hard to apply in real scenarios. It is therefore important to find a text training method that is accurate, efficient, and convenient to apply.
Disclosure of Invention
The invention aims to provide a training method for a sensitive content recognition model, a text recognition method, a computer device, and a computer-readable storage medium, so as to solve the problem that the prior art cannot correctly identify categories of sensitive-information text that are absent from the labeled samples.
A first aspect of an embodiment of the present invention provides a training method for a sensitive content identification model, including:
acquiring a plurality of user texts and a plurality of user accounts; each user text is a sensitive content text or a non-sensitive content text, and each user text is respectively associated with one user account; constructing a knowledge graph based on the plurality of user accounts and the association relation among the user accounts; the knowledge graph comprises a plurality of nodes, and each node corresponds to one user account in the plurality of user accounts; acquiring a plurality of account feature vectors according to the plurality of nodes, wherein each account feature vector corresponds to one node in the plurality of nodes; extracting a plurality of text feature vectors of the plurality of user texts, wherein each text feature vector corresponds to one user text; splicing each user account feature vector with one or more corresponding text feature vectors to obtain a plurality of fusion feature vectors; and taking the fusion feature vectors as a plurality of groups of training samples, respectively inputting each group of training samples into a classification model to be trained so as to train the classification model to be trained, thereby obtaining the sensitive content identification model.
Optionally, the step of constructing a knowledge graph based on the plurality of user accounts and the association relationship between the user accounts includes: acquiring account information of each user account in the plurality of user accounts to obtain a plurality of user account information; acquiring a plurality of groups of associated accounts based on the plurality of user account information; the user account information comprises registration login information of corresponding user accounts, and each group of associated accounts comprises two user accounts with the same at least one user account information; constructing a knowledge graph according to a plurality of groups of associated account numbers; each user account corresponds to one node in the knowledge graph, and the same user account information between the two user accounts of each group of associated accounts is used for constructing an edge between the corresponding two nodes.
Optionally, the step of obtaining a plurality of account feature vectors according to the plurality of nodes includes: embedding a plurality of nodes and a plurality of edges corresponding to a plurality of groups of associated accounts on the knowledge graph into an objective function, and calculating a plurality of account feature vectors corresponding to the plurality of nodes through the objective function, wherein the objective function is as follows:
O = -Σ_{(i,j)∈E} E_ij [ log p_1(Φ(u_i), Φ(u_j)) + log p_2(Φ(u_i) | Φ'(u_j)) ]

wherein E_ij represents the weight of an edge; u_i and u_j represent the i-th node v_i and the j-th node v_j respectively; Φ'(u_j) represents the account feature vector representation of the neighboring nodes of node v_j; Φ(u_i) and Φ(u_j) represent the account feature vector representations of node v_i and node v_j respectively; p_1 denotes the joint probability distribution of two account feature vectors, and p_2 denotes the conditional probability distribution of an account feature vector given the vectors of its neighboring nodes.
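A minimal sketch of how such a graph-embedding objective could be evaluated, assuming LINE-style sigmoid and softmax forms for p_1 and p_2; the patent does not fix these forms, so every function and variable name below is an illustrative assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def objective(edges, weights, phi, phi_ctx):
    """Weighted negative log-likelihood over the graph's edges (LINE-style sketch).

    edges   : list of (i, j) node-index pairs on the knowledge graph
    weights : dict mapping (i, j) -> edge weight E_ij
    phi     : (n_nodes, dim) array of account feature vectors Phi(u)
    phi_ctx : (n_nodes, dim) array of neighbour ("context") vectors Phi'(u)
    """
    loss = 0.0
    for i, j in edges:
        # assumed first-order term: joint probability p1(Phi(u_i), Phi(u_j))
        p1 = sigmoid(phi[i] @ phi[j])
        # assumed second-order term: conditional probability p2(Phi(u_i) | Phi'(u_j))
        scores = phi_ctx @ phi[i]                      # affinity of node i to every context vector
        p2 = np.exp(scores[j]) / np.exp(scores).sum()  # softmax over context nodes
        loss -= weights[(i, j)] * (np.log(p1) + np.log(p2))
    return loss

rng = np.random.default_rng(0)
phi, phi_ctx = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(objective([(0, 1), (1, 2)], {(0, 1): 1.0, (1, 2): 0.4}, phi, phi_ctx))
```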
Optionally, the step of extracting a plurality of text feature vectors of the plurality of user texts, each text feature vector corresponding to one user text, includes: preprocessing the text of the plurality of users to obtain a plurality of vector matrixes; the plurality of vector matrices are input to a convolutional neural network to obtain a plurality of text feature vectors corresponding to the plurality of vector matrices, each of the vector matrices corresponding to one of the plurality of text feature vectors.
Optionally, the step of preprocessing the text of the plurality of users to obtain a plurality of vector matrixes includes: performing word segmentation processing on each sentence in the plurality of user texts to obtain a word segmentation set of each user text; encoding each word in each word segmentation set to convert each word in each word segmentation set into a corresponding word vector; taking sentences of each user text as units, and acquiring a vector matrix of each sentence; the vector matrix is constructed according to a plurality of word vectors corresponding to a plurality of words of the corresponding sentence, and each row of the vector matrix corresponds to one word vector.
An aspect of the embodiment of the present invention further provides a text recognition method, including: determining a target user text to be processed; extracting text feature vectors of the target user text according to the target user text; searching a target user account associated with the target user text; according to the target user account, acquiring an account characteristic vector corresponding to the target user account; splicing the account feature vector and the text feature vector to obtain a fusion feature vector; and inputting the fusion feature vector into a trained sensitive content recognition model to output the text type of the target user text through the sensitive content recognition model, wherein the text type is sensitive content text or non-sensitive content text, and the sensitive content recognition model is a model trained by the training method of the sensitive content recognition model.
Optionally, the sensitive content recognition model includes a plurality of classifiers, and the step of inputting the fused feature vector into the trained sensitive content recognition model to output the target text type of the target user text through the sensitive content recognition model includes:
Inputting the fusion feature vector into each classifier in the plurality of classifiers to obtain a plurality of text types; wherein the text types are in one-to-one correspondence with the output results of the classifiers; and determining the text types with the quantity ratio larger than a preset threshold value as the target text types according to the plurality of text types.
An aspect of an embodiment of the present invention further provides a training system for a sensitive content identification model, including:
the acquisition module is used for acquiring a plurality of user texts and a plurality of user accounts; each user text is a sensitive content text or a non-sensitive content text, and each user text is respectively associated with one user account;
the map construction module is used for constructing a knowledge map based on the plurality of user accounts and the association relation among the user accounts; the knowledge graph comprises a plurality of nodes, and each node corresponds to one user account in the plurality of user accounts;
the conversion module is used for acquiring a plurality of account feature vectors according to the plurality of nodes, and each account feature vector corresponds to one node in the plurality of nodes;
The extraction module is used for extracting a plurality of text feature vectors of the user texts, and each text feature vector corresponds to one user text;
the vector splicing module is used for splicing each user account feature vector and one or more corresponding text feature vectors to obtain a plurality of fusion feature vectors;
the training module is used for taking the fusion feature vectors as a plurality of groups of training samples, respectively inputting each group of training samples into a classification model to be trained, and training the classification model to be trained to obtain the sensitive content identification model.
An aspect of an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the training method or the text recognition method of the sensitive content recognition model as described above when the computer program is executed.
An aspect of an embodiment of the present invention further provides a computer readable storage medium including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the training method or the text recognition method of the sensitive content recognition model as described above when the computer program is executed.
The embodiments of the invention provide a training method and a text recognition method for a sensitive content recognition model. The training samples of the embodiments are determined by analyzing the association between accounts and texts: a user who publishes sensitive content text usually registers multiple accounts, and when the user's behavior is restricted because of sensitive information published through one account, the sensitive content text is generally sent again through other accounts on the same device, through the same gateway, from the same IP address, or within the same time period. By exploiting this characteristic, the embodiments fuse the vectors of two different types of features, namely the user text and its associated account, and then train the classifier on the fused features. Compared with training a classifier on text information alone, the classifier of the embodiments gains an associated label (namely the user account corresponding to the user text) during training, which improves the recognition accuracy and training efficiency of the sensitive content recognition model.
Drawings
FIG. 1 schematically illustrates an environmental application schematic of a training method of a sensitive content identification model according to an embodiment of the present invention;
FIG. 2 schematically illustrates a flow chart of a training method of a sensitive content identification model according to a first embodiment of the present invention;
FIG. 3 is a sub-step diagram of step S201 in FIG. 2;
FIG. 4 is a sub-step diagram of step S203 in FIG. 2;
fig. 5 schematically shows a flowchart according to a text recognition method according to a second embodiment of the present invention;
FIG. 6 schematically illustrates a block diagram of a training system according to a third embodiment of the invention; and
Fig. 7 schematically shows a hardware architecture diagram of a computer device adapted to implement a training method of a sensitive content recognition model according to a fourth embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that the descriptions of "first," "second," etc. in the embodiments of the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, provided that the combination can be realized by those skilled in the art; when technical solutions are contradictory or cannot be realized, their combination should be considered absent and outside the scope of protection claimed by the present invention.
In the description of the present invention, it should be understood that the numerical references before the steps do not identify the order in which the steps are performed, but are merely used to facilitate description of the present invention and to distinguish between each step, and thus should not be construed as limiting the present invention.
The following is an explanation of the terminology involved in the present invention:
Knowledge graph (Knowledge Graph), known in library and information science as knowledge domain visualization or knowledge domain mapping, is a series of graphs showing the development process and structural relationships of knowledge; it uses visualization technology to describe knowledge resources and their carriers, and to mine, analyze, construct, draw and display knowledge and the interrelationships among them. In the embodiments of the present invention, the knowledge graph refers to a multi-relation graph (Multi-Relation Graph), which includes multiple types of nodes and multiple types of edges; "entities" are generally used to express the nodes in the graph and "relations" to express the edges. An entity refers to something in the real world, in this embodiment a user account, and a relation expresses a certain relationship between different entities.
Convolutional neural networks (Convolutional Neural Networks, CNN), which are a type of feedforward neural network (Feedforward Neural Networks) that includes convolutional calculations and has a deep structure, are one of the representative algorithms of deep learning. The convolutional neural network has a token learning (representation learning) capability that enables a shift-invariant classification (shift-invariant classification) of the input information in its hierarchical structure, which in this embodiment is used for feature extraction.
A classifier's conventional task is to learn classification rules from training data with given, known classes, and then to classify (or predict) unknown data. As the preferred implementation of this embodiment, three kinds of classifiers are adopted, namely SVM, XGBoost and Logistic.
The SVM (Support Vector Machine) is a binary classification model. Its classification idea is: given a sample set containing positive and negative examples, find a hyperplane that divides the samples into positive and negative examples such that the points closest to the hyperplane have the largest possible margin. In this embodiment, the selected positive examples are fusion feature vectors of non-sensitive content texts, and the selected negative examples are fusion feature vectors of sensitive content texts.
XGBoost uses CART regression trees as its base classifier and continuously generates new CART regression trees. Each new tree is generated by learning a new function that maps each sample to a uniquely determined leaf node, and all samples in the same leaf node share the same predicted value (the predicted result of each sample is the sum of the predicted scores of all trees). The objective of each function is to fit the historical residuals of the samples in all leaf nodes, so that the best tree model is found and added to the overall model.
The Logistic classifier is modeled on the Bernoulli distribution. Like the SVM classifier, it is a binary classification method that divides data into class 0 and class 1; it comprises linear summation, sigmoid activation, error calculation and parameter correction.
FIG. 1 schematically illustrates an environmental application schematic of a training method of a sensitive content identification model according to an embodiment of the present invention. In an exemplary embodiment, as shown in fig. 1, cloud server 2 may connect to computer device 6 over network 4.
The cloud server 2 may provide query and download services, such as user text and user account, for the computer device 6 over the network 4.
The cloud server 2 may be, for example: a rack server, a blade server, a tower server, or a cabinet server (including a stand-alone server or a server cluster composed of multiple servers), and the like.
The network 4 may include various network devices such as routers, switches, multiplexers, hubs, modems, bridges, repeaters, firewalls, proxy devices, and/or the like. The network 4 may include physical links, such as coaxial cable links, twisted-pair cable links, fiber-optic links, combinations thereof, and/or the like. The network 4 may include wireless links, such as cellular links, satellite links, Wi-Fi links, and/or the like.
The computer device 6 may be configured to access the cloud server 2. The computer device 6 may comprise any type of computer device, such as: a personal computer device, or a rack server, blade server, tower server, or cabinet server (including a stand-alone server or a server cluster composed of multiple servers), and the like.
To achieve the training effect, the computer device 6 may install a knowledge graph construction tool. The knowledge-graph construction tool may construct a knowledge graph from data stored on the computer device 6.
In order to facilitate the processing of data by the computer device 6, the computer device 6 is also preconfigured with a plurality of mathematical models through which the processing and operation of data are realized.
An exemplary description will be made below of a training scheme of the sensitive content recognition model provided by the present invention with the computer device 6 as an execution subject.
Embodiment 1
Fig. 2 schematically shows a flowchart of a training method of a sensitive content identification model according to a first embodiment of the invention.
As shown in fig. 2, the training method of the sensitive content identification model may include steps S200 to S205, in which:
step S200, a plurality of user texts and a plurality of user accounts are obtained; each user text is sensitive content text or non-sensitive content text, and each user text is respectively associated with one user account.
The computer device 6 may define a user group, as desired, and query and download text published by the corresponding user group (i.e., user text), and an account associated with the published text (i.e., user account) on the cloud server 2.
Step S201, constructing a knowledge graph based on the plurality of user accounts and the association relation among the user accounts; the knowledge graph comprises a plurality of nodes, and each node corresponds to one user account in the plurality of user accounts.
For example: the construction of the knowledge graph generally comprises the steps of knowledge extraction, knowledge fusion, data model construction and quality assessment;
In this embodiment, the multiple user accounts and the association relationships between the user accounts are unstructured data, and knowledge extraction on unstructured data may be divided into three steps. First, entity extraction, also referred to as named entity recognition; the entities in this embodiment may include user accounts. Second, relation extraction, i.e., the relations between entities; the relations in this embodiment include the association relationships among the user accounts. Third, attribute extraction, i.e., the attribute information of entities; compared with relations, which reflect an entity's external connections, attributes reflect an entity's internal characteristics, and in this embodiment the attributes are expressed as account information.
Knowledge fusion is a process of integrating knowledge in a plurality of knowledge bases to form a knowledge base, and in the process, the main problem to be solved is entity alignment. The knowledge base in the embodiment focuses on describing the relationship between the entity and other entities, and the purpose of knowledge fusion is to integrate descriptions of the entities by different knowledge bases so as to obtain complete descriptions of the entities;
the data model construction, namely the data organization framework of the knowledge graph, can adopt a top-down mode to construct the knowledge graph, namely the data model of the knowledge graph is firstly determined.
The computer device 6 may construct a knowledge-graph using a knowledge-graph construction tool, by which relationships between individual accounts are determined.
As a preferred scheme, in this embodiment, an exemplary description is made on a preferred construction scheme of the knowledge graph mainly through the processes of knowledge extraction, knowledge fusion and data model construction;
as an example, as shown in fig. 3, constructing the knowledge-graph may include the following steps S201-1 to S201-3.
Step S201-1, obtaining account information of each user account in the plurality of user accounts to obtain a plurality of user account information.
Step S201-2, acquiring a plurality of groups of associated accounts based on the plurality of user account information; the user account information comprises registration login information of corresponding user accounts, and each group of associated accounts comprises two user accounts with the same at least one user account information.
Step S201-3, constructing a knowledge graph according to a plurality of groups of associated account numbers; each user account corresponds to one node in the knowledge graph, and the same user account information between the two user accounts of each group of associated accounts is used for constructing an edge between the corresponding two nodes.
For example: the computer device 6 queries the cloud server 2 for the registration login information of the corresponding user account, where the registration login information may include, but is not limited to, the registration time of the user account, the user login time, the IP address of the user login, the region where the user is located, the name of the user device, the device serial number, and the MAC address. The computer device 6 automatically obtains two user accounts with at least one identical piece of registration login information and takes them as a group of associated accounts. For example, if the distance range between users configured through the user interface is 200 meters, the computer device 6 determines two user accounts within that 200-meter range as a group of associated accounts; if the configured interval between account registrations is 2 hours, the computer device 6 determines two user accounts registered within 2 hours of each other as a group of associated accounts. The computer device 6 inputs the multiple groups of associated accounts and the corresponding user account information into the knowledge graph construction tool to generate a knowledge graph; a sketch of this pairing step is shown below.
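A minimal sketch of the pairing step just described; all field names, thresholds and account records are hypothetical:

```python
from itertools import combinations

# Hypothetical account records; every field name and value is illustrative.
accounts = [
    {"id": "u1", "ip": "10.0.0.8", "device": "devA", "reg_time": 1000},
    {"id": "u2", "ip": "10.0.0.8", "device": "devB", "reg_time": 90000},
    {"id": "u3", "ip": "10.0.0.9", "device": "devB", "reg_time": 93000},
]

def shared_info(a, b, max_reg_gap=7200):
    """Return the registration/login attributes that two accounts share."""
    shared = [k for k in ("ip", "device") if a[k] == b[k]]
    if abs(a["reg_time"] - b["reg_time"]) <= max_reg_gap:  # registered within 2 hours
        shared.append("reg_time")
    return shared

# Each pair sharing at least one attribute becomes a group of associated accounts.
associated = [
    (a["id"], b["id"], shared_info(a, b))
    for a, b in combinations(accounts, 2)
    if shared_info(a, b)
]
print(associated)  # [('u1', 'u2', ['ip']), ('u2', 'u3', ['device', 'reg_time'])]
```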
Step S202, a plurality of account feature vectors are obtained according to the plurality of nodes, and each account feature vector corresponds to one node in the plurality of nodes.
As an example, the adjacency matrix describes the connections between nodes in the knowledge graph. Let the adjacency matrix be a |V|×|V| matrix, where |V| is the number of nodes in the knowledge graph. Each column and each row in the matrix represents a node, and a non-zero value in the matrix indicates that two nodes are connected. Using the adjacency matrix directly as the feature space of a large-scale graph makes the data volume of the computation huge; to make the computation simpler and quicker, a graph embedding approach is generally adopted to pack node attributes into a vector of smaller dimension. This embodiment describes a preferred graph embedding method through the following example, so as to convert the various nodes into low-dimensional account feature vectors.
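A small illustration of the |V|×|V| adjacency matrix described above, with hypothetical accounts and weights:

```python
import numpy as np

# Nodes are user accounts; a non-zero entry marks an edge between two accounts.
node_index = {"u1": 0, "u2": 1, "u3": 2}          # hypothetical accounts
edges = [("u1", "u2", 1.0), ("u2", "u3", 2.0)]    # (account, account, weight)

n = len(node_index)
adj = np.zeros((n, n))
for a, b, w in edges:
    i, j = node_index[a], node_index[b]
    adj[i, j] = adj[j, i] = w    # undirected association graph
print(adj)
```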
As an example, account feature vectors may be generated for the respective nodes by: step S202-1: embedding a plurality of nodes and a plurality of edges corresponding to a plurality of groups of associated accounts on the knowledge graph into an objective function, and calculating a plurality of account feature vectors corresponding to the plurality of nodes through the objective function; the objective function is:
O = -Σ_{(i,j)∈E} E_ij [ log p_1(Φ(u_i), Φ(u_j)) + log p_2(Φ(u_i) | Φ'(u_j)) ]

wherein E_ij represents the weight of an edge; u_i and u_j represent the i-th node v_i and the j-th node v_j respectively; Φ'(u_j) represents the account feature vector representation of the neighboring nodes of node v_j; Φ(u_i) and Φ(u_j) represent the account feature vector representations of node v_i and node v_j respectively; p_1 denotes the joint probability distribution of two account feature vectors, and p_2 denotes the conditional probability distribution of an account feature vector given the vectors of its neighboring nodes.
For example, the weight E_ij of an edge is calculated as follows: assuming that account i is simultaneously associated with account j and account h, that the number of associations between i and j is 2, and that the number of associations between i and h is 3, then the weight E_ij = 2/(2+3) = 2/5. The number of associations refers to the number of identical pieces of user account information between the two accounts of a group of associated accounts. For example, with the distance range between users configured through the user interface as 200 meters, a distance of 100 meters between account i and account j counts as one identical piece of user account information between them; similarly, with the configured interval between account registrations as 2 hours, a 1-hour interval between the registrations of account i and account j counts as a second identical piece of user account information, i.e., there are two identical pieces of user account information between account i and account j.
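A tiny sketch of this weight computation; since the original formula image is not preserved, the normalization over all of an account's associations is an assumption drawn from the worked example above:

```python
def edge_weight(counts, i, j):
    """E_ij as the (i, j) association count normalised over all of i's associations (assumed)."""
    total = sum(c for (a, _), c in counts.items() if a == i)
    return counts[(i, j)] / total

# Account i shares 2 pieces of information with j and 3 with h, as in the example.
counts = {("i", "j"): 2, ("i", "h"): 3}
print(edge_weight(counts, "i", "j"))  # 2 / (2 + 3) = 0.4
```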
Step S203: a plurality of text feature vectors of the plurality of user texts are extracted, each of the text feature vectors corresponding to one of the user texts.
Text representation is fundamental work in natural language processing, and its quality directly affects the performance of the overall natural language processing system. Text vectorization represents text as a series of vectors, recognizable by a computer, that express the text's semantics, and is an important way of text representation. A conventional text vectorization method is the bag-of-words model, which has the following problems: word order information cannot be preserved, and semantic deviation exists. This embodiment provides a preferred text vectorization scheme to solve these problems of the bag-of-words model.
The step of extracting a plurality of text feature vectors of the plurality of user texts comprises:
step 203-1, preprocessing the text of the plurality of users to obtain a plurality of vector matrixes;
step S203-2, inputting the plurality of vector matrixes into a convolutional neural network to obtain a plurality of text feature vectors corresponding to the plurality of vector matrixes, wherein each vector matrix corresponds to one text feature vector in the plurality of text feature vectors.
In order to remove irrelevant information from the user text so that the computer device 6 can better understand the semantics of the text, natural language preprocessing is performed before the text is vectorized; a preferred scheme for preprocessing the user text is provided below.
preferably, as shown in fig. 4, the step of preprocessing the text of the plurality of users to obtain a plurality of vector matrices includes:
step S2031-1, performing word segmentation processing on each sentence in a plurality of user texts to obtain word segmentation sets of the user texts;
the word segmentation process is to divide each sentence in the plurality of user texts into a plurality of words according to grammar rules;
As an example, the computer device 6 may perform the word segmentation with the help of the following word segmentation tools: HanLP, FudanNLP, LTP, THULAC, NLPIR, BosonNLP, Baidu NLP, Tencent Wenzhi, or Aliyun NLP; the word segmentation rules of the above tools belong to the prior art and are not described in detail here.
After the word segmentation set of each user text is obtained, the latest stop-word list can be downloaded from open-source platforms such as GitHub, and stop words are removed from the segmented text according to the stop-word list. In information retrieval, certain characters or words are automatically filtered out before or after processing natural language data (or text) in order to save storage space and improve search efficiency; such words are called stop words.
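A minimal sketch of segmentation plus stop-word filtering, using jieba as a stand-in segmenter (the embodiment names HanLP, LTP, THULAC and similar tools); the tiny stop-word set stands in for a downloaded list:

```python
import jieba  # stand-in segmenter; the embodiment names HanLP, LTP, THULAC, etc.

stopwords = {"的", "了", "是"}  # in practice loaded from a downloaded stop-word list

def preprocess(sentence):
    """Segment a sentence and drop stop words and whitespace tokens."""
    return [w for w in jieba.lcut(sentence) if w not in stopwords and w.strip()]

print(preprocess("我今天很高兴"))  # e.g. ['我', '今天', '很', '高兴']
```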
Step S2031-2, encoding each word in each word segmentation set to convert each word in each word segmentation set into a corresponding word vector;
step S2031-3, taking sentences of each user text as units, and acquiring a vector matrix of each sentence; the vector matrix is constructed according to a plurality of word vectors corresponding to a plurality of words of the corresponding sentence, and each row of the vector matrix corresponds to one word vector.
As a preferred scheme, the objective function in the embodiment of the invention is a vector-space-based algorithm, while the user's original content consists of discrete features; to make distance calculations between discrete features more reasonable, each feature is treated as a continuous feature, and in this embodiment each word in each sentence is encoded with one-hot encoding. The encoding principle is illustrated by the following example, in which the user texts have three feature attributes:
Person words: ["me", "you", "he", "she"], where "me" is encoded as 1000, "you" as 0100, "he" as 0010, and "she" as 0001;
time words: ["yesterday", "today", "tomorrow", "the day after tomorrow"], where "yesterday" is encoded as 1000, "today" as 0100, "tomorrow" as 0010, and "the day after tomorrow" as 0001;
mood words: ["very happy", "very angry", "very sad", "very afraid"], where "very happy" is encoded as 1000, "very angry" as 0100, "very sad" as 0010, and "very afraid" as 0001.
An existing user text comprises ["me", "today", "very happy"]; after one-hot encoding, the user's original content becomes: [1,0,0,0, 0,1,0,0, 1,0,0,0].
The operation of combining word vectors into a vector matrix is illustrated by the following example. Assume a user text contains 3 word vectors, namely "me", "today", and "very happy"; after conversion into a vector matrix, the 3 word vectors are expressed as:

[1 0 0 0]
[0 1 0 0]
[1 0 0 0]
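A short sketch of the encoding and stacking just described; the vocabularies and words mirror the example above, and everything else is illustrative:

```python
import numpy as np

# Vocabulary per feature attribute, mirroring the example above.
person = ["me", "you", "he", "she"]
time_w = ["yesterday", "today", "tomorrow", "the day after tomorrow"]
mood   = ["very happy", "very angry", "very sad", "very afraid"]

def one_hot(vocab, word):
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

# Each row of the sentence's vector matrix is one word's one-hot vector.
matrix = np.stack([
    one_hot(person, "me"),
    one_hot(time_w, "today"),
    one_hot(mood, "very happy"),
])
print(matrix)
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [1. 0. 0. 0.]]
```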
As an example, to implement the conversion of the vectorized user text into text feature vectors, the training method of the sensitive content recognition model may further adopt the following scheme:
As a preferred solution, the convolutional neural network model adopted in this embodiment is preferably TextCNN. Its workflow generally includes: first, the vector matrix of a text is embedded through a Word2Vec model and trained, the aim being to convert the vector matrix into sentence vectors that the machine can recognize; convolution kernels of different sizes are then defined, based on the size of the vector matrix, to extract features; pooling is then carried out through a pooling layer, where the pooling process screens out at least one largest feature; and the pooled features are finally spliced into a text feature vector.
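A minimal PyTorch sketch of a TextCNN-style feature extractor following this workflow; all layer sizes, kernel sizes and filter counts are illustrative assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

class TextCNNFeatures(nn.Module):
    """Sketch of a TextCNN-style feature extractor; all sizes are illustrative."""

    def __init__(self, embed_dim=128, kernel_sizes=(2, 3, 4), n_filters=100):
        super().__init__()
        # One 1-D convolution per kernel size, applied along the word dimension.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_filters, k) for k in kernel_sizes
        )

    def forward(self, x):            # x: (batch, seq_len, embed_dim) vector matrix
        x = x.transpose(1, 2)        # -> (batch, embed_dim, seq_len) for Conv1d
        # max-pool each feature map, keeping the largest feature per filter
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)   # concatenated text feature vector

feats = TextCNNFeatures()(torch.randn(1, 20, 128))
print(feats.shape)  # torch.Size([1, 300])
```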
As an example, to implement the concatenation operation between vectors, the training method of the sensitive content identification model may further include the following steps:
and S204, splicing the feature vector of each user account and the corresponding one or more text feature vectors to obtain a plurality of fusion feature vectors.
The features of the two dimensions (namely the account feature vector and the text feature vector) are fused by performing the vector concatenation operation:

Φ_i = Φ_{i,1} ⊕ Φ_{i,2}

where Φ_i represents the fusion feature vector, Φ_{i,1} represents the account feature vector, Φ_{i,2} represents the text feature vector, and ⊕ represents the vector concatenation operation.
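A one-line illustration of the concatenation, assuming NumPy arrays for the two feature vectors (the dimensions are arbitrary):

```python
import numpy as np

account_vec = np.array([0.2, 0.7, 0.1])         # Phi_{i,1}: account feature vector
text_vec    = np.array([0.9, 0.3, 0.5, 0.8])    # Phi_{i,2}: text feature vector

fused = np.concatenate([account_vec, text_vec])  # Phi_i = Phi_{i,1} concatenated with Phi_{i,2}
print(fused.shape)  # (7,)
```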
Step S205, taking the fusion feature vectors as a plurality of groups of training samples, and respectively inputting each group of training samples into a classification model to be trained so as to train the classification model to be trained, thereby obtaining the sensitive content identification model.
As a further preferred mode of this embodiment, in order to keep the predicted output consistent with the expected output as far as possible, the user texts carry two labels during training, namely sensitive content text and non-sensitive content text. On the one hand, the number of sensitive content texts in each set of training samples used to train the different classifiers is the same, where each classifier corresponds to one classification model of the sensitive content recognition model, and the number of non-sensitive content texts in each set of training samples is likewise the same. On the other hand, the cross entropy of the predicted output and the expected output is calculated by the following formula, and the parameters of the classifier are adjusted so that the calculated cross entropy approaches 0:

Loss = -(1/N) Σ_{i=1}^{N} [ y_i · log f(x_i) + (1 - y_i) · log(1 - f(x_i)) ]

where Loss is the value of the cross entropy, N is the number of training samples, y_i is the expected output for sample x_i, and f(x_i) is the predicted output for sample x_i.
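A small sketch of the cross-entropy computation under the usual binary form reconstructed above; the labels and predictions are made-up values:

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross entropy between expected and predicted outputs."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # guard against log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])          # 1 = sensitive content text, 0 = non-sensitive
y_pred = np.array([0.9, 0.2, 0.8, 0.6])  # made-up classifier outputs
print(cross_entropy(y_true, y_pred))     # approaches 0 as predictions match the labels
```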
Embodiment 2
Fig. 5 schematically shows a flowchart of a text recognition method according to a second embodiment of the present invention.
As shown in fig. 5, the text recognition method may include steps S300 to S304, in which:
step S300, determining a target user text to be processed; and extracting text feature vectors of the target user text.
The method for extracting the text feature vector of the target user text is the same as the method corresponding to the first embodiment, and will not be described herein.
Step S301, searching a target user account associated with the target user text according to the target user text;
illustratively, the computer device 6 in this embodiment searches for a target user account that publishes the target user text through the cloud server 2.
Step S302, according to the target user account, acquiring an account characteristic vector corresponding to the target user account;
Preferably, the computer device 6 may query, in the knowledge graph constructed in the first embodiment, the target node corresponding to the target user account, the other nodes associated with the target node, and the edges between the target node and those nodes; the target node, the other nodes, and the edges between them are embedded into the objective function provided in the first embodiment, and the account feature vector corresponding to the target node is calculated through the objective function.
Step S303, splicing the account feature vector and the text feature vector to obtain a fusion feature vector;
and step S304, inputting the fusion feature vector into a trained sensitive content recognition model to output a target text type of the target user text through the sensitive content recognition model, wherein the target text type is sensitive content text or non-sensitive content text.
The step of inputting the fusion feature vector into a trained sensitive content recognition model to output a target text type of the target user text through the sensitive content recognition model includes:
step S304-1, inputting the fusion feature vector into each classifier in the plurality of classifiers to obtain a plurality of text types; wherein the text types are in one-to-one correspondence with the output results of the classifiers; and determining the text types with the quantity ratio larger than a preset threshold value as the target text types according to the plurality of text types. Wherein the sensitive content identification model comprises a plurality of classifiers, and the preset threshold value is preferably 1/2.
For example, because different classifiers achieve different recall and precision after training on the multiple sets of training samples, using a single classifier to decide the target type of the target user text carries a larger prediction error. To further reduce this prediction error, three kinds of classifiers are adopted in this embodiment, namely SVM, XGBoost and Logistic. The three classifiers vote simultaneously on the target type of the target text; each voting result is 0 or 1, where 0 indicates that the target text type is non-sensitive content text and 1 indicates that it is sensitive content text. For example, if two of the three classifiers recognize the target type of the target user text as 0 and the other recognizes it as 1, the result of the target user text recognition is non-sensitive content text.
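A minimal sketch of the majority vote over the three classifiers' outputs; the helper name and threshold handling are illustrative:

```python
from collections import Counter

def majority_vote(votes, threshold=0.5):
    """Return the text type whose share of classifier votes exceeds the threshold.

    votes: one 0/1 result per classifier (e.g. SVM, XGBoost, Logistic), where
    0 = non-sensitive content text and 1 = sensitive content text.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > threshold else None

print(majority_vote([0, 0, 1]))  # 0 -> recognised as non-sensitive content text
```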
Embodiment 3
Fig. 6 schematically shows a block diagram of a training system according to a third embodiment of the invention, which may be divided into program modules, one or more of which are stored in a storage medium and executed by a processor to carry out an embodiment of the invention. Program modules in accordance with the embodiments of the present invention are directed to a series of computer program instruction segments capable of performing the specified functions, and the following description describes each program module in detail.
As shown in fig. 6, the training system 130 may include an acquisition module 131, a map construction module 132, a conversion module 133, an extraction module 134, a vector stitching module 135, and a training module 136. Wherein:
an obtaining module 131, configured to obtain a plurality of user texts and a plurality of user accounts; each user text is a sensitive content text or a non-sensitive content text, and each user text is respectively associated with one user account;
the map construction module 132 is configured to construct a knowledge map based on the plurality of user accounts and association relationships among the user accounts; the knowledge graph comprises a plurality of nodes, and each node corresponds to one user account in the plurality of user accounts;
A conversion module 133, configured to obtain a plurality of account feature vectors according to the plurality of nodes, where each account feature vector corresponds to one node of the plurality of nodes;
an extracting module 134, configured to extract a plurality of text feature vectors of the plurality of user texts, each of the text feature vectors corresponding to one user text;
the vector stitching module 135 is configured to stitch each user account feature vector and one or more corresponding text feature vectors to obtain a plurality of fused feature vectors;
the training module 136 is configured to take the plurality of fusion feature vectors as a plurality of sets of training samples, and input each set of training samples into a classification model to be trained, so as to train the classification model to be trained, so as to obtain the sensitive content recognition model.
The map construction module 132 is further configured to obtain account information of each user account in the plurality of user accounts, so as to obtain a plurality of user account information; acquiring a plurality of groups of associated accounts based on the plurality of user account information; the user account information comprises registration login information of corresponding user accounts, and each group of associated accounts comprises two user accounts with the same at least one user account information; constructing a knowledge graph according to a plurality of groups of associated account numbers; each user account corresponds to one node in the knowledge graph, and the same user account information between the two user accounts of each group of associated accounts is used for constructing an edge between the corresponding two nodes.
The conversion module 133 is further configured to embed a plurality of nodes and a plurality of edges corresponding to the plurality of groups of associated accounts on the knowledge graph into an objective function, and calculate a plurality of account feature vectors corresponding to the plurality of nodes through the objective function, where the objective function is:
O = -Σ_{(i,j)∈E} E_ij [ log p_1(Φ(u_i), Φ(u_j)) + log p_2(Φ(u_i) | Φ'(u_j)) ]

wherein E_ij represents the weight of an edge; u_i and u_j represent the i-th node v_i and the j-th node v_j respectively; Φ'(u_j) represents the account feature vector representation of the neighboring nodes of node v_j; Φ(u_i) and Φ(u_j) represent the account feature vector representations of node v_i and node v_j respectively; p_1 denotes the joint probability distribution of two account feature vectors, and p_2 denotes the conditional probability distribution of an account feature vector given the vectors of its neighboring nodes.
The extracting module 134 is further configured to pre-process the plurality of user texts to obtain a plurality of vector matrices; the plurality of vector matrices are input to a convolutional neural network to obtain a plurality of text feature vectors corresponding to the plurality of vector matrices, each vector matrix corresponding to one of the plurality of text feature vectors.
The extracting module 134 is further configured to perform word segmentation processing on each sentence in the plurality of user texts, so as to obtain a word segmentation set of each user text; encoding each word in each word segmentation set to convert each word in each word segmentation set into a corresponding word vector; taking sentences of each user text as units, and acquiring a vector matrix of each sentence; the vector matrix is constructed according to a plurality of word vectors corresponding to a plurality of words of the corresponding sentence, and each row of the vector matrix corresponds to one word vector.
Embodiment 4
Fig. 7 schematically shows a hardware architecture diagram of a computer device 6 adapted to implement training of a sensitive content recognition model and text recognition according to a fourth embodiment of the invention. In the present embodiment, the computer device 6 is a device capable of automatically performing numerical calculation and/or information processing in accordance with instructions set or stored in advance. For example, it may be a personal computer, rack server, blade server, tower server or cabinet server (including a stand-alone server or a server cluster composed of a plurality of servers), gateway, etc. As shown in fig. 7, the computer device 6 includes at least, but is not limited to: memory 141, processor 142, network interface 143 may be communicatively linked to each other via a system bus; wherein:
the memory 141 includes at least one type of computer-readable storage medium including flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory, etc.), random Access Memory (RAM), static Random Access Memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 141 may be an internal storage module of the computer device 6, such as a hard disk or memory of the computer device 6. In other embodiments, the memory 141 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like, which are provided on the computer device 6. Of course, memory 141 may also include both internal memory modules of computer device 6 and external memory devices. In this embodiment, the memory 141 is typically used to store an operating system and various types of application software installed on the computer device 6, such as program code for a virus scanning and presentation method. In addition, the memory 141 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 142 may be a central processing unit (Central Processing Unit, simply CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 142 is typically used to control the overall operation of the computer device 6, such as performing control and processing related to data interaction or communication with the computer device 6, and the like. In this embodiment, processor 142 is used to execute program code or process data stored in memory 141.
The network interface 143 may comprise a wireless network interface or a wired network interface, the network interface 143 typically being used to establish a communication link between the computer device 6 and other computer devices. For example, the network interface 143 is used to connect the computer device 6 to an external terminal through a network, establish a data transmission path and a communication link between the computer device 6 and the external terminal, and the like. The network may be a wireless or wired network such as an Intranet (Intranet), the Internet (Internet), a global system for mobile communications (Global System of Mobile communication, abbreviated as GSM), wideband code division multiple access (Wideband Code Division Multiple Access, abbreviated as WCDMA), a 4G network, a 5G network, bluetooth (Bluetooth), wi-Fi, etc.
It should be noted that fig. 7 only shows a computer device having components 141-143, but it should be understood that not all of the illustrated components are required to be implemented, and that more or fewer components may be implemented instead.
In this embodiment, the training and text recognition method of the sensitive content recognition model stored in the memory 141 may also be divided into one or more program modules and executed by a processor (the processor 142 in this embodiment) to perform the embodiment of the present invention.
Embodiment 5
The present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the training of the sensitive content recognition model and the text recognition method in the embodiments.
In this embodiment, the computer-readable storage medium includes a flash memory, a hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the computer readable storage medium may be an internal storage unit of a computer device, such as a hard disk or a memory of the computer device. In other embodiments, the computer readable storage medium may also be an external storage device of a computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card), etc. that are provided on the computer device. Of course, the computer-readable storage medium may also include both internal storage units of a computer device and external storage devices. In this embodiment, the computer readable storage medium is typically used to store an operating system and various application software installed on a computer device, for example, program code for implementing the training of the sensitive content recognition model and the text recognition method in the embodiment. Furthermore, the computer-readable storage medium may also be used to temporarily store various types of data that have been output or are to be output.
It will be apparent to those skilled in the art that the modules or steps of the embodiments of the invention described above may be implemented in a general-purpose computing device; they may be concentrated on a single computing device or distributed across a network of computing devices. Optionally, they may be implemented in program code executable by computing devices, so that they may be stored in a storage device and executed by computing devices; in some cases, the steps shown or described may be performed in a different order than shown or described here. Alternatively, they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.
The foregoing description covers only the preferred embodiments of the present invention and is not intended to limit its scope; any equivalent structures or equivalent process transformations derived from this disclosure, whether employed directly or indirectly in other related technical fields, fall likewise within the scope of the invention.

Claims (7)

1. A method for training a sensitive content recognition model, the method comprising:
Acquiring a plurality of user texts and a plurality of user accounts; each user text is sensitive content text or non-sensitive content text, and each user text is respectively associated with one user account;
constructing a knowledge graph based on the plurality of user accounts and the association relation among the user accounts; the knowledge graph comprises a plurality of nodes, and each node corresponds to one user account in the plurality of user accounts;
acquiring a plurality of account feature vectors according to the plurality of nodes, wherein each account feature vector corresponds to one node in the plurality of nodes;
extracting a plurality of text feature vectors of the plurality of user texts, wherein each text feature vector corresponds to one user text;
splicing each user account feature vector with one or more corresponding text feature vectors to obtain a plurality of fusion feature vectors; and
Taking the fusion feature vectors as a plurality of groups of training samples, respectively inputting each group of training samples into a classification model to be trained, and training the classification model to be trained to obtain the sensitive content identification model;
the step of obtaining a plurality of account feature vectors according to the plurality of nodes comprises the following steps:
Substituting the plurality of nodes on the knowledge graph and the plurality of edges corresponding to the plurality of groups of associated accounts into an objective function, and calculating, through the objective function, a plurality of account feature vectors corresponding to the plurality of nodes, wherein the objective function is as follows:
wherein E_ij represents the weight of an edge; u_i and u_j denote the i-th node v_i and the j-th node v_j, respectively; Φ'(u_j) represents the account feature vector representation of the neighbor nodes of node v_j; Φ(u_i) and Φ(u_j) represent the account feature vector representations of nodes v_i and v_j, respectively; p_1(Φ(u_i), Φ(u_j)) represents the joint probability distribution of the account feature vectors Φ(u_i) and Φ(u_j); and p_2(Φ(u_i) | Φ'(u_j)) represents the conditional probability distribution of the account feature vector Φ(u_i) given Φ'(u_j);
wherein the step of extracting a plurality of text feature vectors of the plurality of user texts comprises:
preprocessing the plurality of user texts to obtain a plurality of vector matrices;
inputting the plurality of vector matrices into a convolutional neural network to obtain a plurality of text feature vectors corresponding to the plurality of vector matrices, each vector matrix corresponding to one of the plurality of text feature vectors;
the step of preprocessing the plurality of user texts to obtain a plurality of vector matrices comprises the following steps:
Performing word segmentation processing on each sentence in the plurality of user texts to obtain a word segmentation set of each user text;
encoding each word in each word segmentation set to convert each word in each word segmentation set into a corresponding word vector;
taking each sentence of each user text as a unit, acquiring a vector matrix for each sentence; the vector matrix is constructed from the word vectors corresponding to the words of the corresponding sentence, and each row of the vector matrix corresponds to one word vector.
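To make the claimed pipeline concrete, the following is a minimal sketch of the preprocessing and feature-extraction steps of claim 1: word segmentation, encoding of words into a per-sentence vector matrix, convolutional extraction of a text feature vector, and concatenation with an account feature vector. The patent discloses no tokenizer, vector sizes, or network layout, so every name and dimension below (segment, encode, TextCNN, EMBED_DIM, and so on) is an illustrative assumption.

```python
# Illustrative sketch only: the patent does not specify the tokenizer,
# embedding size, or CNN layout; every value below is an assumed toy setup.
import numpy as np
import torch
import torch.nn as nn

EMBED_DIM = 8   # assumed word-vector size
MAX_WORDS = 6   # assumed fixed sentence length (pad or truncate)

def segment(sentence: str) -> list:
    # Stand-in for a real word-segmentation step (e.g. a Chinese tokenizer);
    # whitespace splitting keeps the sketch dependency-free.
    return sentence.split()

def encode(words, vocab) -> np.ndarray:
    """Build the per-sentence vector matrix: one word vector per row."""
    rows = [vocab.get(w, np.zeros(EMBED_DIM)) for w in words[:MAX_WORDS]]
    rows += [np.zeros(EMBED_DIM)] * (MAX_WORDS - len(rows))  # pad short sentences
    return np.stack(rows).astype(np.float32)  # shape: (MAX_WORDS, EMBED_DIM)

class TextCNN(nn.Module):
    """Tiny convolutional extractor: vector matrix -> text feature vector."""
    def __init__(self, out_dim: int = 16):
        super().__init__()
        self.conv = nn.Conv1d(EMBED_DIM, out_dim, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveMaxPool1d(1)

    def forward(self, x):
        # x: (batch, MAX_WORDS, EMBED_DIM); convolve over word positions
        h = torch.relu(self.conv(x.transpose(1, 2)))  # (batch, out_dim, MAX_WORDS)
        return self.pool(h).squeeze(-1)               # (batch, out_dim)

# Toy usage: one user text plus a placeholder account feature vector.
vocab = {w: np.random.randn(EMBED_DIM) for w in "win a free prize now".split()}
matrix = encode(segment("win a free prize now"), vocab)
text_vec = TextCNN()(torch.from_numpy(matrix).unsqueeze(0))  # (1, 16)
account_vec = torch.randn(1, 4)                              # assumed account embedding
fused = torch.cat([account_vec, text_vec], dim=1)            # fusion feature vector
print(fused.shape)  # torch.Size([1, 20])
```

In practice the segmentation step would call a real Chinese tokenizer and the word vectors would come from a trained embedding table; the random toy vocabulary only keeps the sketch self-contained.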
2. The method for training a sensitive content recognition model according to claim 1, wherein the step of constructing a knowledge graph based on the plurality of user accounts and association relations between the user accounts comprises:
acquiring account information of each user account in the plurality of user accounts to obtain a plurality of user account information;
acquiring a plurality of groups of associated accounts based on the plurality of user account information; the user account information comprises registration and login information of the corresponding user account, and each group of associated accounts comprises two user accounts sharing at least one item of identical user account information; and
Constructing the knowledge graph according to the plurality of groups of associated accounts; each user account corresponds to one node in the knowledge graph, and the identical user account information shared by the two user accounts of each group of associated accounts is used to construct an edge between the corresponding two nodes.
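A hedged sketch of claim 2's graph construction, assuming the networkx library and invented account-information fields (device, ip): two accounts sharing at least one identical field form a group of associated accounts, and the shared information labels the edge between their nodes.

```python
# Hedged sketch of claim 2: link two accounts whenever they share at least one
# item of registration/login information. Field names here are invented examples.
import networkx as nx
from itertools import combinations

accounts = {
    "acct_1": {"device": "dev_A", "ip": "10.0.0.1"},
    "acct_2": {"device": "dev_A", "ip": "10.0.0.2"},
    "acct_3": {"device": "dev_B", "ip": "10.0.0.2"},
}

graph = nx.Graph()
graph.add_nodes_from(accounts)  # one node per user account
for (a, info_a), (b, info_b) in combinations(accounts.items(), 2):
    shared = sorted(k for k in info_a if info_a[k] == info_b.get(k))
    if shared:  # identical information -> a group of associated accounts
        graph.add_edge(a, b, shared=shared)  # the shared fields label the edge

print(list(graph.edges(data=True)))
# [('acct_1', 'acct_2', {'shared': ['device']}), ('acct_2', 'acct_3', {'shared': ['ip']})]
```

The objective function of claim 1 then operates on this graph's nodes and weighted edges; its joint and conditional probability terms over node vectors are reminiscent of the first- and second-order proximity objectives used in graph-embedding methods such as LINE.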
3. A method of text recognition, comprising:
determining a target user text to be processed, and extracting a text feature vector of the target user text;
searching a target user account associated with the target user text according to the target user text;
according to the target user account, acquiring an account characteristic vector corresponding to the target user account;
splicing the account feature vector and the text feature vector to obtain a fusion feature vector; and
Inputting the fusion feature vector into a trained sensitive content recognition model to output a target text type of the target user text through the sensitive content recognition model, wherein the target text type is sensitive content text or non-sensitive content text, and the sensitive content recognition model is a model obtained by training the training method of the sensitive content recognition model according to any one of claims 1-2.
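The recognition flow of claim 3 reduces to concatenating the two vectors and running one forward pass. Below is a minimal sketch with a stand-in linear classifier; the vector sizes, the model, and the 0.5 decision threshold are assumptions, not values from the patent.

```python
# Minimal sketch of the recognition flow in claim 3; the model, vector sizes,
# and 0.5 decision threshold are assumed placeholders, not the patent's values.
import torch
import torch.nn as nn

def recognize(text_vec, account_vec, model):
    fused = torch.cat([account_vec, text_vec], dim=1)  # fusion feature vector
    prob = torch.sigmoid(model(fused))                 # probability of "sensitive"
    return "sensitive content text" if prob.item() > 0.5 else "non-sensitive content text"

model = nn.Linear(20, 1)  # stand-in for the trained sensitive content recognition model
print(recognize(torch.randn(1, 16), torch.randn(1, 4), model))
```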
4. A text recognition method according to claim 3, wherein the sensitive content recognition model includes a plurality of classifiers, and the step of inputting the fused feature vector into a trained sensitive content recognition model to output a target text type of the target user text through the sensitive content recognition model comprises:
Inputting the fusion feature vector into each classifier of the plurality of classifiers to obtain a plurality of text types, wherein the text types are in one-to-one correspondence with the output results of the classifiers; and
Determining, from the plurality of text types, the text type whose proportion is greater than a preset threshold as the target text type.
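Claim 4's ensemble decision can be sketched as a majority vote in which the winning type's share must exceed a preset threshold; the threshold of 0.5 here is an assumed example.

```python
# Sketch of the ensemble vote in claim 4: each classifier emits a text type and
# the type whose share exceeds a preset threshold wins; 0.5 is an assumed value.
from collections import Counter
from typing import List, Optional

def vote(predictions: List[str], threshold: float = 0.5) -> Optional[str]:
    label, n = Counter(predictions).most_common(1)[0]
    return label if n / len(predictions) > threshold else None

print(vote(["sensitive", "sensitive", "non-sensitive"]))  # sensitive (2/3 > 0.5)
```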
5. A training system for a sensitive content recognition model, comprising:
the acquisition module is used for acquiring a plurality of user texts and a plurality of user accounts; each user text is sensitive content text or non-sensitive content text, and each user text is respectively associated with one user account;
the map construction module is used for constructing a knowledge map based on the plurality of user accounts and the association relation among the user accounts; the knowledge graph comprises a plurality of nodes, and each node corresponds to one user account in the plurality of user accounts;
the conversion module is used for acquiring a plurality of account feature vectors according to the plurality of nodes, and each account feature vector corresponds to one node in the plurality of nodes;
the extraction module is used for extracting a plurality of text feature vectors of the user texts, and each text feature vector corresponds to one user text;
The vector splicing module is used for splicing each user account feature vector and one or more corresponding text feature vectors to obtain a plurality of fusion feature vectors;
the training module is used for taking the fusion feature vectors as a plurality of groups of training samples, respectively inputting each group of training samples into a classification model to be trained, and training the classification model to be trained to obtain the sensitive content recognition model;
wherein the conversion module is further used for:
substituting the plurality of nodes on the knowledge graph and the plurality of edges corresponding to the plurality of groups of associated accounts into an objective function, and calculating, through the objective function, a plurality of account feature vectors corresponding to the plurality of nodes, wherein the objective function is as follows:
wherein E_ij represents the weight of an edge; u_i and u_j denote the i-th node v_i and the j-th node v_j, respectively; Φ'(u_j) represents the account feature vector representation of the neighbor nodes of node v_j; Φ(u_i) and Φ(u_j) represent the account feature vector representations of nodes v_i and v_j, respectively; p_1(Φ(u_i), Φ(u_j)) represents the joint probability distribution of the account feature vectors Φ(u_i) and Φ(u_j); and p_2(Φ(u_i) | Φ'(u_j)) represents the conditional probability distribution of the account feature vector Φ(u_i) given Φ'(u_j);
wherein the extraction module is further used for:
preprocessing the plurality of user texts to obtain a plurality of vector matrices;
inputting the plurality of vector matrices into a convolutional neural network to obtain a plurality of text feature vectors corresponding to the plurality of vector matrices, each vector matrix corresponding to one of the plurality of text feature vectors;
wherein preprocessing the plurality of user texts to obtain a plurality of vector matrices comprises the following steps:
performing word segmentation processing on each sentence in the plurality of user texts to obtain a word segmentation set of each user text;
encoding each word in each word segmentation set to convert each word in each word segmentation set into a corresponding word vector;
taking each sentence of each user text as a unit, acquiring a vector matrix for each sentence; the vector matrix is constructed from the word vectors corresponding to the words of the corresponding sentence, and each row of the vector matrix corresponds to one word vector.
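For the training module of claims 1 and 5, a single toy optimization step might look as follows; the loss function, optimizer, batch, and dimensions are assumptions, since the claims specify only that fusion feature vectors are fed as training samples into the classification model.

```python
# Toy single training step for the classification model of claims 1 and 5; the
# loss, optimizer, batch, and dimensions are assumptions, not patent disclosures.
import torch
import torch.nn as nn

model = nn.Linear(20, 1)  # stand-in classification model to be trained
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

fused = torch.randn(32, 20)                    # a batch of fusion feature vectors
labels = torch.randint(0, 2, (32, 1)).float()  # 1 = sensitive, 0 = non-sensitive

optimizer.zero_grad()
loss = loss_fn(model(fused), labels)           # compare predictions with labels
loss.backward()
optimizer.step()
print(float(loss))
```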
6. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the training method of the sensitive content recognition model of any one of claims 1-2, or of the text recognition method of any one of claims 3-4, when executing the computer program.
7. A computer-readable storage medium having stored therein a computer program executable by at least one processor to cause the at least one processor to perform the steps of the training method of the sensitive content recognition model of any one of claims 1-2, or of the text recognition method of any one of claims 3-4.
CN202110691212.9A 2021-06-22 2021-06-22 Training method of sensitive content recognition model, text recognition method and related device Active CN113254649B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110691212.9A | 2021-06-22 | 2021-06-22 | Training method of sensitive content recognition model, text recognition method and related device

Publications (2)

Publication Number | Publication Date
CN113254649A (en) | 2021-08-13
CN113254649B (en) | 2023-07-18

Family

ID=77189068

Family Applications (1)

Application Number | Title | Priority Date | Filing Date | Status
CN202110691212.9A | Training method of sensitive content recognition model, text recognition method and related device | 2021-06-22 | 2021-06-22 | Active

Country Status (1)

Country Link
CN (1) CN113254649B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant