CN111522958A - Text classification method and device - Google Patents


Info

Publication number
CN111522958A
Authority
CN
China
Prior art keywords
text, value, classification model, loss function, text classification
Prior art date
Legal status
Pending
Application number
CN202010468798.8A
Other languages
Chinese (zh)
Inventor
陈利琴
闫永泽
刘设伟
Current Assignee
Taikang Insurance Group Co Ltd
Taikang Online Property Insurance Co Ltd
Original Assignee
Taikang Insurance Group Co Ltd
Taikang Online Property Insurance Co Ltd
Priority date
Filing date
Publication date
Application filed by Taikang Insurance Group Co Ltd, Taikang Online Property Insurance Co Ltd filed Critical Taikang Insurance Group Co Ltd
Priority to CN202010468798.8A
Publication of CN111522958A

Classifications

    • G06F16/355: Information retrieval of unstructured textual data; Clustering; Classification; Class or cluster creation or modification
    • G06F18/214: Pattern recognition; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/045: Neural networks; Combinations of networks
    • G06N3/08: Neural networks; Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G06N3/088: Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text classification method and device, and relates to the technical field of computers. The method comprises the following steps: constructing a first training sample set based on the text data carrying the class labels, and constructing a second training sample set based on the text data not carrying the class labels; carrying out supervised learning on the first training sample set, and determining the value of a loss function under the supervised learning; carrying out unsupervised learning on the second training sample set, and determining the value of a loss function under the unsupervised learning; determining the value of a mixed loss function according to the value of the loss function under supervised learning and the value of the loss function under unsupervised learning; updating parameters of the text classification model according to the value of the mixed loss function to obtain a trained text classification model; and determining the category of the text to be detected based on the trained text classification model. Through the steps, the training efficiency of the text classification model can be improved, and the classification accuracy of the text classification can be improved.

Description

Text classification method and device
Technical Field
The invention relates to the technical field of computers, in particular to a text classification method and device.
Background
Text classification is an important task in the field of natural language processing; its goal is to assign documents to predefined categories. Existing text classification methods generally adopt supervised learning and typically employ only a single loss function.
In the process of implementing the invention, the inventors found the following: training a text classification model by supervised learning often requires a large amount of labeled data, whereas in practice a large amount of unlabeled data typically exists. To make use of this unlabeled data, the prior art generally resorts to manual labeling or automatic labeling. However, manual labeling is time-consuming and labor-intensive, which slows the training of the text classification model to a certain extent, while automatic labeling is likely to be inaccurate, which degrades the classification effect of the model to a certain extent. In addition, since the loss function has an important influence on the final model parameters, a text classification model trained with a single loss function often performs poorly.
Disclosure of Invention
In view of this, the present invention provides a text classification method and device, which can not only improve the training efficiency of a text classification model, but also improve the classification accuracy of text classification.
To achieve the above object, according to one aspect of the present invention, there is provided a text classification method.
The text classification method comprises the following steps: constructing a first training sample set based on text data carrying class labels, and constructing a second training sample set based on text data not carrying class labels; performing supervised learning on the first training sample set, and determining the value of a loss function under supervised learning; performing unsupervised learning on the second training sample set, and determining the value of a loss function under unsupervised learning; determining the value of a mixed loss function according to the value of the loss function under supervised learning and the value of the loss function under unsupervised learning; updating parameters of the text classification model according to the value of the mixed loss function to obtain an updated text classification model; taking the updated text classification model as the trained text classification model under the condition that the updated text classification model meets the iterative training stop condition; and processing the text to be detected based on the trained text classification model so as to determine the category of the text to be detected.
Optionally, the performing supervised learning on the first training sample set and determining the value of the loss function under supervised learning includes: inputting the first training sample set into a text classification model to obtain a first output value of the text classification model, and determining a first value of a first loss function according to the first output value of the text classification model; performing adversarial perturbation on the first training sample set to obtain an adversarial sample set; and inputting the adversarial sample set into the text classification model to obtain a second output value of the text classification model, and determining a second value of the first loss function according to the second output value of the text classification model; wherein the text classification model is a pre-trained model.
Optionally, the first loss function is a cross entropy loss function or a conditional entropy loss function.
Optionally, when the first loss function is a cross-entropy loss function, performing adversarial perturbation on the first training sample set to obtain an adversarial sample set includes: determining a perturbation value for adversarially perturbing the first training sample set according to the following formulas, and injecting the perturbation value into the first training sample set to obtain the adversarial sample set:

r_adv = ε · g / ||g||_2
g = ∇_v log p(y | v; θ)
v' = v + r_adv

where r_adv denotes the perturbation value; ε denotes a coefficient controlling the magnitude of the perturbation; g denotes the gradient obtained by differentiating the first output value of the text classification model with respect to the embedding-layer parameters; ||g||_2 denotes the L2 norm of the gradient; p(y | v; θ) denotes the output value of the text classification model; v' denotes a generated adversarial sample; and v denotes a training sample in the first training sample set.
Optionally, the performing unsupervised learning on the second training sample set and determining the value of the loss function under unsupervised learning comprises: inputting the second training sample set into a text classification model to obtain a third output value of the text classification model, and determining a first value of a second loss function according to the third output value of the text classification model; performing virtual adversarial perturbation on the second training sample set to obtain a virtual adversarial sample set; and inputting the virtual adversarial sample set into the text classification model to obtain a fourth output value of the text classification model, and determining a second value of the second loss function according to the fourth output value of the text classification model.
Optionally, the second loss function is a conditional entropy loss function; and the performing virtual adversarial perturbation on the second training sample set to obtain a virtual adversarial sample set includes: performing adversarial perturbation on the second training set according to a random initialization vector to obtain an adversarial sample set; inputting the adversarial sample set into the text classification model to obtain a fifth output value of the text classification model; calculating a KL divergence value according to the third output value and the fifth output value of the text classification model; and determining a perturbation value for virtual adversarial perturbation of the second training sample set according to the KL divergence value, and injecting the perturbation value into the second training sample set to obtain the virtual adversarial sample set.
Optionally, the processing the text to be detected based on the trained text classification model to determine the category of the text to be detected includes: preprocessing a text to be detected, and constructing a feature vector to be detected according to the preprocessed text; and inputting the characteristic vector to be detected into a trained text classification model to determine the category of the text to be detected.
To achieve the above object, according to another aspect of the present invention, there is provided a text classification apparatus.
The text classification device of the present invention includes: a building module for building a first training sample set based on text data carrying category labels and building a second training sample set based on text data not carrying category labels; a determining module for performing supervised learning on the first training sample set and determining the value of a loss function under supervised learning, performing unsupervised learning on the second training sample set and determining the value of a loss function under unsupervised learning, and determining the value of a mixed loss function according to the value of the loss function under supervised learning and the value of the loss function under unsupervised learning; an updating module for updating the parameters of the text classification model according to the value of the mixed loss function, and further for taking the updated text classification model as the trained text classification model under the condition that the updated text classification model meets the iterative training stop condition; and a classification module for processing the text to be detected based on the trained text classification model so as to determine the category of the text to be detected.
To achieve the above object, according to still another aspect of the present invention, there is provided an electronic apparatus.
The electronic device of the present invention includes: one or more processors; and a storage means for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the text classification method of the present invention.
To achieve the above object, according to still another aspect of the present invention, there is provided a computer-readable medium.
The computer-readable medium of the invention, on which a computer program is stored which, when being executed by a processor, carries out the text classification method of the invention.
One embodiment of the above invention has the following advantages or benefits. A first training sample set is constructed based on text data carrying class labels, and a second training sample set is constructed based on text data not carrying class labels. Supervised learning is performed on the first training sample set to determine the value of a loss function under supervised learning, and unsupervised learning is performed on the second training sample set to determine the value of a loss function under unsupervised learning; the value of a mixed loss function is then determined from these two values, and the parameters of the text classification model are updated according to the value of the mixed loss function. Under the condition that the updated text classification model meets the iterative training stop condition, the updated model is taken as the trained text classification model, and the category of the text to be detected is determined based on the trained model. Through these steps, the training efficiency of the text classification model can be improved, and the classification accuracy of text classification can be improved.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
fig. 1 is a main flow diagram of a text classification method according to a first embodiment of the invention;
FIG. 2 is a schematic diagram of a main flow of a model training phase in a text classification method according to a second embodiment of the present invention;
FIG. 3 is a schematic flow chart of a text compliance detection method according to a third embodiment of the present invention;
fig. 4 is a schematic diagram of the main blocks of a text classification apparatus according to a fourth embodiment of the present invention;
FIG. 5 is a schematic diagram of the main blocks of a text compliance detection apparatus according to a fifth embodiment of the present invention;
FIG. 6 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 7 is a block diagram of a computer system suitable for use with the electronic device to implement an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
Fig. 1 is a main flow diagram of a text classification method according to a first embodiment of the present invention. As shown in fig. 1, the text classification method in the embodiment of the present invention includes:
step S101, a first training sample set is constructed based on the text data carrying the category labels, and a second training sample set is constructed based on the text data not carrying the category labels.
For example, in a text compliance detection scenario, the category labels may include two category labels, "compliance" and "non-compliance". Further, the "compliance" category label may be represented by "0" and the "non-compliance" category label may be represented by "1". In this example, text data carrying a category label may be input into an embedding layer to obtain a first training sample set, and text data not carrying a category label may be input into an embedding layer to obtain a second training sample set. The text data carrying the category labels comprise a plurality of labeled sentences, and the first training sample set comprises feature vectors corresponding to the labeled sentences; the text data which does not carry the category labels comprises a plurality of unlabeled sentences, and the second training sample set comprises feature vectors corresponding to the unlabeled sentences.
Further, in an alternative implementation of this example, the embedding layer is divided into three layers in total, namely, token embedding layer, segment embedding layer, and position embedding layer. After the marked sentences are input into the three embedding layers, three different vector representations can be obtained, and then the sum of the three vector representations is used as the characteristic vector of the marked sentences; after the unlabeled sentence is input into the three embedding layers, three different vector representations can be obtained, and then the sum of the three vector representations is used as the feature vector of the unlabeled sentence. Further, after inputting the text data carrying the category labels into the three embedding layers, a first training sample set can be obtained; after the text data which does not carry the category label is input into the three embedding layers, a second training sample set can be obtained.
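As an illustration of this three-part embedding, the following is a minimal sketch assuming PyTorch (the patent does not name a framework); the vocabulary, segment, position, and dimension sizes are illustrative assumptions, not values fixed by the patent:

```python
import torch
import torch.nn as nn

class TextEmbedding(nn.Module):
    # Sum of token, segment, and position embeddings, as described above.
    # All sizes below are illustrative assumptions.
    def __init__(self, vocab_size=21128, max_len=512, num_segments=2, dim=768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, dim)
        self.segment = nn.Embedding(num_segments, dim)
        self.position = nn.Embedding(max_len, dim)

    def forward(self, token_ids, segment_ids):
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        # The feature vector of a sentence is the sum of the three
        # vector representations, as stated in the paragraph above.
        return self.token(token_ids) + self.segment(segment_ids) + self.position(pos)
```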
It should be noted that the present invention is not limited to the application scenario of text compliance detection. Under the condition of not influencing the implementation of the invention, the invention can be suitable for various text classification scenes.
Step S102, carrying out supervised learning on the first training sample set, and determining the value of a loss function under the supervised learning; carrying out unsupervised learning on the second training sample set, and determining the value of a loss function under the unsupervised learning; and determining the value of the mixed loss function according to the value of the loss function under the supervised learning and the value of the loss function under the unsupervised learning.
Illustratively, in this step, a first training sample set may be input into the text classification model for supervised learning, and a value of a loss function under the supervised learning may be determined; inputting a second training sample set into the text classification model for unsupervised learning, and determining the value of a loss function under unsupervised learning; then, the value of the loss function under supervised learning and the value of the loss function under unsupervised learning are weighted and summed, and the weighted and summed result is taken as the value of the mixed loss function. In specific implementation, the supervised learning can be performed on the first training sample set to determine the value of the loss function under the supervised learning, and then the unsupervised learning is performed on the second training sample set to determine the value of the loss function under the unsupervised learning; or, the second training sample set may be unsupervised to determine the value of the loss function under unsupervised learning, and then the first training sample set may be supervised to determine the value of the loss function under supervised learning.
Further, in an optional embodiment, the loss function under supervised learning is a cross entropy loss function, and the loss function under unsupervised learning is a conditional entropy loss function. In another alternative embodiment, the loss function under supervised learning is a conditional entropy loss function, and the loss function under unsupervised learning is a conditional entropy loss function.
Step S103, updating parameters of the text classification model according to the value of the mixed loss function to obtain an updated text classification model; and, under the condition that the updated text classification model meets the iterative training stop condition, taking the updated text classification model as the trained text classification model.
In step S103, "updating the parameters of the text classification model according to the values of the mixture loss function to obtain an updated text classification model" may be understood as a back propagation process. In this step, gradient values may be calculated based on the mixture-loss function, thereby updating parameters of the text classification model.
Under the condition that the updated text classification model meets the iterative training stopping condition, the trained text classification model can be finally obtained and further used in the application stage. The iterative training stop condition may be flexibly set according to a requirement, for example, the iterative training stop condition may be set as: the iterative training times reach the preset times; alternatively, the iterative training stop condition may be set to: the model classification effect evaluation parameter satisfies a preset range, and the like.
And S104, processing the text to be detected based on the trained text classification model to determine the category of the text to be detected.
In an optional example, step S104 specifically includes: preprocessing the text to be detected, and constructing a feature vector to be detected according to the preprocessed text; and inputting the feature vector to be detected into the trained text classification model to determine the category of the text to be detected. The preprocessing may include sentence segmentation of the text to be detected, removal of special characters, conversion of traditional Chinese characters to simplified characters, and/or conversion of uppercase English letters to lowercase.
For example, in the scenario of detecting the compliance of insurance contract text, the text to be detected is the insurance contract text. The preprocessing process for the insurance contract text may include the following steps: extracting all clauses in the insurance contract, and segmenting all clauses into sentences to obtain an initial corpus; and preprocessing the initial corpus, including removing special characters, converting traditional Chinese characters to simplified characters, converting uppercase English letters to lowercase, and the like. Then, a feature vector to be detected is constructed according to the preprocessed text, and the feature vector to be detected is input into the trained text classification model to determine whether the insurance contract text is compliant.
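A minimal sketch of this preprocessing pipeline, assuming Python; the sentence-ending punctuation set and the character filter are illustrative assumptions, and traditional-to-simplified conversion would require an additional library (e.g. OpenCC), so it is only noted in a comment:

```python
import re

def preprocess(clause_text):
    # Sketch of the preprocessing described above: sentence segmentation,
    # special-character removal, and uppercase-to-lowercase conversion.
    # Traditional-to-simplified Chinese conversion would need an extra
    # library (e.g. OpenCC) and is omitted here.
    sentences = re.split(r"[。！？!?\n]", clause_text)
    return [re.sub(r"[^\w\u4e00-\u9fff]", "", s).lower()  # drop special characters
            for s in sentences if s.strip()]
```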
In the embodiment of the invention, a first training sample set is constructed based on text data carrying class labels, and a second training sample set is constructed based on text data not carrying class labels. Supervised learning is performed on the first training sample set to determine the value of a loss function under supervised learning, and unsupervised learning is performed on the second training sample set to determine the value of a loss function under unsupervised learning. The value of a mixed loss function is determined according to these two values, and the parameters of the text classification model are updated according to the value of the mixed loss function. Under the condition that the updated text classification model meets the iterative training stop condition, the updated model is taken as the trained text classification model, and the category of the text to be detected is determined based on the trained model. In this way, the training efficiency of the text classification model can be improved, and the classification accuracy of the final text classification can be improved.
Fig. 2 is a schematic main flow chart of a model training phase in a text classification method according to a second embodiment of the present invention. As shown in fig. 2, the training phase of the text classification model according to the embodiment of the present invention includes:
step S201, a first training sample set is constructed based on the text data carrying the category labels, and a second training sample set is constructed based on the text data not carrying the category labels.
For example, in a text compliance detection scenario, the category labels may include two category labels, "compliance" and "non-compliance". Further, the "compliance" category label may be represented by "0" and the "non-compliance" category label may be represented by "1". In this example, text data carrying a category label may be input into an embedding layer to obtain a first training sample set, and text data not carrying a category label may be input into an embedding layer to obtain a second training sample set. The text data carrying the category labels comprise a plurality of labeled sentences, and the first training sample set comprises feature vectors corresponding to the labeled sentences; the text data which does not carry the category labels comprises a plurality of unlabeled sentences, and the second training sample set comprises feature vectors corresponding to the unlabeled sentences.
Further, in an alternative implementation of this example, the embedding layer is divided into three layers in total, namely, token embedding layer, segment embedding layer, and position embedding layer. After the marked sentences are input into the three embedding layers, three different vector representations can be obtained, and then the sum of the three vector representations is used as the characteristic vector of the marked sentences; after the unlabeled sentence is input into the three embedding layers, three different vector representations can be obtained, and then the sum of the three vector representations is used as the feature vector of the unlabeled sentence. Further, after inputting the text data carrying the category labels into the three embedding layers, a first training sample set can be obtained; after the text data which does not carry the category label is input into the three embedding layers, a second training sample set can be obtained.
Further, before step S201, the method of the embodiment of the present invention may further include the following steps: data preprocessing and data annotation. For example, in a scenario of detecting compliance of insurance contract texts, the model can be trained only after data preprocessing and data labeling are performed on the contract texts.
Specifically, the data preprocessing process for the insurance contract text may include the following steps. Step 1.1, extracting all clauses in the insurance contract, and segmenting all clauses into sentences to obtain the initial corpus for training the model. Step 1.2, dividing the initial corpus generated in step 1.1 into two types according to whether labeling is required, marked respectively as Marked (labeling required) and Unmarked (labeling not required), so that supervised learning and unsupervised learning can subsequently be performed on each. Step 1.3, loading the initial corpus classified in step 1.2 and performing data preprocessing, including removing special characters, converting traditional Chinese characters to simplified characters, converting uppercase English letters to lowercase, and the like.
Further, the data annotation process for the insurance contract text may include the following steps. Step 1.4, annotating the corpus marked as Marked in step 1.3 into two categories according to whether each sentence is compliant, labeled respectively as 0 and 1 (where 0 denotes compliance and 1 denotes non-compliance). For example, the sentence "the start and end times of the insurance period of this insurance contract are based on the dates specified in the insurance contract, and the insurance period is one year at the longest" is labeled as 1, while the sentence "the start and end times of the insurance period of this insurance contract are based on the dates specified in the insurance contract, and the insurance period must not exceed one year" is labeled as 0. Step 1.5, taking the corpus marked as Unmarked in step 1.3 as unsupervised learning samples, labeling all of it as "UNLABEL".
Step S202, inputting the first training sample set into a text classification model, and determining a first value of a first loss function according to a first output value of the text classification model.
In the embodiment of the present invention, the text classification model adopts a pre-trained model, such as BERT, XLNET, ERNIE, and the like. The BERT model is a pre-trained model developed by Google, whose full name is Bidirectional Encoder Representations from Transformers. In specific implementations, different BERT models can be selected according to different NLP (natural language processing) tasks. For example, for a Chinese text classification task, a corresponding Chinese pre-trained model, such as the BERT-Base model, may be selected. In specific implementations, the pre-trained model can be fine-tuned for the specific NLP downstream task. Specifically, in the embodiment of the present invention, the training process of the pre-trained model based on the first training sample set is equivalent to fine-tuning the pre-trained model on the first training sample set.
Illustratively, when the text classification model adopts a BERT model, it may specifically include: a bidirectional Transformer (encoding) layer, a fully-connected (linear) layer, and a softmax layer. In this example, after the first training sample set is obtained in step S201, it is input to the multi-layer bidirectional Transformer layer, so that the context information of the text can be fully learned by the multi-head attention mechanism inside the bidirectional Transformer layer, thereby obtaining a bidirectional encoded representation of the text. Then, the feature vectors bidirectionally encoded by the Transformer layer pass through the linear layer and the softmax layer in turn, so that the first output value of the text classification model, namely the text category prediction probability of each sample in the first training sample set, can be calculated using the linear transformation function of the linear layer and the softmax function of the softmax layer. Thereafter, a first value of the first loss function may be determined from the first output value of the text classification model.
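The following sketch illustrates this structure, assuming PyTorch and the Hugging Face transformers library's BertModel as the bidirectional Transformer; the model name and the use of the [CLS] representation are assumptions, not details fixed by the patent:

```python
import torch.nn as nn
from transformers import BertModel  # assumed dependency

class BertClassifier(nn.Module):
    # Sketch of the structure described above: bidirectional Transformer
    # encoder -> fully-connected (linear) layer -> softmax.
    def __init__(self, name="bert-base-chinese", num_classes=2):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        self.linear = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask=None):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]        # [CLS] token representation
        return self.linear(cls).log_softmax(-1)  # class log-probabilities
```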
In the embodiment of the invention, the pre-training model is selected as the text classification model, and the labeled data and the unlabeled data are combined for semi-supervised learning, so that the context information of the text can be better extracted, the training efficiency is improved, and a better text classification model can be trained under the condition of less data.
In an alternative embodiment, the first loss function is a cross-entropy loss function, selected to represent the difference between the text class prediction probability of a sample and its true class label. Specifically, the first value of the first loss function may be expressed as:

L_ml(θ) = -(1/m_l) · Σ_{i=1..m_l} Σ_{k∈K} f(y^(i) = k) · log p(y^(i) = k | v^(i); θ)

where L_ml(θ) denotes the first value of the first loss function; f(y^(i) = k) denotes the indicator function, equal to 1 if the true label y^(i) of the sample equals the class label k, and 0 otherwise; m_l denotes the number of samples in the first training sample set; v^(i) denotes a sample in the first training sample set; y^(i) denotes the class label of v^(i); K denotes the set of class labels; θ denotes the model parameters; and p(y^(i) = k | v^(i); θ) denotes the output value of the text classification model, i.e., the probability that sample v^(i) is predicted as class k.
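A one-line sketch of L_ml, assuming PyTorch and a model that outputs class log-probabilities (as in the classifier sketch above):

```python
import torch.nn.functional as F

def supervised_loss(log_probs, labels):
    # L_ml: mean negative log-likelihood of the true labels, i.e. the
    # cross-entropy above with the indicator function selecting k = y^(i).
    return F.nll_loss(log_probs, labels)
```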
In another alternative embodiment, the first loss function is a conditional entropy loss function, which is used to represent the difference between the text class prediction probability of the sample and its true class label.
Step S203, performing adversarial perturbation on the first training sample set to obtain an adversarial sample set; and inputting the adversarial sample set into the text classification model, and determining a second value of the first loss function according to a second output value of the text classification model.
For example, in this step, a perturbation value for adversarially perturbing the first training sample set may be determined according to the following formulas, and the perturbation value injected into the first training sample set to obtain the adversarial sample set:

r_adv = ε · g / ||g||_2
g = ∇_v log p(y | v; θ)
v' = v + r_adv

where r_adv denotes the perturbation value; ε denotes a coefficient controlling the magnitude of the perturbation; g denotes the gradient obtained by differentiating the first output value of the text classification model with respect to the embedding-layer parameters; ||g||_2 denotes the L2 norm of the gradient; p(y | v; θ) denotes the output value of the text classification model; v' denotes a generated adversarial sample; and v denotes a training sample in the first training sample set.

After the adversarial sample set is obtained according to the above formulas, it may be input into the text classification model, and a second value of the first loss function determined according to the model output value (i.e., the second output value of the text classification model described above). Specifically, the second value of the first loss function may be expressed as:

L_at(θ) = -(1/m_l) · Σ_{i=1..m_l} Σ_{k∈K} f(y^(i) = k) · log p(y^(i) = k | (v + r_adv)^(i); θ)

where L_at(θ) denotes the second value of the first loss function; f(y^(i) = k) denotes the indicator function, equal to 1 if the true label y^(i) of the adversarial sample equals the class label k, and 0 otherwise; m_l denotes the number of adversarial samples in the adversarial sample set; (v + r_adv)^(i) denotes an adversarial sample in the adversarial sample set; y^(i) denotes the class label of the adversarial sample; K denotes the set of class labels; θ denotes the model parameters; and p(y^(i) = k | (v + r_adv)^(i); θ) denotes the output value of the text classification model, i.e., the probability that the adversarial sample is predicted as class k.
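A hedged PyTorch sketch of this adversarial step, in the style of fast-gradient perturbation; here the perturbation is computed on the embedding vectors directly, and the model interface, epsilon default, and loss form are assumptions consistent with the formulas above:

```python
import torch
import torch.nn.functional as F

def adversarial_loss(model, embeds, labels, epsilon=1.0):
    # `model` is assumed to map embedding vectors to class log-probabilities;
    # `epsilon` plays the role of the magnitude coefficient ε above.
    embeds = embeds.detach().requires_grad_(True)
    loss = F.nll_loss(model(embeds), labels)
    g, = torch.autograd.grad(loss, embeds)       # gradient w.r.t. embeddings
    r_adv = epsilon * g / (g.norm(p=2) + 1e-12)  # r_adv = ε · g / ||g||_2
    adv = (embeds + r_adv).detach()              # v' = v + r_adv
    return F.nll_loss(model(adv), labels)        # second value of the first loss, L_at
```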
In the embodiment of the invention, by injecting adversarial samples during model training, the text classification model can be made robust to adversarial noise, improving the generalization capability of the text classification model.
Step S204, inputting the second training sample set into the text classification model, and determining a first value of a second loss function according to a third output value of the text classification model.
Illustratively, in this step, the second loss function is a conditional entropy loss function, which measures the uncertainty of the model's class predictions on the unlabeled samples (these samples carry no true class labels, so minimizing the conditional entropy encourages confident predictions). Specifically, the first value of the second loss function may be expressed as:
L_em(θ) = -(1/m_*) · Σ_{i=1..m_*} Σ_{k∈K} p(y^(i) = k | v^(i); θ) · log p(y^(i) = k | v^(i); θ)

where L_em(θ) denotes the first value of the second loss function; m_* denotes the number of samples in the second training sample set; v^(i) denotes a sample in the second training sample set; y^(i) denotes the class label variable of the sample; K denotes the set of class labels; θ denotes the model parameters; and p(y^(i) = k | v^(i); θ) denotes the output value of the text classification model, i.e., the probability that sample v^(i) is predicted as class k.
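A minimal sketch of L_em, again assuming a model that outputs log-probabilities:

```python
def entropy_loss(log_probs):
    # L_em: conditional entropy -Σ_k p(k)·log p(k) of the model's predictions
    # on unlabeled samples, averaged over the batch.
    return -(log_probs.exp() * log_probs).sum(dim=-1).mean()
```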
Step S205, performing virtual adversarial perturbation on the second training sample set to obtain a virtual adversarial sample set; and inputting the virtual adversarial sample set into the text classification model, and determining a second value of the second loss function according to a fourth output value of the text classification model.
For example, the second training sample set may be subjected to virtual adversarial perturbation according to the processing shown in steps a to d to obtain the virtual adversarial sample set.
Step a, performing adversarial perturbation on the second training set according to a random initialization vector to obtain an adversarial sample set.
Specifically, in step a, the second training set may be perturbed according to the following formula:
v' = v + ζ·d
where v' denotes a generated adversarial sample; v denotes a training sample in the second training sample set; ζ denotes a hyper-parameter; and d denotes a random initialization vector. In a specific implementation, a vector d may be randomly initialized for each training sample in the second training sample set, with the element values of d drawn from a normal distribution.
Step b, inputting the adversarial sample set into the text classification model to obtain a fifth output value of the text classification model.
Specifically, in this step, the adversarial sample set obtained in step a is input into the text classification model, and the fifth output value of the text classification model, i.e., the prediction probability value p(v') of the text category label, can be obtained.
Step c, calculating a KL divergence value according to the third output value and the fifth output value of the text classification model.
Specifically, in this step, a KL divergence value may be calculated from the third output value p(v) and the fifth output value p(v') of the text classification model. The KL divergence, also referred to as relative entropy, is an asymmetric measure of the difference between two probability distributions.
Step d, determining a perturbation value for virtual adversarial perturbation of the second training sample set according to the KL divergence value, and injecting the perturbation value into the second training sample set to obtain the virtual adversarial sample set.
Specifically, in step d, the perturbation value of the virtual adversarial perturbation can be calculated according to the following formulas:

r_vat = ε · g / ||g||_2
g = ∇_{v'} D_KL(p(· | v; θ) || p(· | v'; θ))

where r_vat denotes the perturbation value of the virtual adversarial perturbation; g denotes the gradient obtained by differentiating the KL divergence value with respect to v'; ε is a hyper-parameter; and ||g||_2 is the L2 norm of the gradient g.

After the perturbation value of the virtual adversarial perturbation is obtained according to steps a to d, the perturbation value may be injected into the second training sample set to obtain the virtual adversarial sample set. The virtual adversarial sample set may then be input into the text classification model, and a second value of the second loss function determined according to the model output value (i.e., the fourth output value of the text classification model described above). Specifically, the second value of the second loss function may be expressed as:

L_vat(θ) = (1/m_*) · Σ_{i=1..m_*} D_KL(p(· | v^(i); θ) || p(· | (v + r_vat)^(i); θ))

where L_vat(θ) denotes the second value of the second loss function; m_* denotes the number of samples in the virtual adversarial sample set; and D_KL denotes the KL divergence value.
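A hedged PyTorch sketch of steps a to d; `zeta` and `epsilon` correspond to ζ and the hyper-parameter ε above, while the detach points and default values are implementation assumptions:

```python
import torch
import torch.nn.functional as F

def virtual_adversarial_loss(model, embeds, zeta=1e-6, epsilon=1.0):
    # `model` is assumed to map embedding vectors to class log-probabilities.
    embeds = embeds.detach()
    with torch.no_grad():
        p = model(embeds).exp()                        # third output value, p(v)
    d = torch.randn_like(embeds).requires_grad_(True)  # random init vector d (step a)
    log_q = model(embeds + zeta * d)                   # fifth output value, p(v') (step b)
    kl = F.kl_div(log_q, p, reduction="batchmean")     # D_KL(p(v) || p(v')) (step c)
    g, = torch.autograd.grad(kl, d)                    # gradient of the KL value
    r_vat = epsilon * g / (g.norm(p=2) + 1e-12)        # r_vat = ε · g / ||g||_2 (step d)
    log_q_adv = model(embeds + r_vat.detach())         # fourth output value
    return F.kl_div(log_q_adv, p, reduction="batchmean")  # L_vat
```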
Step S206, carrying out weighted summation on the first value and the second value of the first loss function and the first value and the second value of the second loss function to obtain the value of the mixed loss function.
In the training of the text classification model, because different loss functions have an important influence on the final model parameters, the values of the loss functions can be fused by weighted summation to obtain the value of the mixed loss function. Specifically, the mixed loss function may be expressed as:

L_loss = λ_ml · L_ml(θ) + λ_at · L_at(θ) + λ_em · L_em(θ) + λ_vat · L_vat(θ)

where L_loss denotes the mixed loss function; the parameters λ_ml, λ_at, λ_em, and λ_vat denote weight values; L_ml(θ) denotes the first value of the first loss function; L_at(θ) denotes the second value of the first loss function; L_em(θ) denotes the first value of the second loss function; and L_vat(θ) denotes the second value of the second loss function.
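A sketch of this weighted fusion; the equal default weights are a placeholder assumption, since the patent does not fix the λ values:

```python
def mixed_loss(l_ml, l_at, l_em, l_vat, weights=(1.0, 1.0, 1.0, 1.0)):
    # L_loss = λ_ml·L_ml + λ_at·L_at + λ_em·L_em + λ_vat·L_vat
    w_ml, w_at, w_em, w_vat = weights
    return w_ml * l_ml + w_at * l_at + w_em * l_em + w_vat * l_vat
```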
Step S207, updating parameters of the text classification model according to the value of the mixed loss function to obtain an updated text classification model.
Step S207 may be understood as a back-propagation process. In this step, gradient values may be calculated based on the mixed loss function, thereby updating the parameters of the text classification model.
Step S208, taking the updated text classification model as the trained text classification model under the condition that the updated text classification model meets the iterative training stop condition.
In the embodiment of the present invention, step S201 to step S207 are iteratively executed until an iterative training stop condition is satisfied, so that a trained text classification model can be finally obtained for use in an application stage.
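Tying the pieces together, the following sketch iterates steps S201 to S207 using the loss sketches above; pairing the labeled and unlabeled loaders with zip and stopping after a preset number of epochs are assumptions (the patent also allows stopping on an evaluation metric):

```python
def train(model, labeled_loader, unlabeled_loader, optimizer, max_epochs=10):
    # One iteration of the loop body corresponds to steps S202-S207.
    for _ in range(max_epochs):                  # preset-count stop condition
        for (v, y), u in zip(labeled_loader, unlabeled_loader):
            loss = mixed_loss(
                supervised_loss(model(v), y),        # L_ml (step S202)
                adversarial_loss(model, v, y),       # L_at (step S203)
                entropy_loss(model(u)),              # L_em (step S204)
                virtual_adversarial_loss(model, u),  # L_vat (step S205)
            )
            optimizer.zero_grad()
            loss.backward()   # back-propagation of the mixed loss (step S207)
            optimizer.step()  # update text classification model parameters
    return model
```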
In the embodiment of the invention, the text classification model is trained by combining supervised learning and unsupervised learning, which reduces the manual labeling workload, avoids the poor training effect caused by inaccurate automatic labeling, improves the training efficiency of the text classification model, and improves the classification accuracy of the final text classification model. Furthermore, the loss function under supervised learning and the loss function under unsupervised learning are fused by weighted summation, and the model parameters are updated based on the resulting mixed loss function, which further improves the training effect of the text classification model and thus the accuracy of text compliance detection.
Fig. 3 is a main flow chart of a text compliance detection method according to a third embodiment of the present invention. As shown in fig. 3, the text compliance detection method according to the embodiment of the present invention includes:
step S301, preprocessing the text to be detected, and constructing the feature vector to be detected according to the preprocessed text.
Illustratively, in the scenario of detecting the compliance of insurance contract text, the text to be detected is the insurance contract text. In this example, the preprocessing process for the insurance contract text may include the following steps: extracting all clauses in the insurance contract, and segmenting all clauses into sentences to obtain an initial corpus; and preprocessing the initial corpus, including removing special characters, converting traditional Chinese characters to simplified characters, converting uppercase English letters to lowercase, and the like. Then, the preprocessed text is input into an embedding layer to obtain the feature vector to be detected.
Further, in an alternative implementation of this example, the embedding layer is divided into three layers in total, namely, token embedding layer, segment embedding layer, and position embedding layer. After the preprocessed text is input into the three embedding layers, three different vector representations can be obtained, and then the sum of the three vector representations is used as a feature vector to be detected.
Step S302, inputting the feature vector to be detected into a trained text classification model to determine the category of the text to be detected.
In this step, a text classification model trained based on the embodiment shown in fig. 1 or fig. 2 may be used to determine whether the text to be detected is compliant. For example, if "0" indicates compliant text and "1" indicates non-compliant text, a return value of 0 from the text classification model indicates that the text to be detected is compliant, and a return value of 1 indicates that it is not.
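A minimal sketch of this application-stage check, under the same assumptions as the training sketches (the model returns class log-probabilities and the feature vector comes from the embedding layer):

```python
import torch

def detect_compliance(model, feature_vec):
    # Returns 0 for compliant, 1 for non-compliant, per the convention above.
    with torch.no_grad():
        return model(feature_vec).argmax(dim=-1).item()
```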
In the embodiment of the invention, the automatic and intelligent text compliance detection is realized through the steps S301 to S302, so that the labor cost and the time cost in the text compliance detection are greatly reduced, and the accuracy of the text compliance detection is improved.
Fig. 4 is a schematic diagram of main blocks of a text classification apparatus according to a fourth embodiment of the present invention. As shown in fig. 4, the text classification apparatus 400 according to the embodiment of the present invention includes: a building module 401, a determining module 402, an updating module 403, and a classifying module 404.
The building module 401 is configured to build a first training sample set based on the text data carrying the category label, and build a second training sample set based on the text data not carrying the category label.
For example, in a text compliance detection scenario, the category labels may include two category labels, "compliance" and "non-compliance". Further, the "compliance" category label may be represented by "0" and the "non-compliance" category label may be represented by "1". In this example, text data carrying a category label may be input into an embedding layer to obtain a first training sample set, and text data not carrying a category label may be input into an embedding layer to obtain a second training sample set. The text data carrying the category labels comprise a plurality of labeled sentences, and the first training sample set comprises feature vectors corresponding to the labeled sentences; the text data which does not carry the category labels comprises a plurality of unlabeled sentences, and the second training sample set comprises feature vectors corresponding to the unlabeled sentences.
Further, in an alternative implementation of this example, the embedding layer is divided into three layers in total, namely, token embedding layer, segment embedding layer, and position embedding layer. After the marked sentences are input into the three embedding layers, three different vector representations can be obtained, and then the sum of the three vector representations is used as the characteristic vector of the marked sentences; after the unlabeled sentence is input into the three embedding layers, three different vector representations can be obtained, and then the sum of the three vector representations is used as the feature vector of the unlabeled sentence. Further, after inputting the text data carrying the category labels into the three embedding layers, a first training sample set can be obtained; after the text data which does not carry the category label is input into the three embedding layers, a second training sample set can be obtained.
It should be noted that the present invention is not limited to the application scenario of text compliance detection. Under the condition of not influencing the implementation of the invention, the invention can be suitable for various text classification scenes.
A determining module 402, configured to perform supervised learning on the first training sample set, and determine a value of a loss function under the supervised learning; a determining module 402, configured to perform unsupervised learning on the second training sample set, and determine a value of a loss function under unsupervised learning; the determining module 402 is further configured to determine a value of a hybrid loss function according to the value of the loss function under supervised learning and the value of the loss function under unsupervised learning.
Illustratively, the text classification model employs a pre-trained model, such as BERT, XLNET, ERNIE, and the like. The BERT model is a pre-trained model developed by Google, whose full name is Bidirectional Encoder Representations from Transformers. In specific implementations, different BERT models can be selected according to different NLP (natural language processing) tasks. For example, for a Chinese text classification task, a corresponding Chinese pre-trained model, such as the BERT-Base model, may be selected.
In the embodiment of the invention, the pre-training model is selected as the text classification model, and the labeled data and the unlabeled data are combined for semi-supervised learning, so that the context information of the text can be better extracted, the training efficiency is improved, and a better text classification model can be trained under the condition of less data.
Specifically, the determining module 402 may input the first training sample set into the text classification model for supervised learning, and determine a value of a loss function under the supervised learning; the determining module 402 inputs the second training sample set into the text classification model for unsupervised learning, and determines a value of a loss function under unsupervised learning; then, the determination module 402 performs weighted summation on the value of the loss function under supervised learning and the value of the loss function under unsupervised learning, and takes the weighted summation result as the value of the mixed loss function. In specific implementation, the determining module 402 may perform supervised learning on the first training sample set to determine a value of the loss function under the supervised learning, and then perform unsupervised learning on the second training sample set to determine a value of the loss function under the unsupervised learning; alternatively, the determining module 402 may perform unsupervised learning on the second training sample set to determine the value of the loss function under unsupervised learning, and then perform supervised learning on the first training sample set to determine the value of the loss function under supervised learning.
Further, in an optional embodiment, the loss function under supervised learning is a cross entropy loss function, and the loss function under unsupervised learning is a conditional entropy loss function. In another alternative embodiment, the loss function under supervised learning is a conditional entropy loss function, and the loss function under unsupervised learning is a conditional entropy loss function.
An updating module 403, configured to update parameters of the text classification model according to the value of the mixed loss function; and further configured to take the updated text classification model as the trained text classification model under the condition that the updated text classification model meets the iterative training stop condition.
In particular, the update module 403 may calculate gradient values based on the mixed loss function, thereby updating the parameters of the text classification model. The text classification model is trained by iteratively invoking the construction module, the determining module, and the updating module, and the trained text classification model can finally be obtained for use in the application stage.
And the classification module 404 is configured to process the text to be detected based on the trained text classification model to determine the category of the text to be detected.
In an optional example, the processing, by the classification module 404, of the text to be detected based on the trained text classification model to determine the category of the text to be detected specifically includes: the classification module 404 preprocesses the text to be detected and constructs a feature vector to be detected according to the preprocessed text; the classification module 404 then inputs the feature vector to be detected into the trained text classification model to determine the category of the text to be detected. The preprocessing may include sentence segmentation of the text to be detected, removal of special characters, conversion of traditional Chinese characters to simplified characters, and/or conversion of uppercase English letters to lowercase.
For example, in the scenario of detecting the compliance of the insurance contract text, the text to be detected is the insurance contract text. The preprocessing process for the insurance contract text may include: the classification module 404 extracts all terms in the insurance contract, and performs sentence segmentation on all terms to obtain an initial corpus of the training model; the classification module 404 performs preprocessing on the initial corpus including removing special characters, converting from capitalization to shorthand, converting from capitalization to lowercase, and so on. Then, the classification module 404 may construct a feature vector to be detected according to the preprocessed text, and then input the feature vector to be detected into the trained text classification model to determine whether the insurance contract text is compliant.
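Illustratively, such preprocessing might be sketched as follows; the choice of sentence terminators and the use of the OpenCC library for traditional-to-simplified conversion are assumptions, since the embodiment does not name a tool:

```python
import re

from opencc import OpenCC  # assumed library for traditional -> simplified conversion

_t2s = OpenCC('t2s')

def preprocess(clause: str) -> list[str]:
    """Split a contract clause into cleaned sentences."""
    # Sentence segmentation on Chinese and Western terminators.
    sentences = re.split(r'[。！？!?；;]\s*', clause)
    cleaned = []
    for sentence in sentences:
        sentence = re.sub(r'[^\w\s]', ' ', sentence)  # remove special characters
        sentence = _t2s.convert(sentence)             # traditional -> simplified
        sentence = sentence.lower().strip()           # uppercase English -> lowercase
        if sentence:
            cleaned.append(sentence)
    return cleaned
```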
In the device provided by the embodiment of the invention, the construction module constructs a first training sample set based on text data carrying category labels and a second training sample set based on text data not carrying category labels; the determining module performs supervised learning on the first training sample set to determine the value of the loss function under supervised learning, performs unsupervised learning on the second training sample set to determine the value of the loss function under unsupervised learning, and determines the value of the mixed loss function from these two values; the updating module updates the parameters of the text classification model according to the value of the mixed loss function and takes the updated text classification model as the trained text classification model when it meets the iterative training stop condition; and the classification module determines the category of the text to be detected based on the trained text classification model. In this way, the training efficiency of the text classification model can be improved, and the accuracy of the final text classification can be improved.
Fig. 5 is a schematic diagram of main blocks of a text compliance detection apparatus according to a fifth embodiment of the present invention. As shown in fig. 5, the text compliance detection apparatus 500 according to the embodiment of the present invention includes: a preprocessing module 501, a construction module 502 and a detection module 503.
The preprocessing module 501 is configured to preprocess the text to be detected. Illustratively, in the scenario of detecting the compliance of an insurance contract text, the text to be detected is the insurance contract text. In this example, the preprocessing performed by the preprocessing module 501 on the insurance contract text may include: extracting all terms in the insurance contract and performing sentence segmentation on all terms to obtain an initial corpus; and preprocessing the initial corpus, including removing special characters, converting traditional Chinese characters to simplified characters, converting uppercase English letters to lowercase, and the like. The preprocessed text is then passed to the construction module 502.
And the construction module 502 is configured to construct the feature vector to be detected from the preprocessed text. For example, the construction module 502 may input the preprocessed text into an embedding layer to obtain the feature vector to be detected. Further, in an optional implementation of this example, the embedding layer consists of three layers: a token embedding layer, a segment embedding layer, and a position embedding layer. After the preprocessed text is input into the three embedding layers, the construction module 502 obtains three different vector representations, and the sum of the three vector representations is used as the feature vector to be detected.
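A minimal sketch of such a three-part embedding layer, assuming PyTorch; the vocabulary size, maximum sequence length, and hidden dimension are illustrative values not specified by the embodiment:

```python
import torch
import torch.nn as nn

class ThreePartEmbedding(nn.Module):
    """Token, segment, and position embeddings whose sum is the feature vector."""

    def __init__(self, vocab_size=21128, max_len=512, hidden=768):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, hidden)
        self.segment_embedding = nn.Embedding(2, hidden)
        self.position_embedding = nn.Embedding(max_len, hidden)

    def forward(self, token_ids, segment_ids):
        # One vector representation from each of the three embedding layers;
        # their sum is returned as the feature vector to be detected.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token_embedding(token_ids)
                + self.segment_embedding(segment_ids)
                + self.position_embedding(positions))
```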
The detection module 503 is configured to input the feature vector to be detected into the trained text classification model to determine the category of the text to be detected.
Specifically, the detection module 503 may use a text classification model trained based on the embodiment shown in fig. 1 or fig. 2 to determine whether the text to be detected is compliant. For example, if "0" indicates a compliant text and "1" indicates a non-compliant text, a return value of 0 from the text classification model indicates that the text to be detected is compliant, and a return value of 1 indicates that it is not compliant.
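Illustratively, this return-value convention could be wrapped in a small helper; the model and feature vector here are placeholders, and the 0/1 meaning follows the example above:

```python
import torch

def is_compliant(model, feature_vector) -> bool:
    """Return True if the model classifies the text as compliant (class 0)."""
    with torch.no_grad():
        logits = model(feature_vector)
        predicted = torch.argmax(logits, dim=-1).item()
    return predicted == 0  # 0 = compliant, 1 = non-compliant (per the example)
```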
In the embodiment of the invention, the automatic and intelligent text compliance detection is realized through the preprocessing module, the construction module and the detection module, so that the labor cost and the time cost in the text compliance detection are greatly reduced, and the accuracy of the text compliance detection is improved.
Fig. 6 shows an exemplary system architecture 600 to which the text classification method or the text classification apparatus of the embodiments of the present invention may be applied.
As shown in fig. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605. The network 604 serves to provide a medium for communication links between the terminal devices 601, 602, 603 and the server 605. The network 604 may include various types of connections, such as wired links, wireless communication links, or fiber optic cables.
A user may use the terminal devices 601, 602, 603 to interact with the server 605 via the network 604 to receive or send messages or the like. The terminal devices 601, 602, 603 may have various communication client applications installed thereon, such as insurance applications, shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, and the like.
The terminal devices 601, 602, 603 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 605 may be a server that provides various services, such as a background management server that supports insurance-like websites browsed by users using the terminal devices 601, 602, 603. The background management server may analyze and perform other processing on the received data such as the text compliance detection request, and feed back a processing result (e.g., a text compliance detection result) to the terminal device.
It should be noted that the text classification method provided by the embodiment of the present invention is generally executed by the server 605, and accordingly, the text classification apparatus is generally disposed in the server 605.
It should be understood that the number of terminal devices, networks, and servers in fig. 6 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for implementing an electronic device according to an embodiment of the present invention. The computer system illustrated in FIG. 7 is only an example and should not impose any limitation on the scope of use or functionality of embodiments of the invention.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 701.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a construction module, a determination module, an update module, and a classification module. Where the names of these modules do not in some cases constitute a limitation of the module itself, for example, a building module may also be described as a "module that builds a first set of training samples based on text data carrying category labels, and a second set of training samples based on text data not carrying category labels".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the following: constructing a first training sample set based on the text data carrying the class labels, and constructing a second training sample set based on the text data not carrying the class labels; performing supervised learning on the first training sample set, and determining the value of a loss function under the supervised learning; carrying out unsupervised learning on the second training sample set, and determining the value of a loss function under the unsupervised learning; determining a value of a mixed loss function according to the value of the loss function under supervised learning and the value of the loss function under unsupervised learning; updating parameters of the text classification model according to the value of the mixed loss function to obtain an updated text classification model; under the condition that the updated text classification model meets the iterative training stopping condition, taking the updated text classification model as a trained text classification model; and processing the text to be detected based on the trained text classification model so as to determine the category of the text to be detected.
According to the technical scheme of the embodiment of the invention, the training efficiency of the text classification model can be improved, and the classification accuracy of the final text classification can be improved.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of text classification, the method comprising:
constructing a first training sample set based on the text data carrying the class labels, and constructing a second training sample set based on the text data not carrying the class labels;
performing supervised learning on the first training sample set, and determining the value of a loss function under the supervised learning; carrying out unsupervised learning on the second training sample set, and determining the value of a loss function under the unsupervised learning; determining a value of a mixed loss function according to the value of the loss function under supervised learning and the value of the loss function under unsupervised learning;
updating parameters of the text classification model according to the value of the mixed loss function to obtain an updated text classification model; under the condition that the updated text classification model meets the iterative training stopping condition, taking the updated text classification model as a trained text classification model;
and processing the text to be detected based on the trained text classification model so as to determine the category of the text to be detected.
2. The method of claim 1, wherein the performing supervised learning on the first training sample set and determining the value of the loss function under supervised learning comprises:
inputting the first training sample set into a text classification model to obtain a first output value of the text classification model, and determining a first value of a first loss function according to the first output value of the text classification model; performing adversarial perturbation on the first training sample set to obtain an adversarial sample set; and inputting the adversarial sample set into the text classification model to obtain a second output value of the text classification model, and determining a second value of the first loss function according to the second output value of the text classification model; wherein the text classification model is a pre-training model.
3. The method of claim 2, wherein the first loss function is a cross-entropy loss function or a conditional entropy loss function.
4. The method of claim 3, wherein, when the first loss function is a cross-entropy loss function, the performing adversarial perturbation on the first training sample set to obtain an adversarial sample set comprises:
determining a perturbation value for adversarial perturbation of the first training sample set according to the following formulas, and injecting the perturbation value into the first training sample set to obtain the adversarial sample set:

r_adv = ε · g / ||g||_2

v' = v + r_adv

wherein r_adv represents the perturbation value; ε represents a coefficient controlling the magnitude of the perturbation; g represents the gradient value obtained by differentiating with respect to the embedding-layer parameters after the first output value of the text classification model is calculated; ||g||_2 represents the L2 norm of the gradient value; f(v) represents the output value of the text classification model; v' represents the generated adversarial sample; and v represents a training sample in the first training sample set.
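For illustration only, and not as part of the claims: a minimal sketch of the perturbation above, assuming that the gradient g has already been accumulated into v.grad by a backward pass of the cross-entropy loss:

```python
import torch

def adversarial_sample(v: torch.Tensor, epsilon: float = 1.0) -> torch.Tensor:
    """Apply r_adv = epsilon * g / ||g||_2, then v' = v + r_adv."""
    g = v.grad                                    # gradient of the loss w.r.t. the embedding
    r_adv = epsilon * g / (g.norm(p=2) + 1e-12)   # small constant avoids division by zero
    return (v + r_adv).detach()                   # generated adversarial sample v'
```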
5. The method of claim 2, wherein the performing unsupervised learning on the second training sample set and determining the value of the loss function under unsupervised learning comprises:
inputting the second training sample set into the text classification model to obtain a third output value of the text classification model, and determining a first value of a second loss function according to the third output value of the text classification model; performing virtual adversarial perturbation on the second training sample set to obtain a virtual adversarial sample set; and inputting the virtual adversarial sample set into the text classification model to obtain a fourth output value of the text classification model, and determining a second value of the second loss function according to the fourth output value of the text classification model.
6. The method of claim 5, wherein the second loss function is a conditional entropy loss function, and the performing virtual adversarial perturbation on the second training sample set to obtain a virtual adversarial sample set comprises:
performing adversarial perturbation on the second training sample set according to a random initialization vector to obtain an adversarial sample set; inputting the adversarial sample set into the text classification model to obtain a fifth output value of the text classification model; calculating a KL divergence value according to the third output value and the fifth output value of the text classification model; and determining a perturbation value for virtual adversarial perturbation of the second training sample set according to the KL divergence value, and injecting the perturbation value into the second training sample set to obtain the virtual adversarial sample set.
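For illustration only, and not as part of the claims: a sketch of this virtual adversarial perturbation, assuming PyTorch; the scale xi of the random initialization vector and the coefficient epsilon are assumed hyperparameters:

```python
import torch
import torch.nn.functional as F

def virtual_adversarial_sample(model, v, xi=1e-6, epsilon=1.0):
    # Third output value: prediction on the clean unlabeled samples.
    with torch.no_grad():
        p = F.softmax(model(v), dim=-1)

    # Random initialization vector, normalized and scaled by xi.
    d = torch.randn_like(v)
    d = xi * d / d.norm(p=2)
    d.requires_grad_()

    # Fifth output value: prediction on the randomly perturbed samples.
    q_log = F.log_softmax(model(v + d), dim=-1)

    # KL divergence between the two predictions, differentiated w.r.t. d.
    kl = F.kl_div(q_log, p, reduction='batchmean')
    kl.backward()

    # Perturbation value derived from the KL divergence gradient.
    g = d.grad
    r_vadv = epsilon * g / (g.norm(p=2) + 1e-12)
    return (v + r_vadv).detach()  # virtual adversarial sample
```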
7. The method according to claim 1, wherein the processing the text to be detected based on the trained text classification model to determine the category of the text to be detected comprises:
preprocessing the text to be detected, and constructing a feature vector to be detected according to the preprocessed text; and inputting the feature vector to be detected into the trained text classification model to determine the category of the text to be detected.
8. An apparatus for classifying text, the apparatus comprising:
the building module is used for building a first training sample set based on the text data carrying the category labels and building a second training sample set based on the text data not carrying the category labels;
the determining module is used for carrying out supervised learning on the first training sample set and determining the value of a loss function under the supervised learning; carrying out unsupervised learning on the second training sample set, and determining the value of a loss function under the unsupervised learning; determining a value of a mixed loss function according to the value of the loss function under supervised learning and the value of the loss function under unsupervised learning;
the updating module is used for updating the parameters of the text classification model according to the value of the mixed loss function, and is further used for taking the updated text classification model as the trained text classification model under the condition that the updated text classification model meets the iterative training stopping condition;
and the classification module is used for processing the text to be detected based on the trained text classification model so as to determine the category of the text to be detected.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method of any one of claims 1 to 7.
CN202010468798.8A 2020-05-28 2020-05-28 Text classification method and device Pending CN111522958A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010468798.8A CN111522958A (en) 2020-05-28 2020-05-28 Text classification method and device


Publications (1)

Publication Number Publication Date
CN111522958A true CN111522958A (en) 2020-08-11

Family

ID=71912794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010468798.8A Pending CN111522958A (en) 2020-05-28 2020-05-28 Text classification method and device

Country Status (1)

Country Link
CN (1) CN111522958A (en)



Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509596A (en) * 2018-04-02 2018-09-07 广州市申迪计算机***有限公司 File classification method, device, computer equipment and storage medium
CN110097103A (en) * 2019-04-22 2019-08-06 西安电子科技大学 Based on the semi-supervision image classification method for generating confrontation network
CN110532377A (en) * 2019-05-13 2019-12-03 南京大学 A kind of semi-supervised file classification method based on dual training and confrontation learning network
CN110442840A (en) * 2019-07-11 2019-11-12 新华三大数据技术有限公司 Sequence labelling network update method, electronic health record processing method and relevant apparatus
CN110472229A (en) * 2019-07-11 2019-11-19 新华三大数据技术有限公司 Sequence labelling model training method, electronic health record processing method and relevant apparatus
CN110399492A (en) * 2019-07-22 2019-11-01 阿里巴巴集团控股有限公司 The training method and device of disaggregated model aiming at the problem that user's question sentence
CN110472533A (en) * 2019-07-31 2019-11-19 北京理工大学 A kind of face identification method based on semi-supervised training
CN110704633A (en) * 2019-09-04 2020-01-17 平安科技(深圳)有限公司 Named entity recognition method and device, computer equipment and storage medium
CN110689086A (en) * 2019-10-08 2020-01-14 郑州轻工业学院 Semi-supervised high-resolution remote sensing image scene classification method based on generating countermeasure network
CN111046183A (en) * 2019-12-11 2020-04-21 金蝶软件(中国)有限公司 Method and device for constructing neural network model for text classification

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021139342A1 (en) * 2020-07-27 2021-07-15 平安科技(深圳)有限公司 Training method and apparatus for ocr recognition model, and computer device
WO2021139279A1 (en) * 2020-07-30 2021-07-15 平安科技(深圳)有限公司 Data processing method and apparatus based on classification model, and electronic device and medium
CN111950268A (en) * 2020-08-17 2020-11-17 珠海格力电器股份有限公司 Method, device and storage medium for detecting junk information
CN111898707A (en) * 2020-08-24 2020-11-06 鼎富智能科技有限公司 Model training method, text classification method, electronic device and storage medium
CN112070138A (en) * 2020-08-31 2020-12-11 新华智云科技有限公司 Multi-label mixed classification model construction method, news classification method and system
CN112070138B (en) * 2020-08-31 2023-09-05 新华智云科技有限公司 Construction method of multi-label mixed classification model, news classification method and system
CN112328785A (en) * 2020-10-09 2021-02-05 福建亿榕信息技术有限公司 Method for classifying texts in power field and storage device
CN112232426A (en) * 2020-10-21 2021-01-15 平安国际智慧城市科技股份有限公司 Training method, device and equipment of target detection model and readable storage medium
CN112232426B (en) * 2020-10-21 2024-04-02 深圳赛安特技术服务有限公司 Training method, device and equipment of target detection model and readable storage medium
CN112287089A (en) * 2020-11-23 2021-01-29 腾讯科技(深圳)有限公司 Classification model training and automatic question-answering method and device for automatic question-answering system
CN112347261A (en) * 2020-12-07 2021-02-09 携程计算机技术(上海)有限公司 Classification model training method, system, equipment and storage medium
CN112765319A (en) * 2021-01-20 2021-05-07 中国电子信息产业集团有限公司第六研究所 Text processing method and device, electronic equipment and storage medium
CN112883193A (en) * 2021-02-25 2021-06-01 中国平安人寿保险股份有限公司 Training method, device and equipment of text classification model and readable medium
CN113269228A (en) * 2021-04-20 2021-08-17 重庆邮电大学 Method, device and system for training graph network classification model and electronic equipment
CN113269228B (en) * 2021-04-20 2022-06-10 重庆邮电大学 Method, device and system for training graph network classification model and electronic equipment
CN113177119B (en) * 2021-05-07 2024-02-02 北京沃东天骏信息技术有限公司 Text classification model training and classifying method and system and data processing system
CN113177119A (en) * 2021-05-07 2021-07-27 北京沃东天骏信息技术有限公司 Text classification model training and classifying method and system and data processing system
WO2022242459A1 (en) * 2021-05-17 2022-11-24 腾讯科技(深圳)有限公司 Data classification and identification method and apparatus, and device, medium and program product
WO2023024920A1 (en) * 2021-08-24 2023-03-02 华为云计算技术有限公司 Model training method and system, cluster, and medium
CN113806536A (en) * 2021-09-14 2021-12-17 广州华多网络科技有限公司 Text classification method and device, equipment, medium and product thereof
CN113806536B (en) * 2021-09-14 2024-04-16 广州华多网络科技有限公司 Text classification method and device, equipment, medium and product thereof
CN113704479B (en) * 2021-10-26 2022-02-18 深圳市北科瑞声科技股份有限公司 Unsupervised text classification method and device, electronic equipment and storage medium
CN113704479A (en) * 2021-10-26 2021-11-26 深圳市北科瑞声科技股份有限公司 Unsupervised text classification method and device, electronic equipment and storage medium
CN114117048A (en) * 2021-11-29 2022-03-01 平安银行股份有限公司 Text classification method and device, computer equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination