CN113177119A - Text classification model training and classifying method and system and data processing system - Google Patents
- Publication number
- CN113177119A CN113177119A CN202110494682.6A CN202110494682A CN113177119A CN 113177119 A CN113177119 A CN 113177119A CN 202110494682 A CN202110494682 A CN 202110494682A CN 113177119 A CN113177119 A CN 113177119A
- Authority
- CN
- China
- Prior art keywords
- text
- sample set
- sample
- labeled
- mixed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The disclosure provides a text classification model training and classifying method and system and a data processing system, and relates to the technical field of data processing. The text classification model training method comprises the following steps: in each round of training, determining a classification estimation value of each sample in an unlabeled sample set based on a text classification model to be trained, and acquiring an estimated labeled sample set; processing the labeled sample set and estimating a labeled sample set through an encoder to be trained; acquiring a mixed sample set according to the processed labeled sample set, the estimated labeled sample set and the estimated mixing coefficient; inputting the mixed labeled sample set and the mixed sample set into a feedforward neural network, and adjusting parameters of a text classification model to be trained according to a loss function; and when the number of training rounds reaches the preset number, acquiring a text classification model. By the method, the manual labeling requirement can be reduced, and the model training efficiency is improved.
Description
Technical Field
The disclosure relates to the technical field of data processing, in particular to a text classification model training and classification method and system and a data processing system.
Background
User reviews are a basic function of many internet websites. From the content of users' comments, feedback can be conveniently collected and adjustments made accordingly.
Because the volume of user comments is huge, manual screening is difficult to sustain, so screening efficiency needs to be improved through automatic machine recognition. During recognition, comment content is generally classified as good, bad, or medium.
In the related art, the category of the comment can be identified by setting a predetermined rule, and for example, a comment containing a word such as "spam", "bad", or the like is determined as a bad comment.
In addition, a machine learning or deep learning algorithm may be introduced, such as an RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory) or BiLSTM (Bidirectional Long Short-Term Memory); after a large number of samples are input to the model and it is trained by means of supervised learning, it can generally recognize good or bad comments. Alternatively, a pre-trained language model (e.g., Word2Vec (Word to Vector), BERT (Bidirectional Encoder Representations from Transformers) or GPT (Generative Pre-trained Transformer)) may be introduced to bring in the semantics of a large-scale corpus, after which a large number of samples are input for fine tuning, obtaining a final model that can identify good/bad comments.
Disclosure of Invention
One purpose of the present disclosure is to reduce the amount of data required in the text classification model training process on the basis of ensuring the accuracy of text classification.
According to an aspect of some embodiments of the present disclosure, a text classification model training method is provided, including, in each round of training:
determining a classification estimation value of each sample in an unlabeled sample set based on a text classification model to be trained, and acquiring an estimated labeled sample set; obtaining, through an encoder to be trained, vectors of the texts of the labeled sample set and of the estimated labeled sample set; acquiring a mixed labeled sample set according to the vector of the text of each labeled sample in the labeled sample set and the category identification of the labeled sample; acquiring a mixed sample set according to the mixed labeled sample set, the vectors of the texts and the classification estimation values of the samples of the estimated labeled sample set, and the estimated mixing coefficients; inputting the mixed labeled sample set and the mixed sample set into a feedforward neural network, and adjusting parameters of the encoder to be trained, the text classification model to be trained and the feedforward neural network according to a loss value obtained based on a loss function; and when the number of training rounds reaches a predetermined number, acquiring the text classification model.
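The per-round flow above can be sketched as follows. This is a minimal, illustrative Python sketch, not the patented implementation: `classify` and `encode` are stand-ins for the text classification model and encoder to be trained, and all names are hypothetical.

```python
import random

def train_round(labeled, unlabeled, classify, encode, lam):
    """One training round: pseudo-label, encode, then mix.

    lam plays the role of the first estimated mixing coefficient
    (the weight of the estimated labeled side); the two mixing
    coefficients sum to 1."""
    # 1. Determine a classification estimation value for each unlabeled sample.
    estimated = [(u, classify(u)) for u in unlabeled]
    # 2. Obtain vectors of the texts of both sample sets via the encoder.
    enc_labeled = [(encode(s), y) for s, y in labeled]
    enc_estimated = [(encode(u), y) for u, y in estimated]
    # 3. Mix each encoded estimated sample with a randomly drawn labeled sample:
    #    weighted sums of both the text vectors and the category identifications.
    mixed = []
    for x_u, y_u in enc_estimated:
        x_s, y_s = random.choice(enc_labeled)
        x = [lam * a + (1 - lam) * b for a, b in zip(x_u, x_s)]
        y = lam * y_u + (1 - lam) * y_s
        mixed.append((x, y))
    # The labeled set and mixed set would then be fed to the feedforward
    # network, and the loss used to adjust all trainable parameters.
    return enc_labeled, mixed
```

A caller would repeat this for a predetermined number of rounds, re-estimating pseudo-labels with the partially trained model each time.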
In some embodiments, the text classification model training method further comprises: before the sample set is input into the encoder to be trained, the labeled sample set is expanded according to the labeled samples until the sample amount of the labeled sample set is equal to that of the unlabeled sample set.
In some embodiments, each labeled sample in the set of labeled samples includes text of an original sample, text of an enhanced sample of an original sample, and a category identification of the original sample.
In some embodiments, the enhanced samples of the original sample comprise at least one of a first enhanced sample or a second enhanced sample; the text of the first enhanced sample is generated by carrying out synonym replacement on the text of the original sample; the text of the second enhanced sample is generated after the text of the original sample is translated into the second language and then translated back into the original language.
In some embodiments, the text classification model training method further comprises: and generating a labeled sample set in advance according to the original sample of the labeled category.
In some embodiments, obtaining, by an encoder to be trained, a vector of text of a labeled sample set, and estimating the vector of text of the labeled sample set comprises: inputting texts of samples in the labeled sample set and the estimated labeled sample set into an encoder to be trained in batches by taking the size of a preset batch as a unit, and acquiring vectors of the texts of the labeled sample set and the estimated labeled sample set of each batch; obtaining the mixed annotated sample set and obtaining the mixed sample set comprises: acquiring a mixed labeled sample set of each batch and a mixed sample set of a corresponding batch; the parameters for adjusting the encoder to be trained, the text classification model to be trained, and the feed-forward neural network include: and respectively inputting the mixed labeled sample set and the mixed sample set of each batch into a feedforward neural network, and adjusting parameters of an encoder to be trained, a text classification model to be trained and the feedforward neural network according to a loss value obtained based on a loss function until the mixed labeled sample set and the mixed sample set of all batches in the current training round are processed.
In some embodiments, the text classification model training method further comprises: after the marked sample set is expanded, according to a preset batch size, sequentially extracting a text of an original sample, a text of an enhanced sample and a text of an estimated marked sample in the expanded marked sample set and the estimated marked sample set respectively, wherein the marked sample comprises the text of the original sample and the text of the enhanced sample of an original sample; generating a text vector to be coded according to the text of the original sample, the text of the enhanced sample and the text of the estimated marked sample, wherein the text vector to be coded comprises original sample dimensionality, enhanced sample dimensionality and estimated marked sample dimensionality, and the number of sample texts in each dimensionality accords with a preset batch size; cutting a text in a text vector to be coded according to a preset text length upper limit; obtaining the vector of the text of the labeled sample set of each batch and estimating the vector of the text of the labeled sample set comprises: inputting the cut text vectors to be coded into a coder to be trained, and acquiring the text coding vectors of the current batch; extracting elements of original sample dimensionality and enhanced sample dimensionality in the text coding vector to obtain a vector of a labeled sample set; and extracting elements of the dimension of the estimation labeling sample in the text coding vector to obtain the vector of the estimation labeling sample set.
In some embodiments, obtaining the mixed labeled sample set according to the vector of the text of each labeled sample in the labeled sample set and the category identifier of the labeled sample includes: acquiring mixed marked sample codes according to the codes corresponding to the original sample and the enhanced sample in the vector of the text of each marked sample of the marked sample set and the enhanced mixed coefficient; obtaining a vector of the mixed labeled sample set according to the mixed labeled sample code and the code corresponding to the original sample; and acquiring the mixed labeled sample set according to the vector of the mixed labeled sample set and the corresponding category identification of the original sample.
In some embodiments, obtaining the mixed labeled sample encoding comprises: taking the enhanced mixing coefficient as the weight of the encoding corresponding to the enhanced sample, and adding the weighted encoding to the encoding corresponding to the original sample to obtain the mixed labeled sample encoding.
In some embodiments, obtaining the mixed sample set from the mixed labeled sample set, the vectors of the texts and the classification estimation values of the samples of the estimated labeled sample set, and the estimated mixing coefficients comprises: acquiring an encoded estimated labeled sample set according to the vectors of the texts of the estimated labeled sample set and the category identifications of the estimated labeled samples; for each sample in the encoded estimated labeled sample set: randomly extracting one sample from the mixed labeled sample set; and, taking the first estimated mixing coefficient as the weight of the sample in the encoded estimated labeled sample set and the second estimated mixing coefficient as the weight of the extracted sample from the mixed labeled sample set, respectively computing the weighted sums of the vectors of the texts and of the category identifications, acquiring the samples of the mixed sample set, wherein the first estimated mixing coefficient and the second estimated mixing coefficient sum to 1.
In some embodiments, the text classification model training method further comprises: after each round of training is completed, the first estimated mixing coefficient is increased by a predetermined ratio.
In some embodiments, inputting the mixed labeled sample set and the mixed sample set into a feedforward neural network, and adjusting parameters of an encoder to be trained, a text classification model to be trained, and the feedforward neural network according to a loss value obtained based on a loss function includes: inputting the mixed labeled sample set and the mixed sample set into a feedforward neural network, and outputting a processing result through a full connection layer; and inputting the processing result into a loss function to obtain a loss value.
In some embodiments, inputting the processing result into a loss function, and obtaining the loss value comprises: acquiring cross entropy loss as a first loss value according to a processing result based on the mixed labeled sample set; acquiring a mean square error loss as a second loss value according to a processing result based on the mixed sample set; and acquiring a weighted value of the first loss value and the second loss value according to the preset loss value weight to serve as the loss value.
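The weighted combination of the two loss values can be sketched as below. This is an illustrative NumPy sketch under stated assumptions: the weight names `w_ce`/`w_mse` and the one-hot label representation are choices made here, not terms from the patent.

```python
import numpy as np

def combined_loss(logits_labeled, onehot_labeled, pred_mixed, target_mixed,
                  w_ce=1.0, w_mse=1.0):
    # First loss value: cross entropy over the mixed labeled sample set.
    z = logits_labeled - logits_labeled.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    ce = -np.mean(np.sum(onehot_labeled * np.log(p + 1e-12), axis=1))
    # Second loss value: mean square error over the mixed sample set.
    mse = np.mean((pred_mixed - target_mixed) ** 2)
    # Weighted value of the two losses, per the preset loss value weights.
    return w_ce * ce + w_mse * mse
```

In a real training loop these terms would be computed inside the framework's autodiff graph so the loss value can drive parameter adjustment.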
In some embodiments, the category identification comprises an emotion category identification.
According to an aspect of some embodiments of the present disclosure, there is provided a text classification method, including: inputting a text to be classified into a text classification model, wherein the text classification model is generated by training according to any one of the text classification model training methods mentioned above; and taking the classification estimation value output by the text classification model as the class of the text to be classified.
According to an aspect of some embodiments of the present disclosure, there is provided a text classification model training system, including: the estimated sample set obtaining unit is configured to determine a classification estimation value of each sample in an unlabeled sample set based on a text classification model to be trained in each round of training, and obtain an estimated labeled sample set; a vector acquisition unit configured to acquire, by an encoder to be trained, a vector of the text of the labeled sample set, and to estimate the vector of the text of the labeled sample set; the mixing unit is configured to obtain a mixed labeled sample set according to the vector of the text of each labeled sample in the labeled sample set and the category identification of the labeled sample; acquiring a mixed sample set according to the mixed labeled sample set, the vector and the classification estimation value of the text of the sample of the estimated labeled sample set and the estimated mixing coefficient; the parameter adjusting unit is configured to input the mixed labeled sample set and the mixed sample set into a feedforward neural network, and adjust parameters of an encoder to be trained, a text classification model to be trained and the feedforward neural network according to a loss value obtained based on a loss function; a model obtaining unit configured to obtain the text classification model when the number of training rounds reaches a predetermined number of times.
According to an aspect of some embodiments of the present disclosure, there is provided a text classification system including: the text input unit is configured to input texts to be classified into a text classification model, wherein the text classification model is generated by training according to any one of the text classification model training methods; and the class determination unit is configured to take the classification estimation value output by the text classification model as the class of the text to be classified.
According to an aspect of some embodiments of the present disclosure, there is provided a data processing system, comprising: a memory; and a processor coupled to the memory, the processor configured to perform any of the methods mentioned above based on instructions stored in the memory.
According to an aspect of some embodiments of the present disclosure, a computer-readable storage medium is proposed, on which computer program instructions are stored, which instructions, when executed by a processor, implement the steps of any one of the methods mentioned above.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and not to limit the disclosure. In the drawings:
fig. 1 is a flow diagram of some embodiments of a text classification model training method of the present disclosure.
FIG. 2 is a flow diagram of some embodiments of a single batch per round of data processing in a text classification model training method of the present disclosure.
FIG. 3 is a flow diagram of some embodiments of parameter adjustment in a text classification model training method of the present disclosure.
Fig. 4 is a flow diagram of some embodiments of a text classification method of the present disclosure.
FIG. 5 is a schematic diagram of some embodiments of a text classification model training system of the present disclosure.
Fig. 6 is a schematic diagram of some embodiments of a text classification system of the present disclosure.
FIG. 7 is a schematic diagram of some embodiments of data processing systems of the present disclosure.
FIG. 8 is a schematic diagram of further embodiments of data processing systems according to the present disclosure.
Detailed Description
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
A flow diagram of some embodiments of a text classification model training method of the present disclosure is shown in fig. 1.
In step 101, in the current round of training, based on the text classification model to be trained, the classification estimation value y_u of each sample u in the unlabeled sample set is determined, and an estimated labeled sample set is acquired, wherein each sample in the estimated labeled sample set is (u, y_u). The number of samples in the unlabeled sample set, that is, samples whose category is unknown, may be much larger than the number of samples in the labeled sample set.
The category can be an emotion category, and the identification comprises an emotion category identification, for example, the bad score is 0, the medium score is 1, and the good score is 2; or a good score of 2, a medium score of 1, a poor score of 0, etc. The classification estimation value is a class identifier corresponding to a sample estimated by the text classification model to be trained, and for example, the estimation value is any one of 0,1 and 2.
In some embodiments, the text classification model to be trained is a machine learning model, and a public Chinese emotion analysis data set can be used as a transfer learning training sample in advance for training, for example, data in public Chinese microblog emotion analysis is used for training, so that universal basic training is realized, the number of rounds required by subsequent training is reduced, and the training efficiency is improved.
In step 102, a vector of the text of the labeled sample set is obtained by the encoder to be trained, and the vector of the text of the labeled sample set is estimated. In some embodiments, the labeled sample set may be generated by manually labeling the extracted original sample.
In some embodiments, the labeled sample set may be a sample set generated after performing an enhancement operation on the labeled original sample. In some embodiments, the enhancement operation may include synonym substitution. In some embodiments, the enhancement operation may include translating the text of the sample into a second language, such as English, and then back into the original language.
In some embodiments, each labeled sample in the set of labeled samples includes the text of an original sample, the text of an enhanced sample of the original sample, and the category identification of the original sample. For example, if the original sample has text s, the enhanced sample has text s_1, and the category identification of the original sample is y_s, then the samples in the labeled sample set are (s, s_1, y_s). In some embodiments, each labeled sample may include the texts of two enhanced samples, generated by different enhancement operations; for example, the sample in the labeled sample set is (s, s_1, s_2, y_s).
In some embodiments, the labeled sample set may be randomly shuffled, for example using a shuffle function provided by TensorFlow, so as to increase the randomness of the order of the samples in the labeled sample set and reduce training bias.
In step 103, a mixed labeled sample set is obtained according to the vector of the text of each labeled sample in the labeled sample set and the category identifier of each labeled sample.
In some embodiments, the set of mixed labeled samples may be a set that includes a vector of text of the original sample and a category identification of the original sample, and a vector of text of the enhanced sample and a category identification of the enhanced sample.
In some embodiments, the text vectors in the labeled sample set may be blended with each other to generate text that blends the samples in the labeled sample set. In some embodiments, the blending of the set of labeled samples may include a text vector of an original sample in the labeled sample, and a weighted sum of the text vector of the original sample and a text vector of an enhanced sample corresponding to the original sample, where the enhanced blending coefficient is a weight of the text vector of the enhanced sample.
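The enhanced mixing described above (the enhanced mixing coefficient weighting the enhanced sample's vector, added to the original sample's vector) can be sketched minimally; the function name is illustrative:

```python
def mix_labeled(orig_vec, enh_vec, alpha):
    # alpha is the enhanced mixing coefficient: it weights the enhanced
    # sample's text vector, which is added to the original sample's vector
    return [o + alpha * e for o, e in zip(orig_vec, enh_vec)]
```

The resulting vector keeps the original sample's category identification when it is placed into the mixed labeled sample set.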
In step 104, a mixed sample set is obtained according to the mixed labeled sample set, the vectors of the texts of the estimated labeled sample set together with their classification estimation values, and the estimated mixing coefficients. In some embodiments, the estimated mixing coefficients may include a first estimated mixing coefficient and a second estimated mixing coefficient, used as the weights of the samples in the estimated labeled sample set and in the mixed labeled sample set, respectively; the weighted sums of the text vectors and of the category identifications are computed to form the samples of the mixed sample set. In some embodiments, samples may be randomly drawn from the mixed labeled sample set and mixed with the vectors of the texts and the category identifications of the estimated labeled sample set.
In step 105, the mixed labeled sample set and the mixed sample set are input into a feedforward neural network, and parameters of an encoder to be trained, a text classification model to be trained, and the feedforward neural network are adjusted according to a loss value obtained based on a loss function.
In step 106, it is determined whether the number of training rounds has reached the predetermined number. In some embodiments, the number of training rounds may be set to N (a preset positive integer), with the current round identified as i. When i is less than or equal to N, i is incremented by 1, a new batch of samples is obtained, and step 101 is executed again. If i > N, step 107 is performed.
In step 107, the training of the text classification model to be trained is completed, and a trained text classification model is obtained. In some embodiments, in the 1st round of training, the text classification model to be trained is denoted model M_0; in the i-th round, the text classification model to be trained is denoted model M_{i-1}; at the completion of training, the model is M_N. M_N is the required text classification model.
By the above method, cyclic training of the text classification model can be realized through the mixed analysis of a small number of labeled samples and unlabeled samples. On the basis of ensuring the accuracy of the trained text classification model, the number of labeled samples required during training is reduced, which lowers the manual labeling requirement and improves model training efficiency; when labeled samples are scarce, the accuracy of the text classification model is improved.
In some embodiments, before starting the current round of training as shown in step 101, a labeled sample set S' including the text of the original sample, the text of the enhanced sample and the category identification may be generated in advance from a sample set S containing only original samples and their category identifications.
In some embodiments, synonym substitution is performed on the original text. For example, "in use at present; large capacity; information can be downloaded and stored at will; charges quickly" may be replaced with "in use at present; large capacity; data can be downloaded and stored at will; charges quickly". In some embodiments, synonym replacement may be performed on the original text using a published Chinese synonym dictionary, such as the HIT "Tongyici Cilin" synonym forest, resulting in the text of an enhanced sample.
In some embodiments, the original text may be translated twice, from the original language to the second language, and then translated back to the original language, with the translated back text being used as the text of the enhanced sample. For example, translate Chinese to English and then back to Chinese. For example, "things are good, cost performance is high, express delivery is fast" is translated into "It's good, cost-effective and fast delivery", and then translated back into Chinese "good, cost performance is high, delivery is fast". In some embodiments, any machine translation engine may be employed, for example using an Apertium translation engine.
In some embodiments, the sample enhancement operation may be performed in both of the above ways at the same time, separately generating the text s_1 of a first enhanced sample and the text s_2 of a second enhanced sample, and generating the samples (s, s_1, s_2, y_s) in the labeled sample set S'.
In this way, the amount of labeled samples can be expanded through low-cost sample enhancement operations, reducing the demand for manual category labeling and improving model training efficiency.
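The generation of a labeled sample (s, s_1, s_2, y_s) by the two enhancement operations can be sketched as follows. This is an illustrative sketch: the toy synonym table and the translation callables are assumptions standing in for a published synonym dictionary and a real machine translation engine.

```python
SYNONYMS = {"good": "nice", "fast": "quick"}  # toy lexicon; a published
# synonym dictionary would be used in practice

def synonym_augment(text):
    # first enhanced sample: word-level synonym replacement
    return " ".join(SYNONYMS.get(w, w) for w in text.split())

def back_translate(text, to_second_lang, back_to_original):
    # second enhanced sample: round-trip through a second language;
    # the two callables stand in for any machine translation engine
    return back_to_original(to_second_lang(text))

def make_labeled_sample(s, y_s, to_en=lambda t: t, back=lambda t: t):
    # produces (s, s1, s2, y_s) as in the labeled sample set S'
    return (s, synonym_augment(s), back_translate(s, to_en, back), y_s)
```

With identity stand-ins for translation, the second enhanced sample equals the original; a real engine would introduce paraphrase variation.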
In some embodiments, between the above steps 101 and 102, the labeled sample set may be expanded according to the labeled sample until the amount of samples in the labeled sample set is equal to the amount of samples in the unlabeled sample set. In some embodiments, the expansion method may be to sequentially take samples from the labeled sample set and append the samples to the tail of the labeled sample set until the amount of samples in the labeled sample set is equal to the amount of samples in the unlabeled sample set.
By the method, the utilization rate of limited marked samples can be improved, and the convenience and reliability of subsequent data processing can be improved.
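The tail-appending expansion described above can be sketched in one line of stdlib Python (the function name is illustrative):

```python
from itertools import cycle, islice

def expand_labeled(labeled, unlabeled_size):
    # take samples from the labeled set in order, appending them to the
    # tail until its size equals the unlabeled set's
    return list(islice(cycle(labeled), unlabeled_size))
```

`cycle` repeats the labeled samples in their original order, and `islice` truncates the stream at the target size.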
In some embodiments, samples may be batched into the encoder to be trained in each pass, reducing the data processing burden on the encoder. A flow diagram of some embodiments of a single batch data processing per round in a text classification model training method of the present disclosure is shown in fig. 2.
In step 201, according to the predetermined batch size b, the texts of original samples, the texts of enhanced samples, and the texts of estimated labeled samples are sequentially extracted from the expanded labeled sample set and the estimated labeled sample set, respectively. The batch size refers to the number of samples in a single batch. In some embodiments, sample extraction may be performed using batch sampling (Batch Sample). In some embodiments, before the texts are extracted in batches, the labeled sample set is expanded according to the labeled samples until its sample amount equals that of the unlabeled sample set, which ensures that extraction from the two sets starts and ends synchronously, reducing the probability of data processing failure. Suppose the batch of samples taken from the labeled sample set is (s̄, Y_S), and the batch taken from U' is (ū, Y_U), where s̄ and ū are each a batch of sample texts, i.e., each comprises b texts; Y_S and Y_U are respectively the b labeled and b estimated category identifications (0/1/2).
In step 202, the extracted text s of the original sample, the texts of the enhanced samples (taking texts s1 and s2 as an example where a first enhanced sample and a second enhanced sample exist) and the text u of the estimated labeled sample are used to generate the text vector T to be encoded. In some embodiments, the number of texts of each kind conforms to the predetermined batch size b (in some embodiments, b may be 16). The text vector of the original samples, the text vectors of the enhanced samples and the text vector of the estimated labeled samples are spliced (for example, using the concat function of TensorFlow to merge them by increasing the vector dimension), generating the text vector T to be encoded of the current batch, whose length is 4b.
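The splicing in step 202 can be sketched with plain Python lists standing in for TensorFlow tensors; the function name and list representation are illustrative assumptions.

```python
def build_batch_texts(s, s1, s2, u):
    """Splice the four text groups of one batch, each of size b, into a
    single sequence T of 4*b texts: originals, both enhancements, then
    estimated labeled texts."""
    b = len(s)
    if not (len(s1) == len(s2) == len(u) == b):
        raise ValueError("all four groups must share the batch size b")
    return s + s1 + s2 + u
```

With b = 2 the result contains 4 × 2 = 8 texts in a fixed group order.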
In step 203, the text vector T to be encoded of the current batch is input into the encoder to be trained, obtaining the vectors of the texts of the labeled sample set, including the vector X of the original sample texts and the vectors of the enhanced sample texts, such as X1 and X2; the vector X_U of the texts of the estimated labeled sample set is also obtained. In some embodiments, the encoder to be trained may be a BERT encoder to be trained.
In some embodiments, after the text vector T to be encoded passes through the encoder to be trained, the encoding results of the original sample texts, the enhanced sample texts and the estimated labeled sample texts form the vectors (X, X1, X2, X_U), which can be split by dimension into the original sample dimension X, the enhanced sample dimensions X1 and X2, and the estimated labeled sample dimension X_U.
In some embodiments, before the text vector T to be encoded is input into the encoder to be trained, preprocessing may also be performed, such as clipping the texts in the text vector to be encoded according to a predetermined upper limit on text length. Taking the 4×b texts in T as an example, clipping discards the part of each text whose length exceeds L (assuming L = 512); T is then input to the encoder as a small batch of 4b = 64 samples. For each sample text, the encoder outputs an encoding result. In some embodiments, the encoder is arranged to encode each text as a d-dimensional vector, for example d = 768.
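The length clipping can be sketched as below. Clipping is done at the character level here purely for illustration (a real BERT pipeline would clip at the token level); L = 512 follows the example above.

```python
def clip_texts(texts, max_len=512):
    """Discard the part of each text whose length exceeds the cap L."""
    return [t[:max_len] for t in texts]
```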
In step 204, a mixed labeled sample set and a mixed sample set of the current batch are obtained.
In some embodiments, the mixed labeled sample set may be obtained according to the vector of the text of each labeled sample in the labeled sample set, the enhanced mixing coefficient, and the category identifier of each sample. In some embodiments, the mixed text vectors of the labeled sample set may include the text vector of the original sample itself, together with weighted sums of the text vector of the original sample and the text vectors of its corresponding enhanced samples, where the enhanced mixing coefficient serves as the weight of the enhanced sample text vector.
In some embodiments, based on the vector X corresponding to the original samples, the vectors X1 and X2 corresponding to the enhanced samples, and the enhanced mixing coefficient μ, the mixed labeled sample vectors X1′ and X2′ may be obtained; for example, the enhanced sample vector, weighted by the enhanced mixing coefficient, is added to the vector corresponding to the original sample, that is:

X1′ = X + μX1;

X2′ = X + μX2;

Further, the vectors (X, X1′, X2′) of the mixed labeled sample set are obtained from the mixed labeled sample vectors and the vector corresponding to the original samples; combining these with the category identifiers Y_S of the corresponding original samples yields the mixed labeled sample set ((X, Y_S), (X1′, Y_S), (X2′, Y_S)). After this operation, the mixed labeled sample set contains 3b samples and is abbreviated (X_S′, Y_S), with individual samples written (x_s′, y_s). In some embodiments, for each labeled sample in the labeled sample set, the mixed labeled sample codes x1′ and x2′ are obtained from the code x corresponding to the original sample, the codes x1 and x2 corresponding to the enhanced samples, and the enhanced mixing coefficient μ, i.e. x1′ = x + μx1 and x2′ = x + μx2; in the resulting mixed labeled sample set, three samples correspond to the same original sample, namely (x, y_s), (x1′, y_s) and (x2′, y_s).
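The construction of the mixed labeled set, X1′ = X + μX1 and X2′ = X + μX2, can be sketched with NumPy as follows; the default μ = 0.5 follows the example below, and the function name and array shapes are our own illustration.

```python
import numpy as np

def mix_labeled(X, X1, X2, Y, mu=0.5):
    """Build the mixed labeled set of 3b samples: the original vectors X
    plus the mixes X + mu*X1 and X + mu*X2, each paired with the
    original category identifiers Y."""
    Xs = np.concatenate([X, X + mu * X1, X + mu * X2])
    Ys = np.concatenate([Y, Y, Y])
    return Xs, Ys
```

With b = 2 input rows, the mixed set holds 3b = 6 rows sharing the labels of their originals.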
The inventor finds that in the sample enhancement process, the reasonable classification of the enhanced sample text stays consistent with that of the original sample in most cases, but in individual cases the enhanced text can be semantically distorted or even have its meaning inverted. For example, a sarcastic charger review along the lines of "too amazing, it charged for ages" may, after enhancement, become simply "It's amazing.". Such enhanced samples can adversely affect subsequent analysis.
Through the method of the above embodiments, the semantic distortion or even inversion that may occur during sample enhancement can be mitigated. The parameter μ is a tunable hyperparameter: the larger μ is, the stronger the interference of the enhanced samples on the model effect; the smaller μ is, the weaker the effect of sample enhancement. In some embodiments, μ may be set to 0.5, so as to reduce the negative impact of possible semantic distortion while still expanding the sample set with enhanced samples, improving the accuracy of the trained model.
In some embodiments, on the basis of obtaining the mixed labeled sample set by any one of the above manners, the samples in the mixed labeled sample set may be further mixed with the vector of the text of the estimated labeled sample set and the category identification of the estimated labeled sample thereof to obtain a mixed sample set.
In some embodiments, based on the vector X_U of the texts of the estimated labeled sample set and the class identifiers Y_U of the estimated labeled samples, an encoded estimated labeled sample set (X_U, Y_U) containing b samples is obtained. For each sample (x_U, y_U) in the vectors of the encoded estimated labeled sample set, one sample (x_s′, y_s) is randomly extracted from the mixed labeled sample set (X_S′, Y_S), which contains 3b samples. Taking the first estimated mixing coefficient λ as the weight of the sample from the encoded estimated labeled sample set and the second estimated mixing coefficient (1 − λ) as the weight of the extracted mixed labeled sample, the weighted sums of the text vectors and of the class identifiers are computed, that is, according to the formulas:

x_U′ = λx_U + (1 − λ)x_s′

y_U′ = λy_U + (1 − λ)y_s

a sample (x_U′, y_U′) of the mixed sample set is obtained. Mixing each sample in (X_U, Y_U) in this way yields the mixed sample set (X_U′, Y_U′).
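The per-sample mixing formulas above can be sketched as follows; the random pairing with the mixed labeled set mirrors the description, while the function name and NumPy interface are illustrative assumptions.

```python
import numpy as np

def mix_estimated(x_u, y_u, Xs, Ys, lam, rng):
    """Mix one estimated labeled sample with a randomly drawn sample
    from the mixed labeled set:
    x' = lam*x_u + (1-lam)*x_s,  y' = lam*y_u + (1-lam)*y_s."""
    i = rng.integers(len(Xs))
    x_mix = lam * x_u + (1 - lam) * Xs[i]
    y_mix = lam * y_u + (1 - lam) * Ys[i]
    return x_mix, y_mix
```

With λ = 0.5 and an all-zero labeled set, the mix lands exactly halfway toward zero regardless of which labeled sample is drawn.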
In some embodiments, the first estimated mixing coefficient λ is a tunable hyperparameter: the larger λ is, the closer the mixed samples are to the estimated labeled samples; the smaller λ is, the closer the mixed samples are to the labeled samples. In some embodiments, λ may be set manually based on experience.
In other embodiments, λ may be a dynamic value, with the first estimated mixing coefficient increased by a predetermined ratio after each round of training is completed. For example, in early rounds (small i), such as the first and second rounds of training, the accuracy of the model is low and the accuracy of the estimated labeled samples is poor, so λ is set small, making the mixed samples closer to the labeled samples; as i increases, the accuracy of the text classification model gradually improves, and λ is increased so that the estimated labeled samples gradually come into play and the generalization capability of the model is enhanced. In some embodiments, the λ value of the ith round may be set as λ_i, with λ_i = λ_(i−1)·η, for example specifying λ_1 = 0.1 and η = 1.1.
By the method, excessive interference on training caused by samples with poor accuracy can be avoided by estimating the adjustment of the mixing coefficient, the generalization capability is gradually improved, and the training efficiency is improved.
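The geometric schedule λ_i = λ_(i−1)·η can be sketched as below; λ_1 = 0.1 and η = 1.1 follow the example above, and the function name is ours.

```python
def lambda_schedule(n_rounds, lam_1=0.1, eta=1.1):
    """Grow the first estimated mixing coefficient by the ratio eta
    after each round: lam_i = lam_(i-1) * eta, starting from lam_1."""
    lams = [lam_1]
    for _ in range(n_rounds - 1):
        lams.append(lams[-1] * eta)
    return lams
```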
Through the operations in step 204, the mixed labeled sample set (X_S′, Y_S) and the mixed sample set (X_U′, Y_U′) are obtained, where each value in Y_S is one of the integers 0, 1 and 2, and each value in Y_U′ is a real number between 0 and 2.
In step 205, the mixed labeled sample set and the mixed sample set are input into a feedforward neural network, and parameters of an encoder to be trained, a text classification model to be trained, and the feedforward neural network are adjusted according to a loss value obtained based on a loss function.
In some embodiments, let the input (X_S′, Y_S) produce the output (X_S+′, Y_S) after passing through the feedforward neural network, and the input (X_U′, Y_U′) produce the output (X_U+′, Y_U′). The outputs (X_S+′, Y_S) and (X_U+′, Y_U′) are fed into the loss function to obtain a loss value; the loss value is back-propagated, and the parameters of the encoder to be trained, the text classification model to be trained and the feedforward neural network are adjusted.
In step 206, it is determined whether all samples in the labeled sample set and the estimated labeled sample set have been extracted. If all samples in the labeled sample set and the estimated labeled sample set have been extracted, the current round of training ends; otherwise, step 201 is executed to extract the texts of the subsequent samples in the sample sets, continuing from where extraction left off in the previous batch.
By the method, sample texts can be extracted in batches for processing, and sample data can be fully utilized through multi-batch processing; the data volume of each time of processing is reduced, the operation burden of each link is reduced, and the reliability and the efficiency of training are improved.
A flow diagram of some embodiments of a parameter adjustment portion of the text classification model training method of the present disclosure is shown in fig. 3. In some embodiments, the parameter adjustments shown below may be a detailed expansion of step 205 above.
In step 301, the mixed labeled sample set and the mixed sample set are input into a feedforward neural network, and a processing result is output through a full connection layer.
In some embodiments, assuming the sample feature is x, after entering the feedforward neural network it first passes through a fully connected layer: y1 = ReLU(ω1·x + b1), where ReLU (Rectified Linear Unit) is a commonly used activation function in artificial neural networks and ω1 is a real matrix. In some embodiments, let D = 2048. It then passes through a further fully connected layer y2 = ω2·y1 + b2, where ω2 is likewise a real matrix. The feedforward neural network can strengthen the nonlinear characteristics and enhance the ability of the transfer-learned BERT to adapt to the actual task. Let the input (X_S′, Y_S) produce the output (X_S+′, Y_S) after passing through the feedforward neural network, and the input (X_U′, Y_U′) produce the output (X_U+′, Y_U′).
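The two fully connected layers, y1 = ReLU(ω1·x + b1) followed by y2 = ω2·y1 + b2, can be sketched with NumPy; the tiny dimensions in the test are for illustration only (the text uses d = 768 and D = 2048).

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """Two fully connected layers with a ReLU between them:
    y1 = ReLU(w1 @ x + b1), then y2 = w2 @ y1 + b2."""
    y1 = np.maximum(w1 @ x + b1, 0.0)  # ReLU activation
    return w2 @ y1 + b2
```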
Step 302 and step 303 may subsequently be performed in parallel.
In step 302, cross-entropy loss is obtained as the first loss value from the processing result based on the mixed labeled sample set. In some embodiments, the processing result of the feedforward neural network for the mixed labeled sample set may be input to a Softmax layer, and the cross-entropy loss is obtained as the first loss value LOSS_S.
In step 303, mean square error loss is obtained as the second loss value from the processing result based on the mixed sample set. In some embodiments, the processing result based on the mixed sample set may be input to a linear regression layer to obtain the mean square error loss as the second loss value LOSS_U.
Because Y_U′ is generated by the calculation in step 204, the value of each of its terms may be a non-integer; for example, with the category identifiers 0, 1 and 2, each term in Y_U′ can take a real value between 0 and 2. The loss on (X_U+′, Y_U′) is therefore defined by the mean square error, ensuring that the information carried in Y_U′ is handled effectively.
In step 304, a weighted value of the first loss value and the second loss value is obtained as the loss value according to predetermined loss value weights. In some embodiments, after LOSS_S and LOSS_U are obtained, based on the formula:

LOSS = δ·LOSS_U + (1 − δ)·LOSS_S

the loss value LOSS is obtained, where δ is the weight of LOSS_U and (1 − δ) is the weight of LOSS_S. In some embodiments, δ is a hyperparameter that determines the influence of the unlabeled samples in model training; it may be set empirically or adjusted during use based on the observed effect. In some embodiments, δ may take the value 0.25.
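The combined loss LOSS = δ·LOSS_U + (1 − δ)·LOSS_S can be sketched as follows; the softmax cross-entropy and mean-squared-error implementations are standard textbook forms, not the patent's own code, and the default δ = 0.25 follows the example above.

```python
import numpy as np

def combined_loss(logits_s, y_s, pred_u, y_u, delta=0.25):
    """delta * LOSS_U + (1 - delta) * LOSS_S, with cross-entropy on the
    mixed labeled batch and mean squared error on the mixed batch."""
    # LOSS_S: softmax cross-entropy against integer category identifiers
    z = logits_s - logits_s.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    loss_s = -log_p[np.arange(len(y_s)), y_s].mean()
    # LOSS_U: mean squared error against the real-valued mixed targets
    loss_u = np.mean((pred_u - y_u) ** 2)
    return delta * loss_u + (1 - delta) * loss_s
```

With uniform logits over three classes and a perfect regression prediction, LOSS_S = ln 3 and LOSS_U = 0, so the total is 0.75·ln 3.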
In step 305, the LOSS value LOSS is propagated backward, and parameters of the encoder to be trained, the text classification model to be trained, and the feedforward neural network are adjusted.
By the method, the problem that the class identifiers of the samples in the mixed sample set are non-integer can be considered, and the class identifiers are effectively utilized by selecting a proper loss function; in addition, the influence of the unlabeled sample in the training process can be flexibly adjusted through setting delta, so that a user can freely adjust the influence according to the requirements of efficiency and accuracy, and the controllability is improved.
A flow diagram of some embodiments of a text classification method of the present disclosure is shown in fig. 4.
In step 401, the text to be classified is input into a text classification model. The text classification model M_N is generated by any one of the above text classification model training methods.
In step 402, the classification estimation value output by the text classification model is taken as the category of the text to be classified. In some embodiments, the category may be an emotion category, with emotion category identifiers such as 0 for a bad rating, 1 for a medium rating and 2 for a good rating, or another mapping of ratings to identifiers.
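The category determination in step 402 can be sketched as picking the class with the highest estimated score; the score-vector interface is an assumption, since the text only specifies that the model's classification estimate becomes the category.

```python
def determine_category(scores):
    """Return the category id (e.g. 0 bad / 1 medium / 2 good) whose
    estimated score is highest."""
    return max(range(len(scores)), key=scores.__getitem__)
```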
By the method, the text classification is carried out by adopting the text classification model trained on the basis of a small amount of labeled samples, so that the sample demand in the preparation process before classification is reduced on the basis of ensuring the accuracy, and the training efficiency is improved; under the condition of less labeled samples, the accuracy of text classification can be improved.
A schematic diagram of some embodiments of the text classification model training system of the present disclosure is shown in fig. 5.
The estimated sample set obtaining unit 501 can determine, based on the text classification model to be trained in each round of training, the classification estimation value y_u of each sample u in the unlabeled sample set, obtaining the estimated labeled sample set, in which each sample is (u, y_u). The number of samples in the unlabeled sample set, that is, samples whose category is unknown, may be much larger than the number of samples in the labeled sample set.
The vector acquisition unit 502 can acquire, by the encoder to be trained, a vector of the text of the labeled sample set and a vector of the text of the estimated labeled sample set. In some embodiments, the labeled sample set may be generated by manually labeling the extracted original sample. In some embodiments, the labeled sample set may be a sample set generated after performing an enhancement operation on the labeled original sample. In some embodiments, the enhancement operation may include synonym substitution. In some embodiments, the enhancement operation may include translating the text of the sample into a second language, such as English, and then back into the original language.
A mixing unit 503 capable of generating the mixed labeled sample set and the mixed sample set. In some embodiments, the mixed labeled sample set is obtained according to the vector of the text of each labeled sample in the labeled sample set and the category identifier of each labeled sample. In some embodiments, the mixed labeled sample set may be a set that includes the vector of the text of the original sample and the category identifier of the original sample, together with the vector of the text of the enhanced sample and the category identifier of the enhanced sample. Further, the mixed sample set is acquired according to the mixed labeled sample set, the vector of the text of the estimated labeled sample set, the estimated mixing coefficient and the classification estimation value.
The parameter adjusting unit 504 can input the mixed labeled sample set and the mixed sample set into the feedforward neural network, and adjust parameters of the encoder to be trained, the text classification model to be trained, and the feedforward neural network according to a loss value obtained based on the loss function.
The model obtaining unit 505 can obtain the trained text classification model when the number of training rounds reaches the predetermined number. In some embodiments, in the 1st round of training, the text classification model to be trained is denoted model M_0; in the ith round of training, it is denoted model M_(i−1); upon completion of training, the model is M_N, which is the required text classification model.
The text classification model training system can realize the cyclic training of the text classification model through the mixed analysis of a small amount of labeled samples and unlabeled samples, and reduces the number demand of labeled samples required in the text classification model training process on the basis of ensuring the accuracy of the trained text classification model, thereby reducing the manual labeling demand and improving the model training efficiency; in the absence of labeled samples, the accuracy of the text classification model is improved.
In some embodiments, the text classification model training system may further include a sample expansion unit, which is capable of expanding the labeled sample set according to the labeled sample until the sample size in the labeled sample set is equal to the sample size in the unlabeled sample set before the vector obtaining unit 502 inputs the sample set into the encoder to be trained.
The text classification model training system can improve the utilization rate of limited labeled samples, can ensure that the samples in a labeled sample set and an unlabeled sample set are synchronously extracted and finish the extraction, reduces the probability of data processing faults, and improves the convenience and reliability of subsequent data processing.
In some embodiments, the text classification model training system may further include a sample enhancement unit, which can generate in advance, from original samples that include only the original text and its category, a labeled sample set S comprising the text of the original sample, the text of the enhanced sample and the category identifier. In some embodiments, the sample enhancement unit can perform synonym replacement on the original text to obtain the text s1 of an enhanced sample. In some embodiments, the sample enhancement unit may instead translate the original text twice, from the original language into a second language and then back into the original language, and use the resulting text as the text s2 of an enhanced sample. In some embodiments, the sample enhancement unit may perform both sample enhancement operations, generating the text s1 of the first enhanced sample and the text s2 of the second enhanced sample, and generating the samples (s, s1, s2, y_s) in the labeled sample set S.
The text classification model training system can expand the labeled sample amount with less operation amount by enhancing the sample operation mode, reduce the demand of manually labeling the sample category and improve the model training efficiency.
In some embodiments, the text classification model training system may further include: the batch extraction unit can respectively and sequentially extract the text of the original sample, the text of the enhanced sample and the text of the estimated marked sample in the expanded marked sample set and the estimated marked sample set according to the preset batch size after the marked sample set is expanded; generating a text vector to be coded according to the text of the original sample, the text of the enhanced sample and the text of the estimated marked sample, wherein the text vector to be coded comprises original sample dimensionality, enhanced sample dimensionality and estimated marked sample dimensionality, and the number of sample texts in each dimensionality accords with a preset batch size; and cutting the text in the text vector to be coded according to the upper limit of the preset text length.
The text classification model training system can extract sample texts in batches for processing, and sample data is fully utilized through multi-batch processing; the data volume of each time of processing is reduced, the operation burden of each link is reduced, and the reliability and the efficiency of training are improved.
In some embodiments, the text classification model training system may further include a coefficient adjustment unit capable of increasing the first estimated mixing coefficient by a predetermined ratio after each round of training is completed. For example, in early rounds (small i), such as the first and second rounds of training, the accuracy of the model is low and the accuracy of the estimated labeled samples is poor, so λ is set small, making the mixed samples closer to the labeled samples; as i increases and the accuracy of the model gradually improves, λ is increased so that the estimated labeled samples gradually come into play and the generalization capability of the model is enhanced. In some embodiments, the λ value of the ith round may be set as λ_i, with λ_i = λ_(i−1)·η, for example specifying λ_1 = 0.1 and η = 1.1.
The text classification model training system can avoid excessive interference of samples with poor accuracy on training through adjustment of the estimated mixing coefficient, gradually improve generalization capability and improve training efficiency.
A schematic diagram of some embodiments of the text classification system of the present disclosure is shown in fig. 6.
The text input unit 601 can input the text to be classified into a text classification model. The text classification model M_N is generated by any one of the above text classification model training methods or by any one of the above text classification model training systems.
The category determination unit 602 can take the classification estimation value output by the text classification model as the category of the text to be classified.
The text classification system can classify texts using a text classification model trained on a small number of labeled samples, reducing the sample demand during preparation before classification while ensuring accuracy, and improving training efficiency; where labeled samples are scarce, the accuracy of text classification can be improved.
A schematic diagram of one embodiment of the disclosed data processing system is shown in fig. 7. The data processing system includes a memory 701 and a processor 702. Wherein: the memory 701 may be a magnetic disk, flash memory, or any other non-volatile storage medium. The memory is for storing instructions in the text classification model training method or the corresponding embodiment of the text classification method above. Processor 702 is coupled to memory 701 and may be implemented as one or more integrated circuits, such as a microprocessor or microcontroller. The processor 702 is configured to execute instructions stored in the memory, so that the sample requirement during preparation before classification can be reduced, and the training efficiency can be improved; under the condition of less labeled samples, the accuracy of text classification can be improved.
In one embodiment, as also shown in FIG. 8, data processing system 800 includes a memory 801 and a processor 802. The processor 802 is coupled to the memory 801 by a BUS 803. The data processing system 800 may also be coupled to external storage 805 via storage interface 804 to facilitate retrieval of external data, and may also be coupled to a network or another computer system (not shown) via network interface 806. And will not be described in detail herein.
In the embodiment, the data instruction is stored in the memory, and the instruction is processed by the processor, so that the sample demand in the preparation process before classification is executed can be reduced, and the training efficiency is improved; under the condition of less labeled samples, the accuracy of text classification can be improved.
In another embodiment, a computer-readable storage medium has stored thereon computer program instructions which, when executed by a processor, implement the steps of a text classification model training method or a method in a corresponding embodiment of a text classification method. As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, apparatus, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Thus far, the present disclosure has been described in detail. Some details that are well known in the art have not been described in order to avoid obscuring the concepts of the present disclosure. It will be fully apparent to those skilled in the art from the foregoing description how to practice the presently disclosed embodiments.
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
Finally, it should be noted that: the above examples are intended only to illustrate the technical solutions of the present disclosure and not to limit them; although the present disclosure has been described in detail with reference to preferred embodiments, those of ordinary skill in the art will understand that: modifications to the specific embodiments of the disclosure or equivalent substitutions for parts of the technical features may still be made; all such modifications are intended to be included within the scope of the claims of this disclosure without departing from the spirit thereof.
Claims (19)
1. A text classification model training method, comprising: in each round of training,
determining a classification estimation value of each sample in an unlabeled sample set based on a text classification model to be trained, and acquiring an estimated labeled sample set;
obtaining a vector of the text of the labeled sample set and a vector of the text of the estimated labeled sample set by an encoder to be trained;
acquiring a mixed labeled sample set according to the vector of the text of each labeled sample in the labeled sample set and the category identification of the labeled sample; acquiring a mixed sample set according to the mixed labeled sample set, the vector of the text of the sample of the estimated labeled sample set, the classification estimation value and the estimated mixing coefficient;
inputting the mixed labeled sample set and the mixed sample set into a feedforward neural network, and adjusting parameters of the encoder to be trained, the text classification model to be trained and the feedforward neural network according to a loss value obtained based on a loss function;
and when the number of training rounds reaches the preset number, acquiring a text classification model.
2. The method of claim 1, further comprising:
before the sample set is input into the encoder to be trained, the labeled sample set is expanded according to the labeled sample until the sample amount in the labeled sample set is equal to the sample amount in the unlabeled sample set.
3. The method of claim 1, wherein each labeled sample in the set of labeled samples comprises text of an original sample, text of an enhanced sample of the original sample, and a category identification of the text of the original sample.
4. The method of claim 3, wherein,
the enhanced samples of the original sample comprise at least one of a first enhanced sample or a second enhanced sample;
the text of the first enhanced sample is generated by carrying out synonym replacement on the text of the original sample;
the text of the second enhanced sample is generated by translating the text of the original sample into a second language and then translating the result back into the original language.
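The two augmentation routes of claim 4 (synonym replacement and back-translation) can be sketched as follows. The tiny dictionaries are purely hypothetical stand-ins; a real system would use a thesaurus and call a machine-translation model twice.

```python
# Toy synonym table (hypothetical).
SYNONYMS = {"good": "great", "movie": "film"}

def synonym_replace(text):
    # First enhanced sample: replace words with synonyms.
    return " ".join(SYNONYMS.get(w, w) for w in text.split())

# Toy "round-trip translation" tables (hypothetical); in practice an MT
# model would translate to a second language and back.
EN2DE = {"good": "gut", "movie": "Film"}
DE2EN = {"gut": "good", "Film": "movie"}

def back_translate(text):
    # Second enhanced sample: translate out to the second language,
    # then translate the result back into the original language.
    forward = [EN2DE.get(w, w) for w in text.split()]
    return " ".join(DE2EN.get(w, w) for w in forward)
```

Back-translation typically yields a paraphrase rather than an identical string; the lookup-table stand-in here round-trips exactly.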
5. The method of claim 3 or 4, further comprising:
generating the labeled sample set in advance according to the original sample of the labeled category.
6. The method of claim 2, wherein,
the obtaining, by the encoder to be trained, the vector of the text of the labeled sample set and the vector of the text of the estimated labeled sample set comprises: inputting texts of samples in the labeled sample set and the estimated labeled sample set into the encoder to be trained in batches, with a preset batch size as the unit, and acquiring the vectors of the texts of the labeled sample set and the estimated labeled sample set for each batch;
the obtaining a mixed labeled sample set and the obtaining a mixed sample set comprises: acquiring the mixed labeled sample set of each batch and a mixed sample set of a corresponding batch;
adjusting parameters of the encoder to be trained, the text classification model to be trained, and the feed-forward neural network includes: and respectively inputting the mixed labeled sample set and the mixed sample set of each batch into a feedforward neural network, and adjusting parameters of the encoder to be trained, the text classification model to be trained and the feedforward neural network according to a loss value obtained based on a loss function until the mixed labeled sample set and the mixed sample set of all batches in the current training round are processed.
7. The method of claim 6, further comprising:
after the labeled sample set is expanded, according to the preset batch size, sequentially extracting a text of an original sample, a text of an enhanced sample and a text of an estimated labeled sample in the expanded labeled sample set and the estimated labeled sample set respectively, wherein the labeled sample comprises the text of the original sample and the text of the enhanced sample of the original sample;
generating a text vector to be coded according to the text of the original sample, the text of the enhanced sample and the text of the estimated labeled sample, wherein the text vector to be coded comprises original sample dimensions, enhanced sample dimensions and estimated labeled sample dimensions, and the number of the sample texts in each dimension accords with the preset batch size;
truncating the texts in the text vector to be coded according to a preset upper limit on text length;
the obtaining the vector of the text of the labeled sample set and the vector of the text of the estimated labeled sample set for each batch comprises:
inputting the cut text vectors to be coded into the coder to be trained, and acquiring the text coding vectors of the current batch;
extracting elements of the original sample dimension and the enhanced sample dimension in the text coding vector to obtain a vector of the labeled sample set;
and extracting elements of the dimensionality of the estimation labeling sample in the text coding vector to obtain the vector of the estimation labeling sample set.
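Claim 7's batch-assembly step can be sketched as below; the function name, list-of-lists layout, and truncation-by-slicing are illustrative assumptions, not the claimed data structure.

```python
def build_batch(originals, enhanced, estimated, batch_size, max_len):
    # Claim-7 sketch: stack one batch of original, enhanced, and
    # estimated-labeled texts into a three-dimensional "text vector to
    # be coded", then truncate every text to the preset upper limit
    # on text length.
    dims = [originals[:batch_size], enhanced[:batch_size], estimated[:batch_size]]
    return [[text[:max_len] for text in dim] for dim in dims]
```

After encoding, the original- and enhanced-sample dimensions would be sliced back out as the labeled-set vectors, and the remaining dimension as the estimated-labeled-set vectors.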
8. The method of claim 1, wherein the obtaining a mixed labeled sample set according to the vector of the text of each labeled sample in the labeled sample set and the category identifier of the labeled sample comprises:
acquiring a mixed labeled sample encoding according to the encoding corresponding to the original sample and the encoding corresponding to the enhanced sample in the vector of the text of each labeled sample of the labeled sample set, together with an enhanced mixing coefficient;
obtaining a vector of the mixed labeled sample set according to the mixed labeled sample encoding and the encoding corresponding to the original sample;
and acquiring the mixed labeled sample set according to the vector of the mixed labeled sample set and the category identification of the corresponding original sample.
9. The method of claim 8, wherein the obtaining mixed labeled sample encodings comprises:
and taking the enhanced mixing coefficient as the weight of the encoding corresponding to the enhanced sample, and adding the weighted encoding to the encoding corresponding to the original sample, to obtain the mixed labeled sample encoding.
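Taken literally, claim 9 adds the alpha-weighted enhanced encoding to the unweighted original encoding, as sketched below. Note this literal reading differs from a convex mixup (`alpha * enh + (1 - alpha) * orig`); the claim text supports the former, so that is what the sketch implements.

```python
import numpy as np

def mixed_labeled_encoding(orig_enc, enh_enc, alpha):
    # Claim-9 sketch, read literally: the enhanced mixing coefficient
    # alpha weights the enhanced sample's encoding, and the result is
    # added to the original sample's encoding.
    return orig_enc + alpha * enh_enc
```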
10. The method of claim 1, wherein said obtaining a mixed sample set from the vectors of text of the samples of the mixed labeled sample set, the estimated labeled sample set, and the classification estimates, and estimated mixing coefficients comprises:
acquiring a coding estimation labeling sample set according to the vector of the text of the estimation labeling sample set and the category identification of the estimation labeling sample;
for each sample in the vector of the coding estimation labeled sample set:
randomly extracting a sample from the mixed labeled sample set;
and calculating a weighted sum of the text vectors and of the category identifications, with a first estimation mixing coefficient as the weight of the sample from the coding estimation labeled sample set and a second estimation mixing coefficient as the weight of the extracted sample from the mixed labeled sample set, to obtain a sample of the mixed sample set, wherein the first estimation mixing coefficient and the second estimation mixing coefficient sum to 1.
11. The method of claim 10, further comprising:
after each round of training is completed, the first estimated mixing coefficient is increased by a predetermined ratio.
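Claim 11's per-round increase of the first estimation mixing coefficient can be sketched as below. Multiplicative growth and the cap at 1.0 are both assumptions; the claim specifies only that the coefficient is increased by a predetermined ratio after each round.

```python
def increase_coefficient(lam, ratio, rounds, cap=1.0):
    # Claim-11 sketch: after each completed training round, the first
    # estimation mixing coefficient is increased by a predetermined
    # ratio. Multiplicative growth and the cap are assumptions.
    for _ in range(rounds):
        lam = min(cap, lam * (1.0 + ratio))
    return lam
```

Growing this coefficient shifts the mixed samples toward the pseudo-labeled data as training progresses and the pseudo-labels become more reliable.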
12. The method of claim 1, wherein the inputting the mixed labeled sample set and the mixed sample set into a feed-forward neural network and adjusting parameters of the encoder to be trained, the text classification model to be trained, and the feed-forward neural network according to a loss value obtained based on a loss function comprises:
inputting the mixed labeled sample set and the mixed sample set into a feedforward neural network, and outputting a processing result through a full connection layer;
and inputting the processing result into a loss function to obtain the loss value.
13. The method of claim 12, wherein the inputting the processing result into a loss function, the obtaining the loss value comprises:
acquiring a cross-entropy loss, as a first loss value, according to the processing result based on the mixed labeled sample set;
obtaining a mean squared error loss, as a second loss value, according to the processing result based on the mixed sample set;
and acquiring a weighted sum of the first loss value and the second loss value, according to a preset loss-value weight, as the loss value.
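The two-part loss of claim 13 can be sketched directly; the weight `w` and its default value are assumptions standing in for the preset loss-value weight.

```python
import numpy as np

def combined_loss(p_labeled, y_labeled, p_mixed, y_mixed, w=0.5):
    # Claim-13 sketch: cross-entropy on the mixed labeled set's result,
    # mean squared error on the mixed set's result, combined with a
    # preset loss-value weight w (the value 0.5 is an assumption).
    ce = -np.mean(np.sum(y_labeled * np.log(p_labeled + 1e-9), axis=1))
    mse = np.mean((p_mixed - y_mixed) ** 2)
    return ce + w * mse
```

Using cross-entropy for the (harder) labeled targets and mean squared error for the soft mixed targets is a common pairing in semi-supervised mixup methods.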
14. The method of claim 1, wherein the category identification comprises an emotion category identification.
15. A method of text classification, comprising:
inputting texts to be classified into a text classification model, wherein the text classification model is generated by training according to the text classification model training method of any one of claims 1-14;
and taking the classification estimation value output by the text classification model as the class of the text to be classified.
16. A text classification model training system, comprising:
the estimated sample set obtaining unit is configured to determine a classification estimation value of each sample in an unlabeled sample set based on a text classification model to be trained in each round of training, and obtain an estimated labeled sample set;
a vector obtaining unit configured to obtain, by an encoder to be trained, a vector of a text of a labeled sample set and a vector of the text of the estimated labeled sample set;
the mixing unit is configured to obtain a mixed labeled sample set according to the vector of the text of each labeled sample in the labeled sample set and the category identification of the labeled sample; acquiring a mixed sample set according to the mixed labeled sample set, the vector of the text of the sample of the estimated labeled sample set, the classification estimation value and the estimated mixing coefficient;
a parameter adjusting unit configured to input the mixed labeled sample set and the mixed sample set into a feedforward neural network, and adjust parameters of the encoder to be trained, the text classification model to be trained, and the feedforward neural network according to a loss value obtained based on a loss function;
a model obtaining unit configured to obtain the text classification model when the number of training rounds reaches a predetermined number.
17. A text classification system comprising:
the text input unit is configured to input texts to be classified into a text classification model, wherein the text classification model is generated by training according to the text classification model training method of any one of claims 1-14;
and the class determining unit is configured to take the classification estimation value output by the text classification model as the class of the text to be classified.
18. A data processing system comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the method of any of claims 1-15 based on instructions stored in the memory.
19. A computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 15.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110494682.6A CN113177119B (en) | 2021-05-07 | 2021-05-07 | Text classification model training and classifying method and system and data processing system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113177119A true CN113177119A (en) | 2021-07-27 |
CN113177119B CN113177119B (en) | 2024-02-02 |
Family
ID=76928276
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110494682.6A Active CN113177119B (en) | 2021-05-07 | 2021-05-07 | Text classification model training and classifying method and system and data processing system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113177119B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021008037A1 (en) * | 2019-07-15 | 2021-01-21 | 平安科技(深圳)有限公司 | A-bilstm neural network-based text classification method, storage medium, and computer device |
CN111522958A (en) * | 2020-05-28 | 2020-08-11 | 泰康保险集团股份有限公司 | Text classification method and device |
CN111723209A (en) * | 2020-06-28 | 2020-09-29 | 上海携旅信息技术有限公司 | Semi-supervised text classification model training method, text classification method, system, device and medium |
CN111966831A (en) * | 2020-08-18 | 2020-11-20 | 创新奇智(上海)科技有限公司 | Model training method, text classification device and network model |
CN112214605A (en) * | 2020-11-05 | 2021-01-12 | 腾讯科技(深圳)有限公司 | Text classification method and related device |
Non-Patent Citations (1)
Title |
---|
Song Jianguo: "Research on Text Classification Based on Semi-Supervision and Weighted Word Vectors", 软件导刊 (Software Guide), no. 09 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113642659A (en) * | 2021-08-19 | 2021-11-12 | 上海商汤科技开发有限公司 | Training sample set generation method and device, electronic equipment and storage medium |
WO2023019908A1 (en) * | 2021-08-19 | 2023-02-23 | 上海商汤智能科技有限公司 | Method and apparatus for generating training sample set, and electronic device, storage medium and program |
Also Published As
Publication number | Publication date |
---|---|
CN113177119B (en) | 2024-02-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110348214B (en) | Method and system for detecting malicious codes | |
CN110362819B (en) | Text emotion analysis method based on convolutional neural network | |
US20170308526A1 (en) | Compcuter Implemented machine translation apparatus and machine translation method | |
CN111339305A (en) | Text classification method and device, electronic equipment and storage medium | |
CN111680494A (en) | Similar text generation method and device | |
CN110276071A (en) | A kind of text matching technique, device, computer equipment and storage medium | |
CN111506709B (en) | Entity linking method and device, electronic equipment and storage medium | |
CN110059183A (en) | A kind of automobile industry User Perspective sensibility classification method based on big data | |
CN113408287B (en) | Entity identification method and device, electronic equipment and storage medium | |
CN112329482A (en) | Machine translation method, device, electronic equipment and readable storage medium | |
CN113177119A (en) | Text classification model training and classifying method and system and data processing system | |
CN111680529A (en) | Machine translation algorithm and device based on layer aggregation | |
CN115080750A (en) | Weak supervision text classification method, system and device based on fusion prompt sequence | |
CN113553847A (en) | Method, device, system and storage medium for parsing address text | |
CN112906403B (en) | Semantic analysis model training method and device, terminal equipment and storage medium | |
CN111325033A (en) | Entity identification method, entity identification device, electronic equipment and computer readable storage medium | |
Latif et al. | Can large language models aid in annotating speech emotional data? uncovering new frontiers | |
CN109753646B (en) | Article attribute identification method and electronic equipment | |
CN116186562B (en) | Encoder-based long text matching method | |
CN108475265B (en) | Method and device for acquiring unknown words | |
CN112749530B (en) | Text encoding method, apparatus, device and computer readable storage medium | |
CN113157914B (en) | Document abstract extraction method and system based on multilayer recurrent neural network | |
CN113935387A (en) | Text similarity determination method and device and computer readable storage medium | |
CN117271778B (en) | Insurance outbound session information output method and device based on generation type large model | |
CN116227496B (en) | Deep learning-based electric public opinion entity relation extraction method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||