CN113177119A - Text classification model training and classifying method and system and data processing system - Google Patents

Text classification model training and classifying method and system and data processing system

Info

Publication number
CN113177119A
CN113177119A CN202110494682.6A
Authority
CN
China
Prior art keywords
text
sample set
sample
labeled
mixed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110494682.6A
Other languages
Chinese (zh)
Other versions
CN113177119B (en)
Inventor
陈龙
李宥壑
肖小范
周伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN202110494682.6A priority Critical patent/CN113177119B/en
Publication of CN113177119A publication Critical patent/CN113177119A/en
Application granted granted Critical
Publication of CN113177119B publication Critical patent/CN113177119B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/12 - Use of codes for handling textual entities
    • G06F40/126 - Character encoding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a text classification model training and classification method and system and a data processing system, and relates to the technical field of data processing. The text classification model training method comprises the following steps: in each round of training, determining a classification estimation value of each sample in an unlabeled sample set based on a text classification model to be trained, and acquiring an estimated labeled sample set; processing the labeled sample set and the estimated labeled sample set through an encoder to be trained; acquiring a mixed labeled sample set and a mixed sample set according to the processed labeled sample set, the processed estimated labeled sample set and the estimated mixing coefficient; inputting the mixed labeled sample set and the mixed sample set into a feedforward neural network, and adjusting parameters of the text classification model to be trained according to a loss function; and when the number of training rounds reaches a predetermined number, acquiring the text classification model. By this method, the need for manual labeling can be reduced and the model training efficiency improved.

Description

Text classification model training and classifying method and system and data processing system
Technical Field
The disclosure relates to the technical field of data processing, in particular to a text classification model training and classification method and system and a data processing system.
Background
User reviews are one of the basic functions of many internet websites. From the content of user comments, user feedback can be conveniently collected so that adjustments can be made accordingly.
Because the volume of user comments is huge, manual screening is hard to sustain, so screening efficiency needs to be improved through automatic machine recognition. In the recognition process, comments are generally classified into good, medium and bad reviews.
In the related art, the category of the comment can be identified by setting a predetermined rule, and for example, a comment containing a word such as "spam", "bad", or the like is determined as a bad comment.
In addition, machine learning or deep learning algorithms (e.g., RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), BiLSTM (Bidirectional Long Short-Term Memory) and the like) may be introduced; after a large number of samples are input into the model and it is trained by supervised learning, common good or bad comments can be recognized. Alternatively, a pre-trained language model (e.g., Word2Vec (Word to Vector), BERT (Bidirectional Encoder Representations from Transformers) or GPT (Generative Pre-trained Transformer)) may be introduced to bring in the semantics of a large-scale corpus, and a large number of samples are then input for fine tuning, obtaining a final model that can identify good/bad reviews.
Disclosure of Invention
One purpose of the present disclosure is to reduce the amount of data required in the text classification model training process on the basis of ensuring the accuracy of text classification.
According to an aspect of some embodiments of the present disclosure, a text classification model training method is provided, including, in each round of training:
determining a classification estimation value of each sample in an unlabeled sample set based on a text classification model to be trained, and acquiring an estimated labeled sample set; obtaining a vector of the text marked with the sample set through an encoder to be trained, and estimating the vector of the text marked with the sample set; acquiring a mixed labeled sample set according to the vector of the text of each labeled sample in the labeled sample set and the category identification of the labeled sample; acquiring a mixed sample set according to the mixed labeled sample set, the vector and the classification estimation value of the text of the sample of the estimated labeled sample set and the estimated mixing coefficient; inputting the mixed labeled sample set and the mixed sample set into a feedforward neural network, and adjusting parameters of an encoder to be trained, a text classification model to be trained and the feedforward neural network according to a loss value obtained based on a loss function; and when the number of training rounds reaches the preset number, acquiring a text classification model.
In some embodiments, the text classification model training method further comprises: before the sample set is input into the encoder to be trained, the labeled sample set is expanded according to the labeled samples until the sample amount of the labeled sample set is equal to that of the unlabeled sample set.
In some embodiments, each labeled sample in the set of labeled samples includes text of an original sample, text of an enhanced sample of an original sample, and a category identification of the original sample.
In some embodiments, the enhanced samples of the original sample comprise at least one of a first enhanced sample or a second enhanced sample; the text of the first enhanced sample is generated by carrying out synonym replacement on the text of the original sample; the text of the second enhanced sample is generated after the text of the original sample is translated into the second language and then translated back into the original language.
In some embodiments, the text classification model training method further comprises: generating the labeled sample set in advance from original samples whose categories have been labeled.
In some embodiments, obtaining, by an encoder to be trained, a vector of text of a labeled sample set, and estimating the vector of text of the labeled sample set comprises: inputting texts of samples in the labeled sample set and the estimated labeled sample set into an encoder to be trained in batches by taking the size of a preset batch as a unit, and acquiring vectors of the texts of the labeled sample set and the estimated labeled sample set of each batch; obtaining the mixed annotated sample set and obtaining the mixed sample set comprises: acquiring a mixed labeled sample set of each batch and a mixed sample set of a corresponding batch; the parameters for adjusting the encoder to be trained, the text classification model to be trained, and the feed-forward neural network include: and respectively inputting the mixed labeled sample set and the mixed sample set of each batch into a feedforward neural network, and adjusting parameters of an encoder to be trained, a text classification model to be trained and the feedforward neural network according to a loss value obtained based on a loss function until the mixed labeled sample set and the mixed sample set of all batches in the current training round are processed.
In some embodiments, the text classification model training method further comprises: after the marked sample set is expanded, according to a preset batch size, sequentially extracting a text of an original sample, a text of an enhanced sample and a text of an estimated marked sample in the expanded marked sample set and the estimated marked sample set respectively, wherein the marked sample comprises the text of the original sample and the text of the enhanced sample of an original sample; generating a text vector to be coded according to the text of the original sample, the text of the enhanced sample and the text of the estimated marked sample, wherein the text vector to be coded comprises original sample dimensionality, enhanced sample dimensionality and estimated marked sample dimensionality, and the number of sample texts in each dimensionality accords with a preset batch size; cutting a text in a text vector to be coded according to a preset text length upper limit; obtaining the vector of the text of the labeled sample set of each batch and estimating the vector of the text of the labeled sample set comprises: inputting the cut text vectors to be coded into a coder to be trained, and acquiring the text coding vectors of the current batch; extracting elements of original sample dimensionality and enhanced sample dimensionality in the text coding vector to obtain a vector of a labeled sample set; and extracting elements of the dimension of the estimation labeling sample in the text coding vector to obtain the vector of the estimation labeling sample set.
In some embodiments, obtaining the mixed labeled sample set according to the vector of the text of each labeled sample in the labeled sample set and the category identifier of the labeled sample includes: acquiring mixed marked sample codes according to the codes corresponding to the original sample and the enhanced sample in the vector of the text of each marked sample of the marked sample set and the enhanced mixed coefficient; obtaining a vector of the mixed labeled sample set according to the mixed labeled sample code and the code corresponding to the original sample; and acquiring the mixed labeled sample set according to the vector of the mixed labeled sample set and the corresponding category identification of the original sample.
In some embodiments, obtaining the mixed annotated sample encoding comprises: and taking the enhanced mixed coefficient as the weight of the code corresponding to the enhanced sample, and adding the weight and the code corresponding to the original sample to obtain the mixed marked sample code.
In some embodiments, obtaining the mixed sample set from the mixed labeled sample set, the vectors of the texts and classification estimation values of the samples of the estimated labeled sample set, and the estimated mixing coefficients comprises: acquiring an encoded estimated labeled sample set according to the vectors of the texts of the estimated labeled sample set and the category identifiers of the estimated labeled samples; for each sample in the vector of the encoded estimated labeled sample set: randomly extracting one sample from the mixed labeled sample set; and calculating the weighted sums of the text vectors and of the category identifiers, with the first estimated mixing coefficient as the weight of the sample from the encoded estimated labeled sample set and the second estimated mixing coefficient as the weight of the extracted mixed labeled sample, to acquire the samples of the mixed sample set, wherein the first estimated mixing coefficient and the second estimated mixing coefficient sum to 1.
In some embodiments, the text classification model training method further comprises: after each round of training is completed, the first estimated mixing coefficient is increased by a predetermined ratio.
In some embodiments, inputting the mixed labeled sample set and the mixed sample set into a feedforward neural network, and adjusting parameters of an encoder to be trained, a text classification model to be trained, and the feedforward neural network according to a loss value obtained based on a loss function includes: inputting the mixed labeled sample set and the mixed sample set into a feedforward neural network, and outputting a processing result through a full connection layer; and inputting the processing result into a loss function to obtain a loss value.
In some embodiments, inputting the processing result into a loss function, and obtaining the loss value comprises: acquiring cross entropy loss as a first loss value according to a processing result based on the mixed labeled sample set; acquiring a mean square error loss as a second loss value according to a processing result based on the mixed sample set; and acquiring a weighted value of the first loss value and the second loss value according to the preset loss value weight to serve as the loss value.
In some embodiments, the category identification comprises an emotion category identification.
According to an aspect of some embodiments of the present disclosure, there is provided a text classification method, including: inputting a text to be classified into a text classification model, wherein the text classification model is generated by training according to any one of the text classification model training methods mentioned above; and taking the classification estimation value output by the text classification model as the class of the text to be classified.
According to an aspect of some embodiments of the present disclosure, there is provided a text classification model training system, including: the estimated sample set obtaining unit is configured to determine a classification estimation value of each sample in an unlabeled sample set based on a text classification model to be trained in each round of training, and obtain an estimated labeled sample set; a vector acquisition unit configured to acquire, by an encoder to be trained, a vector of the text of the labeled sample set, and to estimate the vector of the text of the labeled sample set; the mixing unit is configured to obtain a mixed labeled sample set according to the vector of the text of each labeled sample in the labeled sample set and the category identification of the labeled sample; acquiring a mixed sample set according to the mixed labeled sample set, the vector and the classification estimation value of the text of the sample of the estimated labeled sample set and the estimated mixing coefficient; the parameter adjusting unit is configured to input the mixed labeled sample set and the mixed sample set into a feedforward neural network, and adjust parameters of an encoder to be trained, a text classification model to be trained and the feedforward neural network according to a loss value obtained based on a loss function; a model obtaining unit configured to obtain the text classification model when the number of training rounds reaches a predetermined number of times.
According to an aspect of some embodiments of the present disclosure, there is provided a text classification system including: the text input unit is configured to input texts to be classified into a text classification model, wherein the text classification model is generated by training according to any one of the text classification model training methods; and the class determination unit is configured to take the classification estimation value output by the text classification model as the class of the text to be classified.
According to an aspect of some embodiments of the present disclosure, there is provided a data processing system, comprising: a memory; and a processor coupled to the memory, the processor configured to perform any of the methods mentioned above based on instructions stored in the memory.
According to an aspect of some embodiments of the present disclosure, a computer-readable storage medium is proposed, on which computer program instructions are stored, which instructions, when executed by a processor, implement the steps of any one of the methods mentioned above.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and not to limit the disclosure. In the drawings:
fig. 1 is a flow diagram of some embodiments of a text classification model training method of the present disclosure.
FIG. 2 is a flow diagram of some embodiments of a single batch per round of data processing in a text classification model training method of the present disclosure.
FIG. 3 is a flow diagram of some embodiments of parameter adjustment in a text classification model training method of the present disclosure.
Fig. 4 is a flow diagram of some embodiments of a text classification method of the present disclosure.
FIG. 5 is a schematic diagram of some embodiments of a text classification model training system of the present disclosure.
Fig. 6 is a schematic diagram of some embodiments of a text classification system of the present disclosure.
FIG. 7 is a schematic diagram of some embodiments of data processing systems of the present disclosure.
FIG. 8 is a schematic diagram of further embodiments of data processing systems according to the present disclosure.
Detailed Description
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
A flow diagram of some embodiments of a text classification model training method of the present disclosure is shown in fig. 1.
In step 101, in the current round of training, the classification estimation value yu of each sample u in the unlabeled sample set is determined based on the text classification model to be trained, and an estimated labeled sample set is obtained, where each sample in the estimated labeled sample set is (u, yu). The number of samples in the unlabeled sample set, that is, samples whose category is unknown, may be much larger than the number of samples in the labeled sample set.
The category can be an emotion category, and the identification comprises an emotion category identification, for example, the bad score is 0, the medium score is 1, and the good score is 2; or a good score of 2, a medium score of 1, a poor score of 0, etc. The classification estimation value is a class identifier corresponding to a sample estimated by the text classification model to be trained, and for example, the estimation value is any one of 0,1 and 2.
In some embodiments, the text classification model to be trained is a machine learning model, and a public Chinese emotion analysis data set can be used as a transfer learning training sample in advance for training, for example, data in public Chinese microblog emotion analysis is used for training, so that universal basic training is realized, the number of rounds required by subsequent training is reduced, and the training efficiency is improved.
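For illustration only, the sketch below shows one way the estimation of step 101 could be carried out in Python; the callable classify_fn and the shape of its output are assumptions and not part of the disclosure, which only requires that the text classification model to be trained produce a classification estimation value for each unlabeled sample.

```python
import numpy as np

def estimate_labels(classify_fn, unlabeled_texts):
    """Build the estimated labeled sample set (u, yu) for one training round.

    classify_fn: hypothetical callable wrapping the text classification model
    to be trained; assumed to return class probabilities of shape
    (num_texts, 3) over the identifiers 0 (bad), 1 (medium), 2 (good).
    """
    probs = classify_fn(unlabeled_texts)              # (N, 3) class probabilities
    y_u = np.argmax(probs, axis=-1)                   # classification estimation value per sample
    return list(zip(unlabeled_texts, y_u.tolist()))   # estimated labeled sample set
```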
In step 102, a vector of the text of the labeled sample set is obtained by the encoder to be trained, and the vector of the text of the labeled sample set is estimated. In some embodiments, the labeled sample set may be generated by manually labeling the extracted original sample.
In some embodiments, the labeled sample set may be a sample set generated after performing an enhancement operation on the labeled original sample. In some embodiments, the enhancement operation may include synonym substitution. In some embodiments, the enhancement operation may include translating the text of the sample into a second language, such as English, and then back into the original language.
In some embodiments, each labeled sample in the labeled sample set includes the text of an original sample, the text of an enhanced sample of that original sample, and the category identifier of the original sample. For example, if the original sample has text s, the enhanced sample has text s1 and the category identifier of the original sample is ys, then the sample in the labeled sample set is (s, s1, ys). In some embodiments, each labeled sample may include the texts of two enhanced samples generated by different enhancement operations, for example, the sample in the labeled sample set is (s, s1, s2, ys).
In some embodiments, the labeled sample set may be randomly shuffled, for example using a shuffle function of TensorFlow, so as to increase the randomness of the sample order in the labeled sample set and reduce training bias.
In step 103, a mixed labeled sample set is obtained according to the vector of the text of each labeled sample in the labeled sample set and the category identifier of each labeled sample.
In some embodiments, the set of mixed labeled samples may be a set that includes a vector of text of the original sample and a category identification of the original sample, and a vector of text of the enhanced sample and a category identification of the enhanced sample.
In some embodiments, the text vectors in the labeled sample set may be blended with each other to generate text that blends the samples in the labeled sample set. In some embodiments, the blending of the set of labeled samples may include a text vector of an original sample in the labeled sample, and a weighted sum of the text vector of the original sample and a text vector of an enhanced sample corresponding to the original sample, where the enhanced blending coefficient is a weight of the text vector of the enhanced sample.
In step 104, a mixed sample set is obtained according to the mixed labeled sample set, the vectors and classification estimation values of the texts of the estimated labeled sample set, and the estimated mixing coefficients. In some embodiments, the estimated mixing coefficients may include a first estimated mixing coefficient and a second estimated mixing coefficient used as the weights of the samples from the estimated labeled sample set and from the mixed labeled sample set, respectively, and the weighted sums of the text vectors and of the category identifiers are taken as the samples of the mixed sample set. In some embodiments, samples may be randomly drawn from the mixed labeled sample set and mixed with the vectors of the texts of the estimated labeled sample set and their category identifiers.
In step 105, the mixed labeled sample set and the mixed sample set are input into a feedforward neural network, and parameters of an encoder to be trained, a text classification model to be trained, and the feedforward neural network are adjusted according to a loss value obtained based on a loss function.
In step 106, it is determined whether the number of training rounds has reached the predetermined number. In some embodiments, the number of training rounds may be set to N (N is a preset positive integer), and the current round is denoted i. When i is less than or equal to N, i is incremented by 1, a new batch of samples is obtained, and step 101 is executed again. If i > N, step 107 is performed.
In step 107, the training of the text classification model to be trained is completed, and the trained text classification model is obtained. In some embodiments, in the 1st round of training the text classification model to be trained is denoted model M0; in the i-th round of training it is denoted model M(i-1); when training is completed the model is MN. MN is the required text classification model.
By the method, the cyclic training of the text classification model can be realized through the mixed analysis of a small amount of labeled samples and unlabeled samples, and the requirement on the number of labeled samples required in the training process of the text classification model is reduced on the basis of ensuring the accuracy of the trained text classification model, so that the manual labeling requirement is reduced, and the model training efficiency is improved; in the absence of labeled samples, the accuracy of the text classification model is improved.
In some embodiments, before starting the current round of training shown in step 101, a labeled sample set S' including the text of the original sample, the text of the enhanced sample and the category identification may be generated in advance from the sample set S that includes only the original samples and their category identifiers.
In some embodiments, synonym substitution is performed on the original text. For example, "used it for a while, large capacity, can download and store information at will, and charges quickly" may be replaced with "used it for a while, large capacity, can download and store data at will, and charges quickly". In some embodiments, synonym replacement may be performed on the original text using a published Chinese synonym-forest dictionary, resulting in the text of an enhanced sample.
In some embodiments, the original text may be translated twice, from the original language to the second language, and then translated back to the original language, with the translated back text being used as the text of the enhanced sample. For example, translate Chinese to English and then back to Chinese. For example, "things are good, cost performance is high, express delivery is fast" is translated into "It's good, cost-effective and fast delivery", and then translated back into Chinese "good, cost performance is high, delivery is fast". In some embodiments, any machine translation engine may be employed, for example using an Apertium translation engine.
In some embodiments, the sample enhancement operation may be performed in both of the above ways at the same time, generating the text s1 of the first enhanced sample and the text s2 of the second enhanced sample separately, and producing samples (s, s1, s2, ys) in the labeled sample set S'.
By this method, the amount of labeled samples can be expanded through low-cost sample enhancement operations, reducing the need for manually labeling sample categories and improving model training efficiency.
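A minimal sketch of how the two enhancement operations could be combined when building the labeled sample set S'; the helpers replace_synonyms and back_translate are hypothetical stand-ins for a synonym-dictionary lookup and a round-trip machine translation engine, which the disclosure leaves open.

```python
def build_labeled_sample_set(original_samples, replace_synonyms, back_translate):
    """Sketch of labeled-set construction with two enhancement operations.

    original_samples: iterable of (text, category identifier) pairs.
    replace_synonyms, back_translate: hypothetical callables standing in for
    the synonym replacement and the original -> second language -> original
    round-trip translation described above.
    """
    labeled = []
    for s, y_s in original_samples:
        s1 = replace_synonyms(s)           # first enhanced sample
        s2 = back_translate(s)             # second enhanced sample
        labeled.append((s, s1, s2, y_s))   # sample of the labeled sample set S'
    return labeled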
In some embodiments, between the above steps 101 and 102, the labeled sample set may be expanded according to the labeled sample until the amount of samples in the labeled sample set is equal to the amount of samples in the unlabeled sample set. In some embodiments, the expansion method may be to sequentially take samples from the labeled sample set and append the samples to the tail of the labeled sample set until the amount of samples in the labeled sample set is equal to the amount of samples in the unlabeled sample set.
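A minimal sketch of the expansion described above, assuming the labeled samples are held in a Python list; the function name is illustrative.

```python
from itertools import cycle

def expand_labeled_set(labeled_samples, unlabeled_count):
    """Append labeled samples cyclically to the tail of the labeled set until
    its size equals the size of the unlabeled sample set."""
    expanded = list(labeled_samples)
    source = cycle(labeled_samples)
    while len(expanded) < unlabeled_count:
        expanded.append(next(source))
    return expanded
```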
By the method, the utilization rate of limited marked samples can be improved, and the convenience and reliability of subsequent data processing can be improved.
In some embodiments, samples may be batched into the encoder to be trained in each pass, reducing the data processing burden on the encoder. A flow diagram of some embodiments of a single batch data processing per round in a text classification model training method of the present disclosure is shown in fig. 2.
In step 201, according to the predetermined batch size b, the texts of the original samples, the texts of the enhanced samples and the texts of the estimated labeled samples are sequentially extracted from the expanded labeled sample set and the estimated labeled sample set, respectively. The batch size refers to the number of samples in a single batch. In some embodiments, sample extraction may be performed using batch sampling (Batch Sample). In some embodiments, before the texts are extracted in batches, the labeled sample set is expanded according to the labeled samples until the sample amount of the labeled sample set equals the sample amount of the unlabeled sample set, so that extraction from the labeled sample set and the unlabeled sample set starts and finishes synchronously, reducing the probability of data processing failure. Suppose the batch of samples taken from the labeled sample set consists of the original sample texts, the enhanced sample texts and the labeled category identifiers YS, and the batch taken from U' consists of the estimated labeled sample texts and the estimated category identifiers YU; each group of texts comprises b texts, and YS and YU are respectively the b labeled and estimated category identifiers (0/1/2).
In step 202, the text s of each extracted original sample, the texts of the enhanced samples (taking s1 and s2 as an example, where both a first and a second enhanced sample exist) and the text u of the estimated labeled sample are used to generate a text vector T to be encoded. In some embodiments, the number of texts of each kind conforms to the predetermined batch size b (in some embodiments, b may be 16): the batch of original sample texts, the batches of enhanced sample texts and the batch of estimated labeled sample texts are extracted and spliced (for example, using the concat function of TensorFlow to combine them into one vector by increasing the vector dimension), generating the text vector T to be encoded of the current batch, whose dimension is 4b.
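As a non-limiting illustration, the sketch below splices one batch of texts in the dimension order described for step 202; it works on raw Python text lists rather than the TensorFlow vectors mentioned in the disclosure, and the function name is illustrative.

```python
def build_batch_to_encode(s_batch, s1_batch, s2_batch, u_batch):
    """Splice one batch of original, enhanced and estimated-labeled texts into
    the sequence T to be encoded (length 4*b), keeping the dimension order:
    original, first enhanced, second enhanced, estimated labeled."""
    assert len(s_batch) == len(s1_batch) == len(s2_batch) == len(u_batch)
    return list(s_batch) + list(s1_batch) + list(s2_batch) + list(u_batch)
```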
In step 203, the text vector T to be encoded of the current batch is input into the encoder to be trained, and the vector of the text of the labeled sample set is obtained, including the vector X of the texts of the original samples and the vectors of the texts of the enhanced samples, such as X1 and X2; the vector XU of the texts of the estimated labeled sample set can also be obtained. In some embodiments, the encoder to be trained may be a BERT encoder to be trained.
In some embodiments, after the text vector T to be encoded passes through the encoder to be trained, a vector formed by the encoding results of the original sample texts, the enhanced sample texts and the estimated labeled sample texts, such as (X, X1, X2, XU), can be generated and then split by dimension into the original sample dimension X, the enhanced sample dimensions X1 and X2, and the estimated labeled sample dimension XU.
In some embodiments, the text vector T to be encoded can also be preprocessed before being input into the encoder to be trained, for example by clipping the texts in the text vector to be encoded according to a predetermined upper limit on text length. Taking T with 4×b texts as an example, clipping means that the part of each text whose length exceeds L is discarded (assuming L = 512); T is then input into the encoder as a small batch of 64 samples. For each sample text, the encoder outputs an encoding result. In some embodiments, the encoder encodes each text as a d-dimensional vector, for example with d = 768.
In step 204, a mixed labeled sample set and a mixed sample set of the current batch are obtained.
In some embodiments, the mixed labeled sample set may be obtained according to the vector and the enhanced mixing coefficient of the text of each labeled sample in the labeled sample set, and the category identifier of each sample. In some embodiments, the mixing of the text vectors in the labeled sample set may include the text vector of the original sample in the labeled sample, and a weighted sum of the text vector of the original sample and the text vector of the enhanced sample corresponding to the original sample, where the enhanced mixing coefficient is a weight of the text vector of the enhanced sample.
In some embodiments, the mixed labeled sample encodings X1' and X2' may be obtained from the vector X corresponding to the original samples, the vectors X1 and X2 corresponding to the enhanced samples, and the enhanced mixing coefficient μ; for example, the vector corresponding to each enhanced sample is weighted by the enhanced mixing coefficient and added to the vector corresponding to the original sample to obtain the corresponding mixed labeled sample vector, that is:
X1' = X + μX1
X2' = X + μX2
Further, the vector (X, X1', X2') of the mixed labeled sample set is obtained from the mixed labeled sample vectors and the vector corresponding to the original samples; combining the vector of the mixed labeled sample set with the corresponding category identifiers YS of the original samples gives the mixed labeled sample set ((X, YS), (X1', YS), (X2', YS)). After this operation, the number of samples in the mixed labeled sample set is 3b, abbreviated (XS', YS), where each sample is (xs', ys). In some embodiments, for each labeled sample in the labeled sample set, the mixed labeled sample encodings x1' and x2' are obtained in the vector of the text of that labeled sample from the encoding x corresponding to the original sample, the encodings x1 and x2 corresponding to the enhanced samples, and the enhanced mixing coefficient μ, e.g. x1' = x + μx1 and x2' = x + μx2; the mixed labeled sample set then contains three samples corresponding to the same original sample, namely (x, ys), (x1', ys) and (x2', ys).
The inventors found that, in the sample enhancement process, the reasonable classification of the enhanced text is consistent with that of the original sample in most cases, but in individual cases the semantics of the enhanced sample can be distorted or even reversed. For example, a comment on a certain charger reading "too amazing" may, after round-trip translation via "It's amazing", come back with its semantics distorted. Such enhanced samples can adversely affect subsequent analysis.
With the method of this embodiment, the semantic distortion or even reversal that may occur during sample enhancement can be mitigated. The parameter μ is a tunable hyperparameter: the larger μ is, the stronger the interference of the enhanced samples with the model; the smaller μ is, the weaker the effect of sample enhancement. In some embodiments, μ may be set to 0.5, reducing the negative impact of possible semantic distortion when expanding the samples with enhanced samples and improving the accuracy of the trained model.
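A minimal numpy sketch of the enhanced-sample mixing X1' = X + μX1 and X2' = X + μX2 described for step 204; the array shapes and the default μ = 0.5 are assumptions consistent with the text, not a definitive implementation.

```python
import numpy as np

def mix_labeled_codes(x, x1, x2, y_s, mu=0.5):
    """Enhanced-sample mixing: each original sample contributes three mixed
    labeled samples (X, X1', X2'), all keeping the original identifiers YS."""
    x1_mixed = x + mu * x1
    x2_mixed = x + mu * x2
    xs_mixed = np.concatenate([x, x1_mixed, x2_mixed], axis=0)   # 3b codes
    ys_mixed = np.concatenate([y_s, y_s, y_s], axis=0)           # 3b identifiers
    return xs_mixed, ys_mixed
```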
In some embodiments, on the basis of obtaining the mixed labeled sample set by any one of the above manners, the samples in the mixed labeled sample set may be further mixed with the vector of the text of the estimated labeled sample set and the category identification of the estimated labeled sample thereof to obtain a mixed sample set.
In some embodiments, an encoded estimated labeled sample set (XU, YU) is obtained from the vector XU of the texts of the estimated labeled sample set and the category identifiers YU of the estimated labeled samples; the encoded estimated labeled sample set (XU, YU) includes b samples. For each sample (xU, yU) in the vector of the encoded estimated labeled sample set, one sample (xs', ys) is randomly drawn from the mixed labeled sample set (XS', YS), which includes 3b samples. The weighted sums of the text vectors and of the category identifiers are then calculated, with the first estimated mixing coefficient λ as the weight of the sample from the encoded estimated labeled sample set and the second estimated mixing coefficient (1-λ) as the weight of the drawn mixed labeled sample, that is, according to the formulas:
xU' = λxU + (1-λ)xs'
yU' = λyU + (1-λ)ys
a sample (xU', yU') of the mixed sample set is obtained. By mixing each sample of (XU, YU) in this way, the mixed sample set (XU', YU') is obtained.
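A minimal numpy sketch of forming the mixed sample set from the encoded estimated labeled samples and the mixed labeled samples; the random pairing strategy shown (one uniform draw per estimated sample) is an assumption consistent with the description above.

```python
import numpy as np

def mix_with_estimated(x_u, y_u, xs_mixed, ys_mixed, lam):
    """Blend each encoded estimated labeled sample with one randomly drawn
    mixed labeled sample, with weights lam and (1 - lam)."""
    idx = np.random.randint(0, len(xs_mixed), size=len(x_u))   # one draw per estimated sample
    x_mixed = lam * x_u + (1.0 - lam) * xs_mixed[idx]
    y_mixed = (lam * np.asarray(y_u, dtype=float)
               + (1.0 - lam) * np.asarray(ys_mixed, dtype=float)[idx])
    return x_mixed, y_mixed        # category values become real numbers in [0, 2]
```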
In some embodiments, the first estimated mixture coefficient λ is a variable hyperparameter, and as λ is larger, the mixed samples approach the estimated labeled samples more closely; as λ gets smaller, the mixed sample gets closer to the labeled sample. In some embodiments, λ may be set manually empirically.
In other embodiments, λ may be a dynamic value, and the first estimated mixing coefficient is increased by a predetermined ratio after each round of training is completed. For example, in the early stages of training when i is small, such as the first and second rounds, the accuracy of the model is low and the accuracy of the estimated labeled samples is poor, so λ is set small and the mixed samples stay closer to the labeled samples; as i increases and the accuracy of the text classification model gradually improves, λ is increased so that the estimated labeled samples gradually come into play and the generalization capability of the model is enhanced. In some embodiments, the λ value of the i-th round may be denoted λi; then λi = λ(i-1)·η, for example with λ1 = 0.1 and η = 1.1.
By the method, excessive interference on training caused by samples with poor accuracy can be avoided by estimating the adjustment of the mixing coefficient, the generalization capability is gradually improved, and the training efficiency is improved.
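A one-line sketch of the dynamic coefficient schedule λi = λ(i-1)·η; whether λ should be capped is not stated in the disclosure, so no cap is applied here.

```python
def lambda_for_round(i, lambda_1=0.1, eta=1.1):
    """Dynamic first estimated mixing coefficient:
    lambda_i = lambda_(i-1) * eta, i.e. lambda_1 * eta ** (i - 1)."""
    return lambda_1 * eta ** (i - 1)
```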
Through the operations in step 204, the mixed labeled sample set (XS', YS) and the mixed sample set (XU', YU') are obtained, where the values of YS are integers from {0, 1, 2} and the values of YU' are real numbers between 0 and 2.
In step 205, the mixed labeled sample set and the mixed sample set are input into a feedforward neural network, and parameters of an encoder to be trained, a text classification model to be trained, and the feedforward neural network are adjusted according to a loss value obtained based on a loss function.
In some embodiments, when (XS', YS) passes through the feedforward neural network the output is (XS+', YS), and when (XU', YU') is input the output is (XU+', YU'). The outputs (XS+', YS) and (XU+', YU') are input into the loss function to obtain a loss value, the loss value is back-propagated, and the parameters of the encoder to be trained, the text classification model to be trained and the feedforward neural network are adjusted.
In step 206, it is determined whether all samples in the labeled sample set and the estimated labeled sample set have been extracted. If all samples in the labeled sample set and the estimated labeled sample set have been extracted, the current round of training ends; otherwise, step 201 is executed to extract the texts of the subsequent samples in the sample sets, continuing from where the previous batch left off.
By the method, sample texts can be extracted in batches for processing, and sample data can be fully utilized through multi-batch processing; the data volume of each time of processing is reduced, the operation burden of each link is reduced, and the reliability and the efficiency of training are improved.
A flow diagram of some embodiments of a parameter adjustment portion of the text classification model training method of the present disclosure is shown in fig. 3. In some embodiments, the parameter adjustments shown below may be a detailed expansion of step 205 above.
In step 301, the mixed labeled sample set and the mixed sample set are input into a feedforward neural network, and a processing result is output through a full connection layer.
In some embodiments, assuming the sample feature is x, after entering the feedforward neural network it first passes through a fully connected layer: y1 = Relu(ω1·x + b1), where ω1 and b1 are a real-valued weight matrix and bias, and Relu (Rectified Linear Unit) is a commonly used activation function in artificial neural networks. In some embodiments, the hidden dimension D is set to 2048. It then passes through another fully connected layer: y2 = ω2·y1 + b2, where ω2 and b2 are likewise real-valued. The feedforward neural network strengthens the nonlinear characteristics and enhances the ability of the transfer-learned BERT to adapt to the actual task. Let the output of (XS', YS) after the feedforward neural network be (XS+', YS), and the output of (XU', YU') be (XU+', YU').
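A sketch of the two fully connected layers of step 301 using tf.keras; the output width of the second layer (3 logits for the identifiers 0/1/2) is an assumption, since the disclosure routes the result to a Softmax layer for the labeled branch and a regression layer for the mixed branch.

```python
import tensorflow as tf

def build_feedforward_head(d=768, hidden=2048, num_classes=3):
    """Feedforward network: y1 = Relu(w1*x + b1) with hidden width D = 2048,
    followed by a second fully connected layer y2 = w2*y1 + b2."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(hidden, activation="relu", input_shape=(d,)),
        tf.keras.layers.Dense(num_classes),   # assumed output width: 3 logits
    ])
```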
Step 302 and step 303 may subsequently be performed in parallel.
In step 302, the cross-entropy loss is obtained as the first loss value from the processing result based on the mixed labeled sample set. In some embodiments, the processing result of the feedforward neural network for the mixed labeled sample set may be input into a Softmax layer, and the cross-entropy loss is obtained as the first loss value LOSSS.
In step 303, the mean square error loss is obtained as the second loss value from the processing result based on the mixed sample set. In some embodiments, the processing result based on the mixed sample set may be input into a linear regression layer to obtain the mean square error loss as the second loss value LOSSU.
Because YU' is generated by the calculation in step 204, the value of each of its terms may be a non-integer; for example, with category identifiers 0, 1 and 2, each term in YU' can take real values between 0 and 2. The loss on (XU+', YU') is therefore defined by the mean square error, ensuring that the information in YU' is handled effectively.
In step 304, a weighted value of the first loss value and the second loss value is obtained as the loss value according to a predetermined loss value weight. In some embodiments, after LOSSS and LOSSU are obtained, the loss value LOSS is obtained based on the formula:
LOSS = δ*LOSSU + (1-δ)*LOSSS
where δ is the weight of LOSSU and (1-δ) is the weight of LOSSS. In some embodiments, δ is a hyperparameter that determines the influence of the unlabeled samples in model training; it may be set empirically or adjusted during use according to the effect. In some embodiments, δ may take the value 0.25.
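A sketch of the combined loss LOSS = δ*LOSSU + (1-δ)*LOSSS using tf.keras losses; logits_s is assumed to be the Softmax-branch logits for the mixed labeled samples and pred_u the scalar output of the regression branch for the mixed samples, both of which the disclosure leaves to the implementer.

```python
import tensorflow as tf

def combined_loss(logits_s, y_s, pred_u, y_u_mixed, delta=0.25):
    """Cross-entropy LOSSS on the mixed labeled samples plus mean squared
    error LOSSU on the mixed samples, weighted by delta and (1 - delta)."""
    loss_s = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(y_s, logits_s, from_logits=True))
    loss_u = tf.reduce_mean(tf.square(tf.cast(y_u_mixed, tf.float32) - pred_u))
    return delta * loss_u + (1.0 - delta) * loss_s
```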
In step 305, the LOSS value LOSS is propagated backward, and parameters of the encoder to be trained, the text classification model to be trained, and the feedforward neural network are adjusted.
By the method, the problem that the class identifiers of the samples in the mixed sample set are non-integer can be considered, and the class identifiers are effectively utilized by selecting a proper loss function; in addition, the influence of the unlabeled sample in the training process can be flexibly adjusted through setting delta, so that a user can freely adjust the influence according to the requirements of efficiency and accuracy, and the controllability is improved.
A flow diagram of some embodiments of a text classification method of the present disclosure is shown in fig. 4.
In step 401, the text to be classified is input into the text classification model. The text classification model MN is generated by any one of the text classification model training methods described above.
In step 402, the classification estimation value output by the text classification model is taken as the category of the text to be classified. In some embodiments, the category may be an emotion category, and the identification includes an emotion category identification, such as a bad rating of 0, a medium rating of 1, and a good rating of 2; or a good score of 2, a medium score of 1, a poor score of 0, etc.
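A minimal inference sketch combining the components assumed in the earlier sketches (tokenizer, encoder and head); it is illustrative only, since the disclosure leaves the serving details of the trained model MN open.

```python
import tensorflow as tf

def classify_text(text, tokenizer, encoder, head):
    """Encode the text to be classified with the trained encoder, run the
    trained head, and take the arg-max identifier (0 bad / 1 medium / 2 good)
    as the category of the text."""
    inputs = tokenizer([text], truncation=True, max_length=512,
                       padding=True, return_tensors="tf")
    code = encoder(**inputs).last_hidden_state[:, 0, :]
    logits = head(code)
    return int(tf.argmax(logits, axis=-1)[0])
```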
By the method, the text classification is carried out by adopting the text classification model trained on the basis of a small amount of labeled samples, so that the sample demand in the preparation process before classification is reduced on the basis of ensuring the accuracy, and the training efficiency is improved; under the condition of less labeled samples, the accuracy of text classification can be improved.
A schematic diagram of some embodiments of the text classification model training system of the present disclosure is shown in fig. 5.
The estimated sample set obtaining unit 501 can determine, based on the text classification model to be trained in each round of training, the classification estimation value yu of each sample u in the unlabeled sample set, and obtain an estimated labeled sample set in which each sample is (u, yu). The number of samples in the unlabeled sample set, that is, samples whose category is unknown, may be much larger than the number of samples in the labeled sample set.
The vector acquisition unit 502 can acquire a vector of the text of the labeled sample set by the encoder to be trained, and estimate the vector of the text of the labeled sample set. In some embodiments, the labeled sample set may be generated by manually labeling the extracted original sample. In some embodiments, the labeled sample set may be a sample set generated after performing an enhancement operation on the labeled original sample. In some embodiments, the enhancement operation may include synonym substitution. In some embodiments, the enhancement operation may include translating the text of the sample into a second language, such as English, and then back into the original language.
A mixing unit 503 is capable of generating a mixed labeled sample set and a mixed sample set. In some embodiments, the mixed labeled sample set is obtained according to the vector of the text of each labeled sample in the labeled sample set and the category identifier of each labeled sample. In some embodiments, the mixed labeled sample set may be a set that includes the vector of the text of the original sample and the category identifier of the original sample, and the vector of the text of the enhanced sample and the category identifier of the enhanced sample. Further, the mixed sample set is obtained according to the mixed labeled sample set, the vectors and classification estimation values of the texts of the estimated labeled sample set, and the estimated mixing coefficients.
The parameter adjusting unit 504 can input the mixed labeled sample set and the mixed sample set into the feedforward neural network, and adjust parameters of the encoder to be trained, the text classification model to be trained, and the feedforward neural network according to a loss value obtained based on the loss function.
The model obtaining unit 505 can obtain the trained text classification model when the number of training rounds reaches the predetermined number. In some embodiments, in the 1st round of training the text classification model to be trained is denoted model M0; in the i-th round of training it is denoted model M(i-1); when training is completed the model is MN. MN is the required text classification model.
The text classification model training system can realize the cyclic training of the text classification model through the mixed analysis of a small amount of labeled samples and unlabeled samples, and reduces the number demand of labeled samples required in the text classification model training process on the basis of ensuring the accuracy of the trained text classification model, thereby reducing the manual labeling demand and improving the model training efficiency; in the absence of labeled samples, the accuracy of the text classification model is improved.
In some embodiments, the text classification model training system may further include a sample expansion unit, which is capable of expanding the labeled sample set according to the labeled sample until the sample size in the labeled sample set is equal to the sample size in the unlabeled sample set before the vector obtaining unit 502 inputs the sample set into the encoder to be trained.
The text classification model training system can improve the utilization rate of limited labeled samples, can ensure that the samples in a labeled sample set and an unlabeled sample set are synchronously extracted and finish the extraction, reduces the probability of data processing faults, and improves the convenience and reliability of subsequent data processing.
In some embodiments, the text classification model training system may further include a sample enhancement unit, which can generate in advance, from the sample set S that includes only the original samples and their category identifiers, a labeled sample set S' comprising the text of the original sample, the text of the enhanced sample and the category identification. In some embodiments, the sample enhancement unit can perform synonym replacement on the original text to obtain the text s1 of an enhanced sample. In some embodiments, the sample enhancement unit may also translate the original text twice, from the original language to the second language and then back to the original language, and use the translated-back text as the text s2 of an enhanced sample. In some embodiments, the sample enhancement unit may perform the sample enhancement operation in both of the above ways at the same time, generating the text s1 of the first enhanced sample and the text s2 of the second enhanced sample separately, and producing samples (s, s1, s2, ys) in the labeled sample set S'.
The text classification model training system can expand the labeled sample amount with less operation amount by enhancing the sample operation mode, reduce the demand of manually labeling the sample category and improve the model training efficiency.
In some embodiments, the text classification model training system may further include: the batch extraction unit can respectively and sequentially extract the text of the original sample, the text of the enhanced sample and the text of the estimated marked sample in the expanded marked sample set and the estimated marked sample set according to the preset batch size after the marked sample set is expanded; generating a text vector to be coded according to the text of the original sample, the text of the enhanced sample and the text of the estimated marked sample, wherein the text vector to be coded comprises original sample dimensionality, enhanced sample dimensionality and estimated marked sample dimensionality, and the number of sample texts in each dimensionality accords with a preset batch size; and cutting the text in the text vector to be coded according to the upper limit of the preset text length.
The text classification model training system can extract sample texts in batches for processing, and sample data is fully utilized through multi-batch processing; the data volume of each time of processing is reduced, the operation burden of each link is reduced, and the reliability and the efficiency of training are improved.
In some embodiments, the text classification model training system may further include a coefficient adjustment unit capable of increasing the first estimated mixing coefficient by a predetermined ratio after each round of training is completed. For example, in the early stages of training when i is small, such as the first and second rounds, the accuracy of the model is low and the accuracy of the estimated labeled samples is poor, so λ is set small and the mixed samples stay closer to the labeled samples; as i increases and the accuracy of the model gradually improves, λ is increased so that the estimated labeled samples gradually come into play and the generalization capability of the model is enhanced. In some embodiments, the λ value of the i-th round may be denoted λi; then λi = λ(i-1)·η, for example with λ1 = 0.1 and η = 1.1.
The text classification model training system can avoid excessive interference of samples with poor accuracy on training through adjustment of the estimated mixing coefficient, gradually improve generalization capability and improve training efficiency.
A schematic diagram of some embodiments of the text classification system of the present disclosure is shown in fig. 6.
The text input unit 601 can input the text to be classified into the text classification model. The text classification model MN is generated by any one of the text classification model training methods described above, or by any one of the text classification model training systems described above.
The category determination unit 602 can take the classification estimation value output by the text classification model as the category of the text to be classified.
The text classification system can classify texts using a text classification model trained on a small number of labeled samples, which reduces the sample demand in the preparation process before classification while ensuring accuracy, and improves training efficiency; when labeled samples are scarce, the accuracy of text classification can be improved.
A schematic diagram of one embodiment of the disclosed data processing system is shown in fig. 7. The data processing system includes a memory 701 and a processor 702. Wherein: the memory 701 may be a magnetic disk, flash memory, or any other non-volatile storage medium. The memory is for storing instructions in the text classification model training method or the corresponding embodiment of the text classification method above. Processor 702 is coupled to memory 701 and may be implemented as one or more integrated circuits, such as a microprocessor or microcontroller. The processor 702 is configured to execute instructions stored in the memory, so that the sample requirement during preparation before classification can be reduced, and the training efficiency can be improved; under the condition of less labeled samples, the accuracy of text classification can be improved.
In one embodiment, as also shown in FIG. 8, data processing system 800 includes a memory 801 and a processor 802. The processor 802 is coupled to the memory 801 by a BUS 803. The data processing system 800 may also be coupled to external storage 805 via storage interface 804 to facilitate retrieval of external data, and may also be coupled to a network or another computer system (not shown) via network interface 806. And will not be described in detail herein.
In this embodiment, the data and instructions are stored in the memory and executed by the processor, which reduces the number of samples required in the preparation stage before classification, improves training efficiency, and improves the accuracy of text classification when labeled samples are scarce.
In another embodiment, a computer-readable storage medium has computer program instructions stored thereon which, when executed by a processor, implement the steps of the text classification model training method or the text classification method in the corresponding embodiments above. As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, an apparatus, or a computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Thus far, the present disclosure has been described in detail. Some details that are well known in the art have not been described in order to avoid obscuring the concepts of the present disclosure. It will be fully apparent to those skilled in the art from the foregoing description how to practice the presently disclosed embodiments.
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
Finally, it should be noted that: the above examples are intended only to illustrate the technical solutions of the present disclosure and not to limit them; although the present disclosure has been described in detail with reference to preferred embodiments, those of ordinary skill in the art will understand that: modifications to the specific embodiments of the disclosure or equivalent substitutions for parts of the technical features may still be made; all such modifications are intended to be included within the scope of the claims of this disclosure without departing from the spirit thereof.

Claims (19)

1. A text classification model training method, comprising, in each round of training:
determining a classification estimation value of each sample in an unlabeled sample set based on a text classification model to be trained, and acquiring an estimated labeled sample set;
obtaining a vector of the text of the labeled sample set and a vector of the text of the estimated labeled sample set by an encoder to be trained;
acquiring a mixed labeled sample set according to the vector of the text of each labeled sample in the labeled sample set and the category identification of the labeled sample; acquiring a mixed sample set according to the mixed labeled sample set, the vector of the text of the sample of the estimated labeled sample set, the classification estimation value and the estimated mixing coefficient;
inputting the mixed labeled sample set and the mixed sample set into a feedforward neural network, and adjusting parameters of the encoder to be trained, the text classification model to be trained and the feedforward neural network according to a loss value obtained based on a loss function;
and when the number of training rounds reaches the preset number, acquiring a text classification model.
2. The method of claim 1, further comprising:
before the sample set is input into the encoder to be trained, the labeled sample set is expanded according to the labeled sample until the sample amount in the labeled sample set is equal to the sample amount in the unlabeled sample set.
3. The method of claim 1, wherein each labeled sample in the set of labeled samples comprises text of an original sample, text of an enhanced sample of the original sample, and a category identification of the text of the original sample.
4. The method of claim 3, wherein,
the enhanced samples of the original sample comprise at least one of a first enhanced sample or a second enhanced sample;
the text of the first enhanced sample is generated by carrying out synonym replacement on the text of the original sample;
the text of the second enhanced sample is generated by translating the text of the original sample into a second language and then translating the second language back to the original language.
5. The method of claim 3 or 4, further comprising:
generating the labeled sample set in advance according to the original sample of the labeled category.
6. The method of claim 2, wherein,
the obtaining, by an encoder to be trained, a vector of text of the labeled sample set, and the estimating a vector of text of the labeled sample set includes: inputting texts of samples in the labeled sample set and the estimated labeled sample set into the encoder to be trained in batches by taking a preset batch size as a unit, and acquiring vectors of the texts of the labeled sample set and the estimated labeled sample set of each batch;
the obtaining a mixed labeled sample set and the obtaining a mixed sample set comprises: acquiring the mixed labeled sample set of each batch and a mixed sample set of a corresponding batch;
adjusting parameters of the encoder to be trained, the text classification model to be trained, and the feed-forward neural network includes: and respectively inputting the mixed labeled sample set and the mixed sample set of each batch into a feedforward neural network, and adjusting parameters of the encoder to be trained, the text classification model to be trained and the feedforward neural network according to a loss value obtained based on a loss function until the mixed labeled sample set and the mixed sample set of all batches in the current training round are processed.
7. The method of claim 6, further comprising:
after the labeled sample set is expanded, according to the preset batch size, sequentially extracting a text of an original sample, a text of an enhanced sample and a text of an estimated labeled sample in the expanded labeled sample set and the estimated labeled sample set respectively, wherein the labeled sample comprises the text of the original sample and the text of the enhanced sample of the original sample;
generating a text vector to be coded according to the text of the original sample, the text of the enhanced sample and the text of the estimated labeled sample, wherein the text vector to be coded comprises original sample dimensions, enhanced sample dimensions and estimated labeled sample dimensions, and the number of the sample texts in each dimension accords with the preset batch size;
cutting out the text in the text vector to be coded according to the upper limit of the preset text length;
the obtaining the vector of the text of the labeled sample set and the vector of the text of the estimated labeled sample set for each batch comprises:
inputting the cut text vectors to be coded into the coder to be trained, and acquiring the text coding vectors of the current batch;
extracting elements of the original sample dimension and the enhanced sample dimension in the text coding vector to obtain a vector of the labeled sample set;
and extracting elements of the dimensionality of the estimation labeling sample in the text coding vector to obtain the vector of the estimation labeling sample set.
8. The method of claim 1, wherein the obtaining a mixed labeled sample set according to the vector of the text of each labeled sample in the labeled sample set and the category identifier of the labeled sample comprises:
acquiring mixed marked sample codes according to codes corresponding to original samples and codes corresponding to enhanced samples in the vectors of the texts of each marked sample of the marked sample set and enhanced mixed coefficients;
obtaining a vector of a mixed labeled sample set according to the mixed labeled sample code and the code corresponding to the original sample;
and acquiring the mixed labeled sample set according to the vector of the mixed labeled sample set and the corresponding category identification of the original sample.
9. The method of claim 8, wherein the obtaining mixed labeled sample encodings comprises:
and taking the enhanced mixed coefficient as the weight of the code corresponding to the enhanced sample, and adding the weight and the code corresponding to the original sample to obtain the mixed marked sample code.
10. The method of claim 1, wherein said obtaining a mixed sample set from the vectors of text of the samples of the mixed labeled sample set, the estimated labeled sample set, and the classification estimates, and estimated mixing coefficients comprises:
acquiring a coding estimation labeling sample set according to the vector of the text of the estimation labeling sample set and the category identification of the estimation labeling sample;
estimating, for each sample in a vector of an annotated sample set, for the encoding:
respectively and randomly extracting a sample in the mixed labeled sample set;
and respectively calculating the weighted sum of the vector of the text and the class identifier by taking a first estimation mixing coefficient as the weight of the samples in the coding estimation labeling sample set and a second estimation mixing coefficient as the weight of the extracted samples in the mixed labeled sample set, and acquiring the samples of the mixed sample set, wherein the sum of the first estimation mixing coefficient and the second estimation mixing coefficient is 1.
11. The method of claim 10, further comprising:
after each round of training is completed, the first estimated mixing coefficient is increased by a predetermined ratio.
12. The method of claim 1, wherein the inputting the mixed labeled sample set and the mixed sample set into a feed-forward neural network and adjusting parameters of the encoder to be trained, the text classification model to be trained, and the feed-forward neural network according to a loss value obtained based on a loss function comprises:
inputting the mixed labeled sample set and the mixed sample set into a feedforward neural network, and outputting a processing result through a full connection layer;
and inputting the processing result into a loss function to obtain the loss value.
13. The method of claim 12, wherein the inputting the processing result into a loss function, the obtaining the loss value comprises:
acquiring a cross entropy loss according to a processing result based on the mixed labeled sample set, as a first loss value;
acquiring a mean square error loss according to a processing result based on the mixed sample set, as a second loss value;
and acquiring a weighted value of the first loss value and the second loss value according to a preset loss value weight to serve as the loss value.
14. The method of claim 1, wherein the category identification comprises an emotion category identification.
15. A method of text classification, comprising:
inputting texts to be classified into a text classification model, wherein the text classification model is generated by training according to the text classification model training method of any one of claims 1-14;
and taking the classification estimation value output by the text classification model as the class of the text to be classified.
16. A text classification model training system, comprising:
the estimated sample set obtaining unit is configured to determine a classification estimation value of each sample in an unlabeled sample set based on a text classification model to be trained in each round of training, and obtain an estimated labeled sample set;
a vector obtaining unit configured to obtain, by an encoder to be trained, a vector of a text of a labeled sample set and a vector of the text of the estimated labeled sample set;
the mixing unit is configured to obtain a mixed labeled sample set according to the vector of the text of each labeled sample in the labeled sample set and the category identification of the labeled sample; acquiring a mixed sample set according to the mixed labeled sample set, the vector of the text of the sample of the estimated labeled sample set, the classification estimation value and the estimated mixing coefficient;
a parameter adjusting unit configured to input the mixed labeled sample set and the mixed sample set into a feedforward neural network, and adjust parameters of the encoder to be trained, the text classification model to be trained, and the feedforward neural network according to a loss value obtained based on a loss function;
a model obtaining unit configured to obtain the text classification model when the number of training rounds reaches a predetermined number of times.
17. A text classification system comprising:
the text input unit is configured to input texts to be classified into a text classification model, wherein the text classification model is generated by training according to the text classification model training method of any one of claims 1-14;
and the class determining unit is configured to take the classification estimation value output by the text classification model as the class of the text to be classified.
18. A data processing system comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the method of any of claims 1-15 based on instructions stored in the memory.
19. A computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 15.
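A minimal sketch of the mixing recited in claims 8-10 and the loss combination recited in claims 12-13, under the assumption that the first and second estimated mixing coefficients sum to 1; the array shapes and all function names below are illustrative assumptions rather than the claimed implementation.

```python
import numpy as np

def mix_labeled(orig_vec, enh_vec, orig_label_onehot, alpha):
    """Claims 8-9 (sketch): weight the enhanced-sample encoding by the
    enhancement mixing coefficient alpha and add it to the original-sample
    encoding; the mixed labeled sample keeps the original sample's category."""
    return orig_vec + alpha * enh_vec, orig_label_onehot

def mix_estimated(est_vec, est_label, mixed_vecs, mixed_labels, lam, rng):
    """Claim 10 (sketch): combine an encoded estimated-labeled sample with a
    randomly drawn mixed labeled sample, weighting them by the first and
    second estimated mixing coefficients (read here as lam and 1 - lam)."""
    j = rng.integers(len(mixed_vecs))
    vec = lam * est_vec + (1.0 - lam) * mixed_vecs[j]
    lab = lam * est_label + (1.0 - lam) * mixed_labels[j]
    return vec, lab

def total_loss(labeled_logits, labeled_targets, mixed_preds, mixed_targets, w):
    """Claims 12-13 (sketch): cross entropy on the mixed labeled sample set
    plus mean squared error on the mixed sample set, combined with a preset
    loss-value weight w (one possible weighting; the claims only fix that a
    preset weight is used)."""
    log_probs = labeled_logits - np.log(np.exp(labeled_logits).sum(-1, keepdims=True))
    cross_entropy = -(labeled_targets * log_probs).sum(-1).mean()
    mean_squared_error = ((mixed_preds - mixed_targets) ** 2).mean()
    return cross_entropy + w * mean_squared_error

# Usage sketch: rng = np.random.default_rng(0); lam from lambda_for_round(i).
```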
CN202110494682.6A 2021-05-07 2021-05-07 Text classification model training and classifying method and system and data processing system Active CN113177119B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110494682.6A CN113177119B (en) 2021-05-07 2021-05-07 Text classification model training and classifying method and system and data processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110494682.6A CN113177119B (en) 2021-05-07 2021-05-07 Text classification model training and classifying method and system and data processing system

Publications (2)

Publication Number Publication Date
CN113177119A true CN113177119A (en) 2021-07-27
CN113177119B CN113177119B (en) 2024-02-02

Family

ID=76928276

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110494682.6A Active CN113177119B (en) 2021-05-07 2021-05-07 Text classification model training and classifying method and system and data processing system

Country Status (1)

Country Link
CN (1) CN113177119B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642659A (en) * 2021-08-19 2021-11-12 上海商汤科技开发有限公司 Training sample set generation method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111522958A (en) * 2020-05-28 2020-08-11 泰康保险集团股份有限公司 Text classification method and device
CN111723209A (en) * 2020-06-28 2020-09-29 上海携旅信息技术有限公司 Semi-supervised text classification model training method, text classification method, system, device and medium
CN111966831A (en) * 2020-08-18 2020-11-20 创新奇智(上海)科技有限公司 Model training method, text classification device and network model
CN112214605A (en) * 2020-11-05 2021-01-12 腾讯科技(深圳)有限公司 Text classification method and related device
WO2021008037A1 (en) * 2019-07-15 2021-01-21 平安科技(深圳)有限公司 A-bilstm neural network-based text classification method, storage medium, and computer device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021008037A1 (en) * 2019-07-15 2021-01-21 平安科技(深圳)有限公司 A-bilstm neural network-based text classification method, storage medium, and computer device
CN111522958A (en) * 2020-05-28 2020-08-11 泰康保险集团股份有限公司 Text classification method and device
CN111723209A (en) * 2020-06-28 2020-09-29 上海携旅信息技术有限公司 Semi-supervised text classification model training method, text classification method, system, device and medium
CN111966831A (en) * 2020-08-18 2020-11-20 创新奇智(上海)科技有限公司 Model training method, text classification device and network model
CN112214605A (en) * 2020-11-05 2021-01-12 腾讯科技(深圳)有限公司 Text classification method and related device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
宋建国 (SONG Jianguo), "基于半监督与词向量加权的文本分类研究" ("Research on text classification based on semi-supervision and weighted word vectors"), 软件导刊 (Software Guide), no. 09 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642659A (en) * 2021-08-19 2021-11-12 上海商汤科技开发有限公司 Training sample set generation method and device, electronic equipment and storage medium
WO2023019908A1 (en) * 2021-08-19 2023-02-23 上海商汤智能科技有限公司 Method and apparatus for generating training sample set, and electronic device, storage medium and program

Also Published As

Publication number Publication date
CN113177119B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
CN110348214B (en) Method and system for detecting malicious codes
CN110362819B (en) Text emotion analysis method based on convolutional neural network
US20170308526A1 (en) Compcuter Implemented machine translation apparatus and machine translation method
CN111339305A (en) Text classification method and device, electronic equipment and storage medium
CN111680494A (en) Similar text generation method and device
CN110276071A (en) A kind of text matching technique, device, computer equipment and storage medium
CN111506709B (en) Entity linking method and device, electronic equipment and storage medium
CN110059183A (en) A kind of automobile industry User Perspective sensibility classification method based on big data
CN113408287B (en) Entity identification method and device, electronic equipment and storage medium
CN112329482A (en) Machine translation method, device, electronic equipment and readable storage medium
CN113177119A (en) Text classification model training and classifying method and system and data processing system
CN111680529A (en) Machine translation algorithm and device based on layer aggregation
CN115080750A (en) Weak supervision text classification method, system and device based on fusion prompt sequence
CN113553847A (en) Method, device, system and storage medium for parsing address text
CN112906403B (en) Semantic analysis model training method and device, terminal equipment and storage medium
CN111325033A (en) Entity identification method, entity identification device, electronic equipment and computer readable storage medium
Latif et al. Can large language models aid in annotating speech emotional data? uncovering new frontiers
CN109753646B (en) Article attribute identification method and electronic equipment
CN116186562B (en) Encoder-based long text matching method
CN108475265B (en) Method and device for acquiring unknown words
CN112749530B (en) Text encoding method, apparatus, device and computer readable storage medium
CN113157914B (en) Document abstract extraction method and system based on multilayer recurrent neural network
CN113935387A (en) Text similarity determination method and device and computer readable storage medium
CN117271778B (en) Insurance outbound session information output method and device based on generation type large model
CN116227496B (en) Deep learning-based electric public opinion entity relation extraction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant