CN115269824A - Method and device for classifying texts

Info

Publication number: CN115269824A
Application number: CN202110483872.8A
Authority: CN (China)
Prior art keywords: text, text data, enhanced, score, data
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 尹丁艺, 赵龙刚, 钱兵, 王峰
Current and original assignee: China Telecom Corp Ltd
Application filed by China Telecom Corp Ltd
Priority to CN202110483872.8A
Publication of CN115269824A

Classifications

    • G06F16/353 Information retrieval of unstructured textual data; Clustering; Classification into predefined classes
    • G06F16/313 Information retrieval of unstructured textual data; Indexing; Selection or weighting of terms for indexing
    • G06F16/355 Information retrieval of unstructured textual data; Clustering; Classification; Class or cluster creation or modification


Abstract

The present disclosure relates to a method and apparatus for classifying text. A method of training a text classification model using text data is provided, wherein the text data comprises unlabeled text data. The method comprises: enhancing the unlabeled text data to obtain a plurality of enhanced text data; calculating a composite score for each of the plurality of enhanced text data using an LDA topic extraction model, based on the unlabeled text data and the plurality of enhanced text data; screening high-quality enhanced text data from the plurality of enhanced text data according to the composite-score ranking; and training the text classification model with a loss function based on the screened enhanced text data and the corresponding unlabeled text data.

Description

Method and device for classifying texts
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly, to a method and apparatus for classifying texts.
Background
For the few-sample text classification problem in the field of artificial intelligence, current mainstream solutions include meta-learning, few-shot learning, semi-supervised learning, and similar methods.
Meta-learning and few-shot learning require a large amount of data similar to the classification target, impose stringent requirements on the pre-existing data, and so far have shown limited effect in industrial applications. Semi-supervised learning requires a large amount of unlabeled data for the target problem: supervised learning is performed by pseudo-labeling the unlabeled data, or each unlabeled sample is augmented by back-translation, EDA (Easy Data Augmentation), and similar techniques, and the semantic invariance of such augmentation is exploited for consistency training between samples, thereby strengthening the generalization ability of the model.
Unlabeled data is easy to collect, but if the collected unlabeled data is of poor quality, it cannot support supervised learning with few samples and may even degrade the final performance of the model. The quality of the unlabeled data and of its enhanced versions is therefore crucial to the final effect of the algorithm. How to screen the quality of enhanced data derived from unlabeled data is a problem that must be solved to guarantee the effect of a classification model.
Accordingly, there is a need for a method and apparatus for classifying text to be predicted that overcome the above-mentioned deficiencies of the prior art and solve this quality-screening problem.
Disclosure of Invention
The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. However, it should be understood that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical elements of the disclosure or to delineate the scope of the disclosure. Its sole purpose is to present some concepts of the disclosure in a simplified form as a prelude to the more detailed description that is presented later.
According to a first aspect of the present disclosure, there is provided a method of training a text classification model using text data, wherein the text data comprises unlabeled text data, the method comprising: enhancing the unlabeled text data to obtain a plurality of enhanced text data; calculating a composite score for each of the plurality of enhanced text data using an LDA topic extraction model, based on the unlabeled text data and the plurality of enhanced text data; screening high-quality enhanced text data from the plurality of enhanced text data according to the composite-score ranking; and training the text classification model with a loss function based on the screened enhanced text data and the corresponding unlabeled text data.
According to a second aspect of the present disclosure, there is provided a method for classifying a text to be predicted, comprising: classifying the text to be predicted using a text classification model trained according to the method of the first aspect.
According to a third aspect of the present disclosure, there is provided an apparatus for training a text classification model using text data, wherein the text data comprises unlabeled text data, the apparatus comprising: a text enhancement module configured to enhance the unlabeled text data to obtain a plurality of enhanced text data; a screening module configured to: calculate a composite score for each of the plurality of enhanced text data using an LDA topic extraction model based on the unlabeled text data and the plurality of enhanced text data, and screen high-quality enhanced text data from the plurality of enhanced text data according to the composite-score ranking; and a training module configured to train the text classification model using a loss function based on the screened enhanced text data and the corresponding unlabeled text data.
According to a fourth aspect of the present disclosure, there is provided an apparatus for classifying text to be predicted, the apparatus being configured to: classify the text to be predicted using a text classification model trained according to the method of the first aspect.
According to a fifth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having a program stored thereon, wherein the program, when executed by a computer, causes the computer to perform the method according to the first aspect.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having a program stored thereon, wherein the program, when executed by a computer, causes the computer to perform the method according to the second aspect.
According to a seventh aspect of the present disclosure, there is provided an apparatus for training a text classification model using text data, comprising a memory and a processor, the memory being communicatively coupled to the processor, the memory having stored therein a program which, when executed by the processor, causes the processor to perform the method according to the first aspect.
According to an eighth aspect of the present disclosure, there is provided an apparatus for classifying text to be predicted, comprising a memory and a processor, the memory being communicatively coupled to the processor, the memory having stored therein a program which, when executed by the processor, causes the processor to perform the method according to the second aspect.
According to a ninth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to the first aspect.
According to a tenth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to the second aspect.
With the method and apparatus provided by the present disclosure, the topic distributions of unlabeled texts and of their enhanced versions are extracted with an unsupervised clustering algorithm, the LDA (Latent Dirichlet Allocation) model; the quality of the text enhancement is controlled along two dimensions, the quality of the text itself and the semantic similarity between the unlabeled text and its newly generated text; high-quality enhanced data is screened out as trainable enhancement data for the unlabeled data; and semi-supervised text classification is performed by combining the semantic-consistency loss between the enhanced data and the unlabeled text with the cross-entropy loss of the labeled text.
Other features of the present disclosure and advantages thereof will become more apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description with reference to the accompanying drawings, in which:
FIG. 1 shows a schematic diagram of a method for training a text classification model according to an embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of a method for training a text classification model according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a method for data screening according to an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of a method for calculating a composite score, according to an embodiment of the present disclosure;
FIG. 5 shows a flow diagram of a method for training a text classification model using text data in accordance with an embodiment of the present disclosure;
FIG. 6 illustrates an exemplary configuration of a computing device in which embodiments in accordance with the present disclosure may be implemented.
Detailed Description
The following detailed description is made with reference to the accompanying drawings and is provided to assist in a comprehensive understanding of various exemplary embodiments of the disclosure. The following description includes various details to aid understanding, but these details are to be regarded as examples only and are not intended to limit the disclosure, which is defined by the appended claims and their equivalents. The words and phrases used in the following description are used only to provide a clear and consistent understanding of the disclosure. In addition, descriptions of well-known structures, functions, and configurations may be omitted for clarity and conciseness. Those of ordinary skill in the art will recognize that various changes and modifications of the examples described herein can be made without departing from the spirit and scope of the disclosure.
The work order system is an important means of network monitoring and operation-and-maintenance for telecom operators. Work order processing involves multiple links, some of which (such as order suspension and receipt verification) are performed in a semi-automated fashion (e.g., with manual confirmation). At present, automated and intelligent methods are gradually replacing this manual work. However, because network operation and maintenance work orders belong to a niche domain with many sub-fields, they suffer from insufficient sample sizes, pre-trained models that cannot be used directly (poor generalization), and traditional data augmentation that cannot guarantee data quality and semantic relevance. This can make the intelligent processing of network operation and maintenance work orders (i.e., the classification of work order text) less than ideal.
Work order text is employed in this disclosure as a non-limiting example of textual data. It should be understood that the text data is not limited thereto.
Fig. 1 and 2 show schematic diagrams of methods for training a text classification model according to embodiments of the present disclosure.
As shown in fig. 1 and 2, the historical work order text data may include tagged data and untagged data.
For labeled data, the model may be trained by computing a cross-entropy loss function between the labels predicted by the model and the true labels of the labeled data.
For unlabeled data, text augmentation can be performed by methods such as EDA (Easy Data Augmentation) and back-translation; high-quality unlabeled enhanced data is screened out by a data quality screening module; and the model is trained by computing a consistency loss (KL divergence) function between the unlabeled data and its high-quality enhanced data, which greatly improves the generalization ability of the model.
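By way of non-limiting illustration, the following Python (PyTorch) sketch shows such a KL-divergence consistency loss between the model's predictions on a batch of unlabeled texts and on their screened enhanced versions; the model is assumed to return classification logits, and all names are illustrative, not taken from this disclosure.

    import torch
    import torch.nn.functional as F

    def consistency_loss(model, unlabeled_batch, enhanced_batch):
        # Predictions on the original unlabeled texts serve as the fixed target
        with torch.no_grad():
            target = F.softmax(model(unlabeled_batch), dim=-1)
        # Predictions on the screened, high-quality enhanced texts
        log_pred = F.log_softmax(model(enhanced_batch), dim=-1)
        # KL(target || prediction), averaged over the batch
        return F.kl_div(log_pred, target, reduction="batchmean")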
The text classification models usable in the present disclosure include classic text classification models such as TextCNN, LSTM, and BERT, and are not limited to the specific models listed above. When the data volume is small, an algorithm with low model complexity, such as TextCNN, can be adopted; for medium data volumes, models such as LSTM can be selected; when the data volume is large, a pre-trained model such as BERT can be adopted.
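By way of non-limiting illustration, this model-selection heuristic can be sketched as follows; the data-volume thresholds are assumptions chosen only for illustration and are not specified in this disclosure.

    def pick_classifier(num_samples: int) -> str:
        # Thresholds are illustrative assumptions, not values from this disclosure
        if num_samples < 10_000:
            return "TextCNN"  # small data: low model complexity
        if num_samples < 100_000:
            return "LSTM"     # medium data volume
        return "BERT"         # large data: pre-trained model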
Fig. 3 shows a schematic diagram of a method for data screening according to an embodiment of the present disclosure.
As shown in fig. 3, the data screening may include text quality screening and text coherence screening. Text quality screening may include calculating a topic clarity score, a text generation probability score, and a text word distribution score, and then calculating a text quality score based on these three scores. Text coherence screening may include calculating a text coherence score based on the topic probability distribution of the unlabeled text data and the topic probability distribution of the enhanced text data.
Fig. 4 shows a schematic diagram of a method for calculating a composite score according to an embodiment of the present disclosure.
As shown in fig. 4, the unlabeled text data can be enhanced (expanded) using methods such as EDA (Easy Data Augmentation) and back-translation; the enhanced text data may then be further processed, for example by word segmentation; and the processed enhanced text data may be fed into an LDA topic extraction model to extract its topic probability distribution. After the topic probability distribution is extracted, a text enhancement quality score (quality_score) and a text coherence score (coherence_score) may be calculated to arrive at the final composite score.
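By way of non-limiting illustration, the enhancement step can be sketched in Python as follows, applying two of the standard EDA operations (synonym replacement and random swap) to a segmented text; the synonym dictionary and parameter values are illustrative assumptions.

    import random

    def eda_candidates(tokens, synonyms, n_aug=5, p=0.1):
        """Generate augmentation candidates for one segmented text using two
        standard EDA operations; `synonyms` maps a word to substitute words."""
        candidates = []
        for _ in range(n_aug):
            aug = list(tokens)
            # Synonym replacement: swap each word for a synonym with probability p
            for i, w in enumerate(aug):
                if w in synonyms and random.random() < p:
                    aug[i] = random.choice(synonyms[w])
            # Random swap: exchange two positions with probability p
            if len(aug) > 1 and random.random() < p:
                i, j = random.sample(range(len(aug)), 2)
                aug[i], aug[j] = aug[j], aug[i]
            candidates.append(aug)
        return candidates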
LDA is an unsupervised machine learning model that can be used to identify latent topic information in a corpus. The input to the model may be a document corpus consisting of a plurality of documents (i.e., a plurality of work orders, by way of example and not limitation), and its output may include:
1. the topic distribution of each document in the corpus, i.e., a document-topic probability distribution (θ1, θ2, …, θk), where k is the number of topics given in the model;
2. the word probability distribution of each topic, i.e., a topic-word probability distribution (φi1, φi2, …, φin) for each topic i, where k is the number of topics given in the model and n is the number of distinct words contained in the corpus;
3. the topic z_ik assigned, according to the probability distributions, to each word of each document.
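By way of non-limiting illustration, the following Python sketch shows how such topic extraction could be performed with the gensim library; it assumes the documents have already been segmented into word lists, and all names are illustrative.

    from gensim import corpora
    from gensim.models import LdaModel

    def train_lda(tokenized_docs, num_topics=20):
        dictionary = corpora.Dictionary(tokenized_docs)
        bow = [dictionary.doc2bow(doc) for doc in tokenized_docs]
        lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=num_topics)
        # Document-topic distributions (theta_1 ... theta_k for each document)
        theta = [lda.get_document_topics(d, minimum_probability=0.0) for d in bow]
        # Topic-word probability matrix phi, shape (num_topics, vocabulary size)
        phi = lda.get_topics()
        return lda, dictionary, theta, phi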
For text coherence screening, a text coherence score may be calculated, including the following steps:
for texts d1 and d2 (where d1 is the unlabeled text data and d2 is an enhanced text obtained by enhancing d1), extract their topic probability distribution vectors with the LDA topic extraction model as document representations:
d1 = (θ11, θ12, …, θ1k), d2 = (θ21, θ22, …, θ2k);
compute the cosine similarity of the two topic vectors to obtain the topic similarity of d1 and d2:
cos(d1, d2) = (d1 · d2) / (‖d1‖ ‖d2‖);
normalize the cosine similarity to obtain the text coherence score:
D_coherence_score = -0.5 * cos(d1, d2) + 0.5.
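By way of non-limiting illustration, the coherence computation above may be sketched in Python with NumPy as follows; the normalization formula is reproduced exactly as given above.

    import numpy as np

    def coherence_score(theta1, theta2):
        # Cosine similarity of the two topic vectors d1 and d2
        cos = np.dot(theta1, theta2) / (
            np.linalg.norm(theta1) * np.linalg.norm(theta2))
        # Normalization as given in the text above
        return -0.5 * cos + 0.5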
for text quality screening, it may include calculating a text specificity score, a text generation probability score, and a text word distribution score, respectively. Wherein the text specificity score describes a degree of specificity of a subject of the enhanced text data, the text generation probability score describes a degree of grammatical correctness of the enhanced text data, and the text word distribution score describes a degree of similarity of the enhanced text data to the numerous text data.
Calculating the topic clarity score may comprise the following steps:
for an enhanced text di, extract with the LDA topic extraction model the following topic probability distribution vector as the representation of di:
di = (θi1, θi2, …, θik);
compute the topic entropy of text di and invert it:
H(di) = -Σ_{j=1..k} θij * log θij,
D_topic(di) = 1 / H(di);
normalize to obtain the topic clarity score:
D_clarity_score(di) = (D_topic(di) - min(D_topic)) / (max(D_topic) - min(D_topic)),
where max(D_topic) is the maximum of the inverted topic entropies over all texts and min(D_topic) is the minimum of the inverted topic entropies over all texts.
The topic clarity score represents, through the (normalized) entropy of the extracted text topic vector, how well-defined the topic of the text is: the larger the score, the more clearly defined the topic of the text.
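By way of non-limiting illustration, a Python sketch of the topic clarity score follows; reading "invert the topic entropy" as taking its reciprocal, and min-max normalizing across all enhanced texts, follow the definitions above.

    import numpy as np

    def topic_clarity_scores(thetas, eps=1e-12):
        thetas = np.asarray(thetas, dtype=float)
        # Topic entropy of each enhanced text's topic distribution
        entropy = -np.sum(thetas * np.log(thetas + eps), axis=1)
        # Invert the entropy: a sharper topic gives lower entropy, higher score
        d_topic = 1.0 / (entropy + eps)
        # Min-max normalize across all enhanced texts
        return (d_topic - d_topic.min()) / (d_topic.max() - d_topic.min() + eps)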
Calculating the text generation probability score may include the following steps:
for an enhanced text di, compute the generation probability of the text:
P(di) = Π_{w ∈ di} θ_{i,z_w} * φ_{z_w,w},
where z_w is the topic assigned by the LDA model to word w in text di, θ_{i,z_w} is the corresponding document-topic probability, and φ_{z_w,w} is the corresponding topic-word probability;
take the logarithm of the generation probability of the text:
D_generate(di) = log P(di) = Σ_{w ∈ di} log(θ_{i,z_w} * φ_{z_w,w});
normalize the computed D_generate(di):
D_generate_score(di) = (D_generate(di) - min(D_generate)) / (max(D_generate) - min(D_generate)).
The generation process of the text is assumed to follow a bag-of-words model, in which different words are independent of each other, and the generation probability of a word in the text is the probability of the topic assigned to that word multiplied by the probability of generating the word under that topic. Thus the text generation probability score can be calculated by multiplying the document-topic probability and the topic-word probability. The larger the score, the more grammatically correct the text and the more likely it is to have been generated.
Calculating the text word distribution score may include the following steps:
for an enhanced text d, compute the generation probability of each word under the text's topic distribution:
p_w(d) = Σ_{j=1..k} θj * φ_{j,w},
which yields a word probability vector P_d = (p_{w1}, p_{w2}, …, p_{wn}) over the vocabulary;
compute the mass-text average probability vector over the corpus:
P̄ = (1 / |Corpus|) * Σ_{d ∈ Corpus} P_d,
where Corpus denotes the corpus;
compute two similarities between the enhanced text's probability vector and the mass-text average probability vector, namely 1 minus the JS divergence (1 - D_JS) and the Pearson correlation coefficient D_pearson_score:
D_JS(P_d, P̄) = 0.5 * KL(P_d ‖ M) + 0.5 * KL(P̄ ‖ M), with M = (P_d + P̄) / 2,
D_pearson_score(P_d, P̄) = cov(P_d, P̄) / (σ(P_d) * σ(P̄));
compute the text word distribution score:
D_word_score = β * D_pearson_score + (1 - β) * (1 - D_JS),
where β is a hyperparameter that adjusts the weights of the 1 - JS divergence and the Pearson correlation coefficient. The larger β, the larger the weight of the Pearson correlation coefficient and the smaller the weight of the 1 - JS divergence; the smaller β, the smaller the weight of the Pearson correlation coefficient and the larger the weight of the 1 - JS divergence. Depending on the amount of data, β may typically take values such as 0.2, 0.5, or 0.8.
The text word distribution score is intended to measure the similarity between an individual text's word generation probabilities and those of the mass text. The closer the enhanced text's probability vector is to the mass-text average probability vector, the larger the text word distribution score, indicating higher word-level text quality.
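By way of non-limiting illustration, a Python sketch of the text word distribution score follows; note that SciPy's jensenshannon function returns the JS distance (the square root of the divergence), so it is squared here to recover D_JS.

    import numpy as np
    from scipy.spatial.distance import jensenshannon
    from scipy.stats import pearsonr

    def word_distribution_score(p_text, p_corpus, beta=0.5):
        # SciPy returns the JS *distance* (sqrt of the divergence), so square it
        d_js = jensenshannon(p_text, p_corpus) ** 2
        # Pearson correlation between the two word probability vectors
        r, _ = pearsonr(p_text, p_corpus)
        return beta * r + (1 - beta) * (1 - d_js)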
The text quality score may be calculated as the average of the three scores:
D_quality_score = (D_clarity_score + D_generate_score + D_word_score) / 3.
It should be noted that the text quality score is not limited to an average of the topic clarity score, the text generation probability score, and the text word distribution score; the three scores may be weighted differently as needed.
After the text coherence score and the text quality score are calculated, the final composite score may be calculated as follows:
D_score = α * D_coherence_score + (1 - α) * D_quality_score,
where α is a hyperparameter that adjusts the weights of the text coherence score and the text quality score in the final composite score. The larger α, the higher the topic similarity of the screened texts; the smaller α, the higher the intrinsic quality of the screened texts themselves. Depending on the amount of data (from small to large), α may typically take values such as 0.7, 0.5, or 0.3.
The data screening may include the following steps:
train an LDA topic extraction model on historical work order data (labeled texts + unlabeled texts);
perform text enhancement on each unlabeled text to generate a plurality of enhanced texts, and calculate a composite score for each enhanced text:
D_score = α * D_coherence_score + (1 - α) * D_quality_score;
sort the enhanced texts by composite score in descending order, and select the k texts with the highest composite scores as the final enhanced texts of the unlabeled text.
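By way of non-limiting illustration, the screening step may be sketched in Python as follows; the candidates and their per-candidate coherence and quality scores are assumed to have been computed as described above.

    def screen_enhanced_texts(candidates, coherence, quality, alpha=0.5, k=3):
        # Composite score D_score = alpha * D_coherence + (1 - alpha) * D_quality
        scored = [(alpha * c + (1 - alpha) * q, cand)
                  for cand, c, q in zip(candidates, coherence, quality)]
        # Sort in descending order and keep the k highest-scoring enhanced texts
        scored.sort(key=lambda t: t[0], reverse=True)
        return [cand for _, cand in scored[:k]]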
FIG. 5 shows a flow diagram of a method for training a text classification model using text data in accordance with an embodiment of the present disclosure.
As shown in fig. 5, at S501, the unlabeled text data may be enhanced to obtain a plurality of enhanced text data;
at S502, a composite score for each enhanced text data of the plurality of enhanced text data may be calculated using an LDA topic extraction model based on the unlabeled text data and the plurality of enhanced text data;
at S503, enhanced text data with high quality may be filtered from the plurality of enhanced text data in order of the composite score; and
at S504, a text classification model may be trained using a loss function based on the filtered enhanced text data and corresponding unlabeled text data.
Fig. 6 illustrates an exemplary configuration of a computing device 600 capable of implementing embodiments in accordance with the present disclosure.
Computing device 600 is an example of a hardware device to which the above-described aspects of the disclosure can be applied. Computing device 600 may be any machine configured to perform processing and/or computation. Computing device 600 may be, but is not limited to, a workstation, a server, a desktop computer, a laptop computer, a tablet computer, a personal digital assistant (PDA), a smart phone, an in-vehicle computer, or a combination thereof.
As shown in fig. 6, computing device 600 may include one or more elements that may be connected to or in communication with bus 602 via one or more interfaces. The bus 602 may include, but is not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, a Peripheral Component Interconnect (PCI) bus, and the like. Computing device 600 may include, for example, one or more processors 604. The one or more processors 604 may be any kind of processor, and may include, but are not limited to, one or more general-purpose processors or special-purpose processors (such as special-purpose processing chips). The processor may for example be configured to implement the method as described hereinbefore.
The computing device 600 may also include Random Access Memory (RAM) 610 and Read Only Memory (ROM) 612. The ROM 612 may store programs, utilities or processes to be executed in a nonvolatile manner. The RAM 610 may provide volatile data storage and stores instructions related to the operation of the computing device 600.
The computing device 600 may also include or be connected to a non-transitory storage device 614, which may be any storage device that is non-transitory and capable of storing data, and may include, but is not limited to, a disk drive, an optical storage device, solid-state memory, a floppy disk, a flexible disk, a hard disk, a magnetic tape or any other magnetic medium, a compact disk or any other optical medium, cache memory and/or any other memory chip or module, and/or any other medium from which a computer can read data, instructions, and/or code.
In summary, according to a first aspect of the present disclosure, there is provided a method for training a text classification model using text data, wherein the text data includes unlabeled text data, the method including: enhancing the unlabeled text data to obtain a plurality of enhanced text data; calculating a composite score for each of the plurality of enhanced text data using an LDA topic extraction model, based on the unlabeled text data and the plurality of enhanced text data; screening high-quality enhanced text data from the plurality of enhanced text data according to the composite-score ranking; and training the text classification model with a loss function based on the screened enhanced text data and the corresponding unlabeled text data.
In an embodiment according to the present disclosure, calculating the composite score comprises: obtaining, with the LDA topic extraction model, the topic probability distribution of the enhanced text data and the topic probability distribution of the corresponding unlabeled text data; calculating a text coherence score based on the topic probability distribution of the unlabeled text data and the topic probability distribution of the enhanced text data; calculating a text enhancement quality score based on the topic probability distribution of the enhanced text data; and calculating the composite score based on the text coherence score and the text enhancement quality score.
In an embodiment according to the present disclosure, the text enhancement quality score comprises: a topic clarity score describing how clearly defined the topic of the enhanced text data is; a text generation probability score describing how grammatically correct the enhanced text data is; and a text word distribution score describing how similar the enhanced text data is to the mass text data.
In an embodiment according to the present disclosure, the text data further includes labeled text data, and the method further includes: training the text classification model using a loss function between the labels predicted by the text classification model and the actual labels of the labeled text data.
In an embodiment according to the present disclosure, the LDA topic extraction model is trained based on historical text data, where the historical text data includes labeled historical text data and unlabeled historical text data.
According to a second aspect of the present disclosure, there is provided a method for classifying a text to be predicted, including: classifying the text to be predicted using a text classification model trained according to the method of the first aspect.
According to a third aspect of the present disclosure, there is provided an apparatus for training a text classification model using text data, wherein the text data comprises unlabeled text data, the apparatus comprising: a text enhancement module configured to enhance the unlabeled text data to obtain a plurality of enhanced text data; a screening module configured to: calculate a composite score for each of the plurality of enhanced text data using an LDA topic extraction model based on the unlabeled text data and the plurality of enhanced text data, and screen high-quality enhanced text data from the plurality of enhanced text data according to the composite-score ranking; and a training module configured to train the text classification model using a loss function based on the screened enhanced text data and the corresponding unlabeled text data.
In an embodiment according to the present disclosure, calculating the composite score includes: obtaining, with the LDA topic extraction model, the topic probability distribution of the enhanced text data and the topic probability distribution of the corresponding unlabeled text data; calculating a text coherence score based on the topic probability distribution of the unlabeled text data and the topic probability distribution of the enhanced text data; calculating a text enhancement quality score based on the topic probability distribution of the enhanced text data; and calculating the composite score based on the text coherence score and the text enhancement quality score.
In an embodiment according to the present disclosure, the text enhancement quality score comprises: a topic clarity score describing how clearly defined the topic of the enhanced text data is; a text generation probability score describing how grammatically correct the enhanced text data is; and a text word distribution score describing how similar the enhanced text data is to the mass text data.
In an embodiment according to the disclosure, the text data further includes labeled text data, and the training module is further configured to: train the text classification model using a loss function between the labels predicted by the text classification model and the actual labels of the labeled text data.
In an embodiment according to the present disclosure, the LDA topic extraction model is trained based on historical text data, where the historical text data includes labeled historical text data and unlabeled historical text data.
According to a fourth aspect of the present disclosure, there is provided an apparatus for classifying text to be predicted, the apparatus being configured to: classify the text to be predicted using a text classification model trained according to the method of the first aspect.
According to a fifth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium having a program stored thereon, characterized in that, when the program is executed by a computer, the computer is caused to execute the method according to the first aspect.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium having a program stored thereon, wherein the program, when executed by a computer, causes the computer to perform the method according to the second aspect.
According to a seventh aspect of the present disclosure, there is provided an apparatus for training a text classification model with text data, comprising a memory and a processor, the memory being communicatively coupled with the processor, the memory having stored therein a program, which when executed by the processor, causes the processor to perform the method according to the first aspect.
According to an eighth aspect of the present disclosure, there is provided an apparatus for classifying text to be predicted, comprising a memory and a processor, the memory being communicatively coupled to the processor, the memory having stored therein a program which, when executed by the processor, causes the processor to perform the method according to the second aspect.
According to a ninth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to the first aspect.
According to a tenth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to the second aspect.
The method and apparatus according to the present disclosure perform semi-supervised text classification by combining data enhancement and a semantic-consistency loss on unlabeled text with the cross-entropy loss of labeled text; here, the text quality and semantic relevance of the enhanced data are crucial. The present disclosure extracts text topics with the unsupervised text clustering algorithm LDA and screens text quality from the two perspectives of text generation and semantics, thereby avoiding the new errors that, in traditional semi-supervised settings, arise from ungrammatical augmented sentences and drifting text semantics.
Specifically, for labeled text data, the text classification model is trained through the cross-entropy loss between the model's predicted labels and the true labels; for unlabeled text data, text enhancement is performed by means of EDA (Easy Data Augmentation), back-translation, and the like, the enhanced versions of each unlabeled text are scored and quality-screened by the data screening module, and the top k samples are selected as the final enhanced texts; the classification model is then trained through a consistency loss (KL divergence) function between the unlabeled text and its enhanced text, thereby strengthening the generalization ability of the model and improving the final effect of the algorithm.
If meta-learning or few-shot learning frameworks are adopted to classify few-sample texts, a large amount of richly varied, pre-existing labeled data is needed, and the closer that data is to the target data the better the effect, which places demanding requirements on the data. The present disclosure instead adopts a semi-supervised framework, which only needs a number of similar unlabeled samples. Such samples are readily available even in a niche area such as network operation and maintenance work orders.
It is worth noting that the present disclosure uses features extracted by the unsupervised LDA topic extraction model for text quality screening and text similarity calculation, which avoids the large errors a supervised model would incur in a small domain with few samples. If a supervised approach were used for text quality screening and text similarity calculation, a large amount of labeled data would be required for training. But the quality screening exists precisely to address the few-sample problem in a semi-supervised way; if a large amount of labeled data already existed, one would simply perform supervised classification and would not need unsupervised text quality screening. In other words, a chicken-and-egg problem would arise. The present disclosure adopts the unsupervised topic extraction model LDA, which performs reliable enhanced-text quality screening without any labeled data, thus neatly resolving this contradiction.
The method and apparatus according to the present disclosure may have the following advantages:
the present disclosure adopts a semi-supervised algorithm to solve the few-sample text classification problem, and through sample enhancement and consistency-loss learning it effectively handles scenarios with little labeled data but abundant unlabeled data;
by performing text quality screening based on topic distributions, quality is screened from the two angles of text generation quality and text semantic invariance, which better guarantees the quality of the screened enhanced data; moreover, because this topic-distribution-based screening represents texts with an unsupervised algorithm (LDA), the extra error noise that supervised text quality screening would introduce is eliminated;
the unsupervised topic extraction algorithm LDA is introduced into the field of few-sample text classification; by exploiting the fact that this algorithm requires no labeled samples and combining it with semi-supervision, scenarios with scarce labeled data in specialized fields can be handled well, and unsupervised (LDA), supervised (text classification), and semi-supervised methods are fused into one framework to address the scarcity of labeled text data in specialized fields.
The subject matter of the present disclosure is provided as examples of apparatus, systems, methods, and programs for performing the features described in the present disclosure. However, other features or variations are contemplated in addition to the features described above. It is contemplated that the implementation of the components and functions of the present disclosure may be accomplished with any emerging technology that may replace the technology of any of the implementations described above.
Additionally, the above description provides examples, and does not limit the scope, applicability, or configuration set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the spirit and scope of the disclosure. Various embodiments may omit, substitute, or add various procedures or components as appropriate. For example, features described with respect to certain embodiments may be combined in other embodiments.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

Claims (18)

1. A method of training a text classification model using text data, wherein the text data comprises unlabeled text data, the method comprising:
enhancing the unlabeled text data to obtain a plurality of enhanced text data;
calculating a composite score for each enhanced text data of the plurality of enhanced text data using an LDA topic extraction model based on the unlabeled text data and the plurality of enhanced text data;
screening enhanced text data of high quality from the plurality of enhanced text data according to the composite-score ranking; and
training the text classification model with a loss function based on the screened enhanced text data and the corresponding unlabeled text data.
2. The method of claim 1, wherein calculating a composite score comprises:
obtaining, with the LDA topic extraction model, the topic probability distribution of the enhanced text data and the topic probability distribution of the corresponding unlabeled text data;
calculating a text coherence score based on the topic probability distribution of the unlabeled text data and the topic probability distribution of the enhanced text data;
calculating a text enhancement quality score based on the topic probability distribution of the enhanced text data; and
calculating the composite score based on the text coherence score and the text enhancement quality score.
3. The method of claim 2, wherein the text enhancement quality score comprises:
a topic clarity score describing how clearly defined the topic of the enhanced text data is;
a text generation probability score describing how grammatically correct the enhanced text data is; and
a text word distribution score describing how similar the enhanced text data is to the mass text data.
4. The method of claim 1, wherein the text data further comprises tagged text data, the method further comprising:
training the text classification model using a loss function between the labels predicted by the text classification model and the actual labels of the labeled text data.
5. The method of claim 1, wherein said LDA topic extraction model is trained based on historical text data, said historical text data comprising labeled historical text data and unlabeled historical text data.
6. A method of classifying text to be predicted, comprising:
classifying the text to be predicted by using a text classification model trained according to the method of any one of claims 1-5.
7. An apparatus for training a text classification model using text data, wherein the text data comprises unlabeled text data, the apparatus comprising:
a text enhancement module configured to enhance the unlabeled text data to obtain a plurality of enhanced text data;
a screening module configured to:
based on the unlabeled text data and the plurality of enhanced text data, calculating a composite score for each of the plurality of enhanced text data using an LDA topic extraction model, and
screening enhanced text data of high quality from the plurality of enhanced text data according to the composite-score ranking; and
a training module configured to train the text classification model using a loss function based on the screened enhanced text data and corresponding unlabeled text data.
8. The apparatus of claim 7, wherein calculating a composite score comprises:
obtaining, with the LDA topic extraction model, the topic probability distribution of the enhanced text data and the topic probability distribution of the corresponding unlabeled text data;
calculating a text coherence score based on the topic probability distribution of the unlabeled text data and the topic probability distribution of the enhanced text data;
calculating a text enhancement quality score based on the topic probability distribution of the enhanced text data; and
calculating the composite score based on the text coherence score and the text enhancement quality score.
9. The apparatus of claim 8, wherein the text enhancement quality score comprises:
a topic clarity score describing how clearly defined the topic of the enhanced text data is;
a text generation probability score describing how grammatically correct the enhanced text data is; and
a text word distribution score describing how similar the enhanced text data is to the mass text data.
10. The apparatus of claim 7, wherein the text data further comprises tagged text data, the training module further configured to:
train the text classification model using a loss function between the labels predicted by the text classification model and the actual labels of the labeled text data.
11. The apparatus of claim 7, wherein said LDA topic extraction model is trained based on historical text data, said historical text data comprising labeled historical text data and unlabeled historical text data.
12. An apparatus for classifying text to be predicted, the apparatus configured to:
classifying the text to be predicted by using a text classification model trained according to the method of any one of claims 1-5.
13. A non-transitory computer-readable storage medium having a program stored thereon, wherein the program, when executed by a computer, causes the computer to perform the method according to any one of claims 1-5.
14. A non-transitory computer-readable storage medium on which a program is stored, characterized in that, when the program is executed by a computer, it causes the computer to execute the method according to claim 6.
15. An apparatus for training a text classification model using text data, comprising a memory and a processor, the memory communicatively coupled with the processor, the memory having stored therein a program that, when executed by the processor, causes the processor to perform the method of any of claims 1-5.
16. An apparatus for classifying text to be predicted, comprising a memory and a processor, the memory communicatively coupled with the processor, the memory having stored therein a program that, when executed by the processor, causes the processor to perform the method of claim 6.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-5.
18. A computer program product comprising a computer program which, when executed by a processor, carries out the method according to claim 6.
CN202110483872.8A, filed 2021-04-30: Method and device for classifying texts. Status: Pending (CN115269824A).

Priority Applications (1)

Application Number: CN202110483872.8A
Priority Date / Filing Date: 2021-04-30
Title: Method and device for classifying texts

Publications (1)

Publication Number: CN115269824A
Publication Date: 2022-11-01

Family

ID=83745847

Family Applications (1)

Application Number: CN202110483872.8A
Title: Method and device for classifying texts
Priority Date / Filing Date: 2021-04-30

Country Status (1)

Country: CN
Link: CN115269824A


Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination