CN112528030A - Semi-supervised learning method and system for text classification - Google Patents


Info

Publication number
CN112528030A
CN112528030A (application CN202110173996.6A)
Authority
CN
China
Prior art keywords
sample set
training
samples
module
semi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110173996.6A
Other languages
Chinese (zh)
Inventor
李越超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongguancun Smart City Co Ltd
Original Assignee
Zhongguancun Smart City Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongguancun Smart City Co Ltd filed Critical Zhongguancun Smart City Co Ltd
Priority to CN202110173996.6A priority Critical patent/CN112528030A/en
Publication of CN112528030A publication Critical patent/CN112528030A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; database structures therefor; file system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; classification
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques based on distances to training or reference patterns
    • G06F18/24147 - Distances to closest patterns, e.g. nearest-neighbour classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention provide a semi-supervised learning method and system for text classification, comprising the steps of: acquiring a sample set for a given task, the sample set comprising a labeled sample set and an unlabeled sample set; preprocessing the sample set; predicting and classifying the preprocessed unlabeled sample set to expand the sample set; and training a deep learning model on the expanded sample set. The method addresses the shortage of labeled data in supervised learning by learning the task from both unlabeled data and the labeled sample set, and addresses the absence of class labels by applying a clustering method from unsupervised learning. With only a small number of labels, the method greatly improves the efficiency of text classification, reduces labor cost, and improves accuracy over unsupervised classification.

Description

Semi-supervised learning method and system for text classification
Technical Field
The invention relates to the technical field of machine learning, and in particular to a semi-supervised learning method and system for text classification.
Background
At present, tasks in natural language processing generally include subtasks such as text classification, entity recognition, and sentiment recognition. The text classification task assigns a text to a specific label. Models are typically trained either with supervised methods, such as deep learning and classical machine learning, or with unsupervised methods, and classification is then performed with the trained model.
In many text classification scenarios, collecting a large labeled dataset requires substantial manual annotation, which is inefficient and expensive, or simply infeasible. When the labeled sample size is small, supervised training is commonly used; however, on a small labeled dataset a typical supervised learning algorithm easily overfits and cannot effectively characterize the data. Unsupervised learning is the usual choice when there is a large amount of sample data but no specific desired result or label, but it cannot provide reliable class information and cannot meet accuracy requirements in scenarios where sample diversity is limited.
In the text classification task considered here, only a small amount of labeled sample data exists and its categories are incomplete, while a large amount of unlabeled sample data exists that covers all categories. Neither supervised learning nor unsupervised learning alone can handle this scenario effectively.
Disclosure of Invention
Therefore, to solve the above technical problem, embodiments of the present invention provide a semi-supervised learning method and system for text classification, which address the shortage of labeled data in supervised learning by learning the task from both unlabeled data and the labeled sample set, and address the absence of class labels by applying a clustering method from unsupervised learning. With only a small number of labels, the method greatly improves the efficiency of text classification, reduces labor cost, and improves accuracy over unsupervised classification. The specific technical scheme is as follows:
in order to achieve the above object, an embodiment of the present invention provides a semi-supervised learning method for text classification, including the steps of:
acquiring a sample set for a given task, the sample set comprising a labeled sample set and an unlabeled sample set;
preprocessing the sample set;
predicting and classifying the preprocessed unlabeled sample set to expand the sample set;
and training a deep learning model on the expanded sample set.
Further, the predicting and classifying of the preprocessed unlabeled sample set and the expanding of the sample set specifically comprise the steps of:
pre-training the deep learning model on the labeled sample set to obtain a pre-trained model;
predicting a first part of the unlabeled samples with the pre-trained model, and setting a confidence threshold;
and comparing the prediction confidence for each sample in the first part with the threshold, adding the samples above the threshold to the labeled sample set to complete the expansion.
Further, the method also comprises: clustering a second part of the unlabeled samples with an unsupervised clustering algorithm, dividing the samples into a plurality of clusters, using the class corresponding to each cluster center as the class label of its samples, and using the labeled samples for model training.
Further, the deep learning model is a BERT model.
Further, the unsupervised clustering algorithm is a KNN algorithm.
A second aspect of an embodiment of the present invention provides a semi-supervised learning system for text classification, including:
an acquisition module for acquiring a sample set for a given task, the sample set comprising a labeled sample set and an unlabeled sample set;
a preprocessing module for preprocessing the sample set;
an expansion module for predicting and classifying the preprocessed unlabeled sample set and expanding the sample set;
and a training module for training a deep learning model on the expanded sample set.
Further, the expansion module comprises:
a pre-training module for pre-training the deep learning model on the labeled sample set to obtain a pre-trained model;
a prediction module for predicting a first part of the unlabeled samples with the pre-trained model and setting a confidence threshold;
and an expansion sub-module for comparing the prediction confidence for each sample in the first part with the threshold and adding the samples above the threshold to the labeled sample set to complete the expansion.
The system further comprises a clustering module for clustering a second part of the unlabeled samples with an unsupervised clustering algorithm, dividing the samples into a plurality of clusters, using the class corresponding to each cluster center as the class label of its samples, and using the labeled samples for model training.
A third aspect of the embodiments of the present invention provides a computer-readable storage medium on which a computer program is stored which, when executed by a processor, causes the processor to perform the steps of the semi-supervised learning method for text classification described above.
A fourth aspect of the present invention provides an electronic apparatus comprising:
a processor; and
a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the semi-supervised learning method for text classification described above.
The embodiment of the invention provides a semi-supervised learning method for text classification, comprising the steps of: acquiring a sample set for a given task, the sample set comprising a labeled sample set and an unlabeled sample set; preprocessing the sample set; predicting and classifying the preprocessed unlabeled sample set to expand the sample set; and training a deep learning model on the expanded sample set. The method addresses the shortage of labeled data in supervised learning by learning the task from both unlabeled data and the labeled sample set, and addresses the absence of class labels by applying a clustering method from unsupervised learning. With only a small number of labels, the method greatly improves the efficiency of text classification, reduces labor cost, and improves accuracy over unsupervised classification.
Drawings
Fig. 1 is a flowchart of a semi-supervised learning method for text classification according to embodiment 1 of the present invention;
fig. 2 is a block diagram schematically illustrating a structure of a semi-supervised learning system for text classification according to embodiment 2 of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to embodiment 3 of the present invention;
fig. 4 is a schematic structural diagram of a computer-readable storage medium according to embodiment 4 of the present invention;
in the figure: 31-a processor; 32-a memory; 33-storage space; 34-program code; 41-program code.
Detailed Description
To present the technical solution of the present invention clearly and thoroughly, the following description is made with reference to the accompanying drawings; the scope of the present invention, however, is not limited thereto.
Referring to Fig. 1, a flowchart of a semi-supervised learning method for text classification according to embodiment 1 of the present invention, the method includes the steps of:
acquiring a sample set for a given task;
preprocessing the sample set;
predicting and classifying the preprocessed unlabeled sample set to expand the sample set;
and training a deep learning model on the expanded sample set.
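The four steps above can be sketched end-to-end as follows. This is a minimal illustrative sketch, not the patent's implementation: `predict` is a hypothetical stand-in for the pre-trained classifier (the patent uses a BERT model), and the cleaning and threshold logic are toy versions.

```python
# Minimal sketch of the four-step pipeline, under stated assumptions:
# `predict` stands in for a pre-trained classifier returning (label, confidence).

def preprocess(texts):
    """Toy cleaning step: strip whitespace and lowercase each text."""
    return [t.strip().lower() for t in texts]

def expand(labeled, unlabeled, predict, threshold=0.9):
    """Add confidently predicted unlabeled samples to the labeled set."""
    expanded = list(labeled)
    for text in unlabeled:
        label, confidence = predict(text)
        if confidence > threshold:
            expanded.append((text, label))
    return expanded

def build_training_set(labeled, unlabeled, predict):
    texts = preprocess([t for t, _ in labeled])
    labels = [y for _, y in labeled]
    cleaned_labeled = list(zip(texts, labels))
    cleaned_unlabeled = preprocess(unlabeled)
    # In the full method, a deep model is then trained on the expanded set.
    return expand(cleaned_labeled, cleaned_unlabeled, predict)

# Usage with a trivial keyword-based stand-in for the classifier:
predict = lambda t: ("sport", 0.95) if "ball" in t else ("other", 0.5)
data = build_training_set([("Football news", "sport")],
                          ["  Ball game tonight ", "weather"], predict)
```

The confident unlabeled sample joins the training set; the low-confidence one is left out, matching the expansion step described above.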
The sample set comprises a labeled sample set and an unlabeled sample set, and the preprocessing comprises data cleaning of each labeled and unlabeled sample. For example, if a text classification model for a certain language (e.g., Chinese) is to be trained, words not in that language are deleted. A cleaning step such as stop-word filtering may also be performed: meaningless function words are collected in a preset stop list, and any word in the list is deleted wherever it appears in a sample. Note that this embodiment does not limit the specific implementation of data cleaning.
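A minimal sketch of the stop-word filtering step described above, assuming a small hypothetical stop list; a real pipeline would also need language filtering and a proper tokenizer, particularly for Chinese text:

```python
# Hypothetical preset stop list; the patent's examples are Chinese function words.
STOP_WORDS = {"the", "a", "of", "is"}

def remove_stop_words(tokens):
    """Delete any token that appears in the preset stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

cleaned = remove_stop_words("The price of the ticket is high".split())
# cleaned == ["price", "ticket", "high"]
```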
In an alternative implementation of the embodiment of the present invention, to facilitate the later processing of the unlabeled sample set, it is divided into two parts, a first part and a second part.
The construction of smart cities involves a large amount of text data to be processed, and applying these texts to downstream tasks such as classification suffers from the small amount of labeled data and the incomplete label categories, so the sample set needs to be expanded. The expansion comprises the steps of: performing back-translation on each unlabeled sample and taking the back-translation result as a corresponding data-expansion sample; or obtaining the keywords and non-keywords in each unlabeled sample with the TF-IDF algorithm, performing word replacement on the non-keywords in each unlabeled sample, and taking the word replacement result as a corresponding data-expansion sample.
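The TF-IDF-based expansion can be sketched as follows: words with low TF-IDF scores in a sample are treated as non-keywords and replaced, yielding a new sample. The scoring formula, the `keep_ratio` cutoff, and the synonym table are illustrative assumptions, not the patent's exact parameters:

```python
import math
from collections import Counter

def tf_idf_scores(doc, corpus):
    """TF-IDF of each word in `doc` against a corpus of token lists."""
    n = len(corpus)
    tf = Counter(doc)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for d in corpus if word in d)  # document frequency
        scores[word] = (count / len(doc)) * math.log((n + 1) / (df + 1))
    return scores

def replace_non_keywords(doc, corpus, synonyms, keep_ratio=0.5):
    """Keep the top-scoring words as keywords; replace the rest if a synonym exists."""
    scores = tf_idf_scores(doc, corpus)
    ranked = sorted(set(doc), key=lambda w: scores[w], reverse=True)
    keywords = set(ranked[: max(1, int(len(ranked) * keep_ratio))])
    return [w if w in keywords else synonyms.get(w, w) for w in doc]

corpus = [["city", "traffic", "data"], ["city", "weather"], ["smart", "city", "traffic"]]
augmented = replace_non_keywords(corpus[0], corpus, {"city": "town"})
# "city" occurs in every document, scores lowest, and is swapped for "town".
```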
In the embodiment of the present invention, the predicting and classifying of the preprocessed unlabeled sample set and the expanding of the sample set specifically include the steps of:
pre-training the deep learning model on the labeled sample set to obtain a pre-trained model;
predicting a first part of the unlabeled samples with the pre-trained model, and setting a confidence threshold;
and comparing the prediction confidence for each sample in the first part with the threshold, adding the samples above the threshold to the labeled sample set to complete the expansion.
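The threshold comparison in these steps can be sketched as follows; `predict_proba` is a hypothetical stand-in for the per-class probabilities produced by the pre-trained (e.g. BERT) model, here hard-coded as a toy:

```python
def pseudo_label(unlabeled, predict_proba, threshold=0.8):
    """Split unlabeled texts into confidently pseudo-labeled and remaining ones."""
    accepted, remaining = [], []
    for text in unlabeled:
        probs = predict_proba(text)             # e.g. {"spam": 0.9, "ham": 0.1}
        label = max(probs, key=probs.get)       # most probable class
        if probs[label] > threshold:
            accepted.append((text, label))      # added to the labeled set
        else:
            remaining.append(text)              # left for the clustering step
    return accepted, remaining

# Toy hard-coded "model" output for two texts:
toy = {"cheap flights now": {"spam": 0.9, "ham": 0.1},
       "see you tomorrow": {"spam": 0.55, "ham": 0.45}}
accepted, remaining = pseudo_label(list(toy), toy.get)
```

Only the text predicted with confidence above the threshold is added to the labeled set; the rest remain unlabeled.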
The deep learning model is a BERT model, and the unsupervised clustering algorithm is a KNN algorithm. The method uses the small labeled sample set for supervised learning, computes the model's confidence that a sample belongs to a given class, and then predicts the unlabeled data. Samples whose confidence exceeds a predefined threshold are selected from the unlabeled dataset, and their predictions are taken as labels when the samples are added to the training data. The remaining unlabeled samples are clustered with the KNN algorithm from unsupervised learning, whose core idea is that if the majority of the K samples nearest to a sample in feature space belong to a certain class, the sample is assigned to that class.
In an optional implementation of the embodiment of the present invention, the method further includes clustering a second part of the unlabeled samples with an unsupervised clustering algorithm, dividing the samples into a plurality of clusters, using the class corresponding to each cluster center as the class label of its samples, and using the labeled samples for model training.
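A minimal sketch of the nearest-neighbour voting idea behind this clustering step: each remaining unlabeled sample takes the majority label among its K nearest reference points in feature space. The 2-D toy vectors and squared-Euclidean distance are illustrative assumptions; in practice the features would come from the text encoder:

```python
from collections import Counter

def knn_label(vector, reference, k=3):
    """Majority label among the k nearest (vector, label) reference pairs."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(reference, key=lambda item: dist(vector, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

reference = [((0.0, 0.0), "A"), ((0.1, 0.0), "A"), ((0.2, 0.1), "A"),
             ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
label = knn_label((0.05, 0.05), reference, k=3)   # nearest three are all "A"
```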
The embodiment of the invention provides a semi-supervised learning method for text classification, comprising the steps of: acquiring a sample set for a given task, the sample set comprising a labeled sample set and an unlabeled sample set; preprocessing the sample set; predicting and classifying the preprocessed unlabeled sample set to expand the sample set; and training a deep learning model on the expanded sample set. The method addresses the shortage of labeled data in supervised learning by learning the task from both unlabeled data and the labeled sample set, and addresses the absence of class labels by applying a clustering method from unsupervised learning. With only a small number of labels, the method greatly improves the efficiency of text classification, reduces labor cost, and improves accuracy over unsupervised classification.
Fig. 2 is a block diagram schematically illustrating a structure of a semi-supervised learning system for text classification according to embodiment 2 of the present invention, including:
an acquisition module for acquiring a sample set for a given task, the sample set comprising a labeled sample set and an unlabeled sample set;
a preprocessing module for preprocessing the sample set;
an expansion module for predicting and classifying the preprocessed unlabeled sample set and expanding the sample set;
and a training module for training a deep learning model on the expanded sample set.
Further, the expansion module comprises:
a pre-training module for pre-training the deep learning model on the labeled sample set to obtain a pre-trained model;
a prediction module for predicting a first part of the unlabeled samples with the pre-trained model and setting a confidence threshold;
and an expansion sub-module for comparing the prediction confidence for each sample in the first part with the threshold and adding the samples above the threshold to the labeled sample set to complete the expansion.
The system further comprises a clustering module for clustering a second part of the unlabeled samples with an unsupervised clustering algorithm, dividing the samples into a plurality of clusters, using the class corresponding to each cluster center as the class label of its samples, and using the labeled samples for model training.
A third aspect of the embodiments of the present invention provides a computer-readable storage medium on which a computer program is stored which, when executed by a processor, causes the processor to perform the steps of the semi-supervised learning method for text classification described above.
A fourth aspect of the present invention provides an electronic apparatus comprising:
a processor; and
a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the semi-supervised learning method for text classification described above.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
It should be noted that:
the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It will be appreciated by those skilled in the art that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the apparatus according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
For example, Fig. 3 shows a schematic structural diagram of an electronic device according to an embodiment of the invention. The electronic device conventionally comprises a processor 31 and a memory 32 arranged to store computer-executable instructions (program code). The memory 32 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk, or ROM. The memory 32 has a storage space 33 storing program code 34 for performing the method steps shown in Fig. 1 and in any of the embodiments; for example, the storage space 33 may comprise respective program codes 34 for implementing the various steps of the above method. The program code can be read from or written to one or more computer program products, which comprise a program code carrier such as a hard disk, a compact disc (CD), a memory card, or a floppy disk. Such a computer program product is typically a computer-readable storage medium as described with reference to Fig. 4, which may have memory segments and memory spaces arranged similarly to the memory 32 in the electronic device of Fig. 3. The program code may, for example, be compressed in a suitable form. In general, the storage medium stores program code 41 for performing the steps of the method according to the invention, i.e. program code that can be read by the processor 31 and that, when run by the electronic device, causes the electronic device to perform the steps of the method described above.
Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims (6)

1. A semi-supervised learning method for text classification, comprising the steps of:
acquiring a sample set for a given task, the sample set comprising a labeled sample set and an unlabeled sample set;
preprocessing the sample set;
predicting and classifying the preprocessed unlabeled sample set to expand the sample set;
training a deep learning model on the expanded sample set;
wherein the predicting and classifying of the preprocessed unlabeled sample set and the expanding of the sample set specifically comprise the steps of:
pre-training the deep learning model on the labeled sample set to obtain a pre-trained model;
predicting a first part of the unlabeled samples with the pre-trained model, and setting a confidence threshold;
comparing the prediction confidence for each sample in the first part with the threshold, and adding the samples above the threshold to the labeled sample set to complete the expansion;
the method further comprising: clustering a second part of the unlabeled samples with an unsupervised clustering algorithm, dividing the samples into a plurality of clusters, using the class corresponding to each cluster center as the class label of its samples, and using the labeled samples for model training.
2. The semi-supervised learning method for text classification of claim 1, wherein the deep learning model is a BERT model.
3. The semi-supervised learning method for text classification of claim 1, wherein the unsupervised clustering algorithm is a KNN algorithm.
4. A semi-supervised learning system for text classification, comprising:
an acquisition module for acquiring a sample set for a given task, the sample set comprising a labeled sample set and an unlabeled sample set;
a preprocessing module for preprocessing the sample set;
an expansion module for predicting and classifying the preprocessed unlabeled sample set and expanding the sample set;
and a training module for training a deep learning model on the expanded sample set;
wherein the expansion module comprises:
a pre-training module for pre-training the deep learning model on the labeled sample set to obtain a pre-trained model;
a prediction module for predicting a first part of the unlabeled samples with the pre-trained model and setting a confidence threshold;
and an expansion sub-module for comparing the prediction confidence for each sample in the first part with the threshold and adding the samples above the threshold to the labeled sample set to complete the expansion;
the system further comprising a clustering module for clustering a second part of the unlabeled samples with an unsupervised clustering algorithm, dividing the samples into a plurality of clusters, using the class corresponding to each cluster center as the class label of its samples, and using the labeled samples for model training.
5. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, causes the processor to perform the steps of the semi-supervised learning method for text classification of any one of claims 1-3.
6. An electronic device, comprising:
a processor; and
a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the semi-supervised learning method for text classification of any one of claims 1-3.
CN202110173996.6A 2021-02-09 2021-02-09 Semi-supervised learning method and system for text classification Pending CN112528030A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110173996.6A CN112528030A (en) 2021-02-09 2021-02-09 Semi-supervised learning method and system for text classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110173996.6A CN112528030A (en) 2021-02-09 2021-02-09 Semi-supervised learning method and system for text classification

Publications (1)

Publication Number Publication Date
CN112528030A true CN112528030A (en) 2021-03-19

Family

ID=74975654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110173996.6A Pending CN112528030A (en) 2021-02-09 2021-02-09 Semi-supervised learning method and system for text classification

Country Status (1)

Country Link
CN (1) CN112528030A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861842A (en) * 2021-03-22 2021-05-28 天津汇智星源信息技术有限公司 Case text recognition method based on OCR and electronic equipment
CN113807171A (en) * 2021-08-10 2021-12-17 三峡大学 Text classification method based on semi-supervised transfer learning
CN113988176A (en) * 2021-10-27 2022-01-28 支付宝(杭州)信息技术有限公司 Sample labeling method and device
CN114595333A (en) * 2022-04-27 2022-06-07 之江实验室 Semi-supervision method and device for public opinion text analysis
CN114691875A (en) * 2022-04-22 2022-07-01 光大科技有限公司 Data classification and classification processing method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657744A (en) * 2015-01-29 2015-05-27 中国科学院信息工程研究所 Multi-classifier training method and classification method based on non-deterministic active learning
US9053391B2 (en) * 2011-04-12 2015-06-09 Sharp Laboratories Of America, Inc. Supervised and semi-supervised online boosting algorithm in machine learning framework
CN107316049A (en) * 2017-05-05 2017-11-03 华南理工大学 A transfer learning classification method based on semi-supervised self-training
CN107943856A (en) * 2017-11-07 2018-04-20 南京邮电大学 A text classification method and system based on expanding labeled samples
CN111723209A (en) * 2020-06-28 2020-09-29 上海携旅信息技术有限公司 Semi-supervised text classification model training method, text classification method, system, device and medium
CN112069310A (en) * 2020-06-18 2020-12-11 中国科学院计算技术研究所 Text classification method and system based on active learning strategy


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TANG Huanling et al., "SemiBoost-CR Classification Model with Confidence-Based Resampling" (《利用置信度重样本的SemiBoost-CR分类模型》), Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》) *
ZHU Chenguang, "Machine Reading Comprehension: Algorithms and Practice" (《机器阅读理解：算法与实践》), 31 January 2020 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861842A (en) * 2021-03-22 2021-05-28 天津汇智星源信息技术有限公司 Case text recognition method based on OCR and electronic equipment
CN113807171A (en) * 2021-08-10 2021-12-17 三峡大学 Text classification method based on semi-supervised transfer learning
CN113807171B (en) * 2021-08-10 2023-09-29 三峡大学 Text classification method based on semi-supervised transfer learning
CN113988176A (en) * 2021-10-27 2022-01-28 支付宝(杭州)信息技术有限公司 Sample labeling method and device
CN114691875A (en) * 2022-04-22 2022-07-01 光大科技有限公司 Data classification and classification processing method and device
CN114595333A (en) * 2022-04-27 2022-06-07 之江实验室 Semi-supervision method and device for public opinion text analysis
CN114595333B (en) * 2022-04-27 2022-08-09 之江实验室 Semi-supervision method and device for public opinion text analysis
WO2023092961A1 (en) * 2022-04-27 2023-06-01 之江实验室 Semi-supervised method and apparatus for public opinion text analysis

Similar Documents

Publication Publication Date Title
CN112528030A (en) Semi-supervised learning method and system for text classification
CN101542531B (en) Image recognizing apparatus and image recognizing method
US20170132314A1 (en) Identifying relevant topics for recommending a resource
CN107423278B (en) Evaluation element identification method, device and system
US20170154077A1 (en) Method for comment tag extraction and electronic device
CN109558482B (en) Parallelization method of text clustering model PW-LDA based on Spark framework
CN107273883B (en) Decision tree model training method, and method and device for determining data attributes in OCR (optical character recognition) result
CN112084435A (en) Search ranking model training method and device and search ranking method and device
CN107844531B (en) Answer output method and device and computer equipment
CN110969015B (en) Automatic label identification method and equipment based on operation and maintenance script
CN113205814A (en) Voice data labeling method and device, electronic equipment and storage medium
CN116150367A (en) Emotion analysis method and system based on aspects
CN113297379A (en) Text data multi-label classification method and device
CN110852076B (en) Method and device for automatic disease code conversion
CN116070632A (en) Informal text entity tag identification method and device
CN110717013B (en) Vectorization of documents
CN111444718A (en) Insurance product demand document processing method and device and electronic equipment
CN116663536B (en) Matching method and device for clinical diagnosis standard words
CN115687576B (en) Keyword extraction method and device represented by theme constraint
CN111898378A (en) Industry classification method and device for government and enterprise clients, electronic equipment and storage medium
CN107368464B (en) Method and device for acquiring bidding product information
CN112115362B (en) Programming information recommendation method and device based on similar code recognition
WO2018220688A1 (en) Dictionary generator, dictionary generation method, and program
CN115455969A (en) Medical text named entity recognition method, device, equipment and storage medium
CN111400606B (en) Multi-label classification method based on global and local information extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210319