CN112528030A - Semi-supervised learning method and system for text classification - Google Patents


Info

Publication number
CN112528030A
CN112528030A (application CN202110173996.6A)
Authority
CN
China
Prior art keywords
sample set
training
samples
module
semi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110173996.6A
Other languages
Chinese (zh)
Inventor
李越超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongguancun Smart City Co Ltd
Original Assignee
Zhongguancun Smart City Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongguancun Smart City Co Ltd filed Critical Zhongguancun Smart City Co Ltd
Priority to CN202110173996.6A priority Critical patent/CN112528030A/en
Publication of CN112528030A publication Critical patent/CN112528030A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; database structures therefor; file system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; classification
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques based on distances to training or reference patterns
    • G06F18/24147 - Distances to closest patterns, e.g. nearest-neighbour classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention provide a semi-supervised learning method and system for text classification, comprising the steps of: acquiring a sample set for a given task, the sample set comprising a labeled sample set and an unlabeled sample set; preprocessing the sample set; predicting and classifying the preprocessed unlabeled sample set to expand the sample set; and training a deep learning model on the expanded sample set. The method addresses the shortage of labeled data in supervised learning by learning the task from both unlabeled data and the labeled sample set, and addresses the absence of class labels by applying a clustering method from unsupervised learning. With only a small number of labels, the method greatly improves the efficiency of text classification, reduces labor cost, and improves accuracy over unsupervised classification.

Description

Semi-supervised learning method and system for text classification
Technical Field
The invention relates to the technical field of machine learning, and in particular to a semi-supervised learning method and system for text classification.
Background
At present, tasks in natural language processing generally include subtasks such as text classification, entity recognition, and sentiment recognition. The text classification task assigns a text to a specific label. Models are typically trained either with supervised methods, such as deep learning and classical machine learning, or with unsupervised methods, and classification is then performed with the trained model.
In many text classification scenarios, collecting a large labeled dataset requires substantial manual annotation, which is inefficient and expensive, or simply infeasible. When the labeled sample size is small, supervised training is commonly used; however, on a small labeled dataset a typical supervised learning algorithm easily overfits and cannot effectively characterize the data. Unsupervised learning is the usual choice when there is a large amount of sample data but no specific desired result or label, but it cannot provide reliable class information and cannot meet accuracy requirements in scenarios where sample diversity is limited.
In the text classification task considered here, only a small amount of labeled sample data exists and its categories are incomplete, while a large amount of unlabeled sample data exists that covers all categories. Neither supervised learning nor unsupervised learning alone can handle this scenario effectively.
Disclosure of Invention
Therefore, to solve the above technical problem, embodiments of the present invention provide a semi-supervised learning method and system for text classification, which address the shortage of labeled data in supervised learning by learning the task from both unlabeled data and the labeled sample set, and address the absence of class labels by applying a clustering method from unsupervised learning. With only a small number of labels, the method greatly improves the efficiency of text classification, reduces labor cost, and improves accuracy over unsupervised classification. The specific technical scheme is as follows:
in order to achieve the above object, an embodiment of the present invention provides a semi-supervised learning method for text classification, including the steps of:
acquiring a sample set for a given task, the sample set comprising a labeled sample set and an unlabeled sample set;
preprocessing the sample set;
predicting and classifying the preprocessed unlabeled sample set to expand the sample set;
and training a deep learning model on the expanded sample set.
Further, the predicting and classifying of the preprocessed unlabeled sample set and the expanding of the sample set specifically comprise the steps of:
pre-training the deep learning model on the labeled sample set to obtain a pre-trained model;
predicting a first part of the unlabeled samples with the pre-trained model, and setting a confidence threshold;
and comparing the prediction confidence for each sample in the first part with the threshold, adding the samples above the threshold to the labeled sample set to complete the expansion.
Further, the method also comprises: clustering a second part of the unlabeled samples with an unsupervised clustering algorithm, dividing the samples into a plurality of clusters, using the class corresponding to each cluster center as the class label of its samples, and using the labeled samples for model training.
Further, the deep learning model is a BERT model.
Further, the unsupervised clustering algorithm is a KNN algorithm.
A second aspect of an embodiment of the present invention provides a semi-supervised learning system for text classification, including:
an acquisition module for acquiring a sample set for a given task, the sample set comprising a labeled sample set and an unlabeled sample set;
a preprocessing module for preprocessing the sample set;
an expansion module for predicting and classifying the preprocessed unlabeled sample set and expanding the sample set;
and a training module for training a deep learning model on the expanded sample set.
Further, the expansion module comprises:
a pre-training module for pre-training the deep learning model on the labeled sample set to obtain a pre-trained model;
a prediction module for predicting a first part of the unlabeled samples with the pre-trained model and setting a confidence threshold;
and an expansion sub-module for comparing the prediction confidence for each sample in the first part with the threshold and adding the samples above the threshold to the labeled sample set to complete the expansion.
The system further comprises a clustering module for clustering a second part of the unlabeled samples with an unsupervised clustering algorithm, dividing the samples into a plurality of clusters, using the class corresponding to each cluster center as the class label of its samples, and using the labeled samples for model training.
A third aspect of the embodiments of the present invention provides a computer-readable storage medium on which a computer program is stored which, when executed by a processor, causes the processor to perform the steps of the semi-supervised learning method for text classification described above.
A fourth aspect of the present invention provides an electronic apparatus comprising:
a processor; and
a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the semi-supervised learning method for text classification described above.
The embodiment of the invention provides a semi-supervised learning method for text classification, comprising the steps of: acquiring a sample set for a given task, the sample set comprising a labeled sample set and an unlabeled sample set; preprocessing the sample set; predicting and classifying the preprocessed unlabeled sample set to expand the sample set; and training a deep learning model on the expanded sample set. The method addresses the shortage of labeled data in supervised learning by learning the task from both unlabeled data and the labeled sample set, and addresses the absence of class labels by applying a clustering method from unsupervised learning. With only a small number of labels, the method greatly improves the efficiency of text classification, reduces labor cost, and improves accuracy over unsupervised classification.
Drawings
Fig. 1 is a flowchart of a semi-supervised learning method for text classification according to embodiment 1 of the present invention;
fig. 2 is a block diagram schematically illustrating a structure of a semi-supervised learning system for text classification according to embodiment 2 of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to embodiment 3 of the present invention;
fig. 4 is a schematic structural diagram of a computer-readable storage medium according to embodiment 4 of the present invention;
in the figure: 31-a processor; 32-a memory; 33-storage space; 34-program code; 41-program code.
Detailed Description
To present the technical solution of the present invention clearly and thoroughly, the following description is made with reference to the accompanying drawings; the scope of the present invention, however, is not limited thereto.
Referring to Fig. 1, a flowchart of a semi-supervised learning method for text classification according to embodiment 1 of the present invention, the method includes the steps of:
acquiring a sample set for a given task;
preprocessing the sample set;
predicting and classifying the preprocessed unlabeled sample set to expand the sample set;
and training a deep learning model on the expanded sample set.
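The four steps above can be sketched end-to-end as follows. This is a minimal illustrative sketch, not the patent's implementation: `predict` is a hypothetical stand-in for the pre-trained classifier (the patent uses a BERT model), and the cleaning and threshold logic are toy versions.

```python
# Minimal sketch of the four-step pipeline, under stated assumptions:
# `predict` stands in for a pre-trained classifier returning (label, confidence).

def preprocess(texts):
    """Toy cleaning step: strip whitespace and lowercase each text."""
    return [t.strip().lower() for t in texts]

def expand(labeled, unlabeled, predict, threshold=0.9):
    """Add confidently predicted unlabeled samples to the labeled set."""
    expanded = list(labeled)
    for text in unlabeled:
        label, confidence = predict(text)
        if confidence > threshold:
            expanded.append((text, label))
    return expanded

def build_training_set(labeled, unlabeled, predict):
    texts = preprocess([t for t, _ in labeled])
    labels = [y for _, y in labeled]
    cleaned_labeled = list(zip(texts, labels))
    cleaned_unlabeled = preprocess(unlabeled)
    # In the full method, a deep model is then trained on the expanded set.
    return expand(cleaned_labeled, cleaned_unlabeled, predict)

# Usage with a trivial keyword-based stand-in for the classifier:
predict = lambda t: ("sport", 0.95) if "ball" in t else ("other", 0.5)
data = build_training_set([("Football news", "sport")],
                          ["  Ball game tonight ", "weather"], predict)
```

The confident unlabeled sample joins the training set; the low-confidence one is left out, matching the expansion step described above.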
The sample set comprises a labeled sample set and an unlabeled sample set, and the preprocessing comprises data cleaning of each labeled and unlabeled sample. For example, if a text classification model for a certain language (e.g., Chinese) is to be trained, words not in that language are deleted. A cleaning step such as stop-word filtering may also be performed: meaningless function words are collected in a preset stop list, and any word in the list is deleted wherever it appears in a sample. Note that this embodiment does not limit the specific implementation of data cleaning.
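A minimal sketch of the stop-word filtering step described above, assuming a small hypothetical stop list; a real pipeline would also need language filtering and a proper tokenizer, particularly for Chinese text:

```python
# Hypothetical preset stop list; the patent's examples are Chinese function words.
STOP_WORDS = {"the", "a", "of", "is"}

def remove_stop_words(tokens):
    """Delete any token that appears in the preset stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

cleaned = remove_stop_words("The price of the ticket is high".split())
# cleaned == ["price", "ticket", "high"]
```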
In an alternative implementation of the embodiment of the present invention, to facilitate the later processing of the unlabeled sample set, it is divided into two parts, a first part and a second part.
The construction of smart cities involves a large amount of text data to be processed, and applying these texts to downstream tasks such as classification suffers from the small amount of labeled data and the incomplete label categories, so the sample set needs to be expanded. The expansion comprises the steps of: performing back-translation on each unlabeled sample and taking the back-translation result as a corresponding data-expansion sample; or obtaining the keywords and non-keywords in each unlabeled sample with the TF-IDF algorithm, performing word replacement on the non-keywords in each unlabeled sample, and taking the word replacement result as a corresponding data-expansion sample.
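The TF-IDF-based expansion can be sketched as follows: words with low TF-IDF scores in a sample are treated as non-keywords and replaced, yielding a new sample. The scoring formula, the `keep_ratio` cutoff, and the synonym table are illustrative assumptions, not the patent's exact parameters:

```python
import math
from collections import Counter

def tf_idf_scores(doc, corpus):
    """TF-IDF of each word in `doc` against a corpus of token lists."""
    n = len(corpus)
    tf = Counter(doc)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for d in corpus if word in d)  # document frequency
        scores[word] = (count / len(doc)) * math.log((n + 1) / (df + 1))
    return scores

def replace_non_keywords(doc, corpus, synonyms, keep_ratio=0.5):
    """Keep the top-scoring words as keywords; replace the rest if a synonym exists."""
    scores = tf_idf_scores(doc, corpus)
    ranked = sorted(set(doc), key=lambda w: scores[w], reverse=True)
    keywords = set(ranked[: max(1, int(len(ranked) * keep_ratio))])
    return [w if w in keywords else synonyms.get(w, w) for w in doc]

corpus = [["city", "traffic", "data"], ["city", "weather"], ["smart", "city", "traffic"]]
augmented = replace_non_keywords(corpus[0], corpus, {"city": "town"})
# "city" occurs in every document, scores lowest, and is swapped for "town".
```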
In the embodiment of the present invention, the predicting and classifying of the preprocessed unlabeled sample set and the expanding of the sample set specifically include the steps of:
pre-training the deep learning model on the labeled sample set to obtain a pre-trained model;
predicting a first part of the unlabeled samples with the pre-trained model, and setting a confidence threshold;
and comparing the prediction confidence for each sample in the first part with the threshold, adding the samples above the threshold to the labeled sample set to complete the expansion.
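The threshold comparison in these steps can be sketched as follows; `predict_proba` is a hypothetical stand-in for the per-class probabilities produced by the pre-trained (e.g. BERT) model, here hard-coded as a toy:

```python
def pseudo_label(unlabeled, predict_proba, threshold=0.8):
    """Split unlabeled texts into confidently pseudo-labeled and remaining ones."""
    accepted, remaining = [], []
    for text in unlabeled:
        probs = predict_proba(text)             # e.g. {"spam": 0.9, "ham": 0.1}
        label = max(probs, key=probs.get)       # most probable class
        if probs[label] > threshold:
            accepted.append((text, label))      # added to the labeled set
        else:
            remaining.append(text)              # left for the clustering step
    return accepted, remaining

# Toy hard-coded "model" output for two texts:
toy = {"cheap flights now": {"spam": 0.9, "ham": 0.1},
       "see you tomorrow": {"spam": 0.55, "ham": 0.45}}
accepted, remaining = pseudo_label(list(toy), toy.get)
```

Only the text predicted with confidence above the threshold is added to the labeled set; the rest remain unlabeled.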
The deep learning model is a BERT model, and the unsupervised clustering algorithm is a KNN algorithm. The method uses the small labeled sample set for supervised learning, computes the model's confidence that a sample belongs to a given class, and then predicts the unlabeled data. Samples whose confidence exceeds a predefined threshold are selected from the unlabeled dataset, and their predictions are taken as labels when the samples are added to the training data. The remaining unlabeled samples are clustered with the KNN algorithm from unsupervised learning, whose core idea is that if the majority of the K samples nearest to a sample in feature space belong to a certain class, the sample is assigned to that class.
In an optional implementation of the embodiment of the present invention, the method further includes clustering a second part of the unlabeled samples with an unsupervised clustering algorithm, dividing the samples into a plurality of clusters, using the class corresponding to each cluster center as the class label of its samples, and using the labeled samples for model training.
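A minimal sketch of the nearest-neighbour voting idea behind this clustering step: each remaining unlabeled sample takes the majority label among its K nearest reference points in feature space. The 2-D toy vectors and squared-Euclidean distance are illustrative assumptions; in practice the features would come from the text encoder:

```python
from collections import Counter

def knn_label(vector, reference, k=3):
    """Majority label among the k nearest (vector, label) reference pairs."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(reference, key=lambda item: dist(vector, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

reference = [((0.0, 0.0), "A"), ((0.1, 0.0), "A"), ((0.2, 0.1), "A"),
             ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
label = knn_label((0.05, 0.05), reference, k=3)   # nearest three are all "A"
```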
The embodiment of the invention provides a semi-supervised learning method for text classification, comprising the steps of: acquiring a sample set for a given task, the sample set comprising a labeled sample set and an unlabeled sample set; preprocessing the sample set; predicting and classifying the preprocessed unlabeled sample set to expand the sample set; and training a deep learning model on the expanded sample set. The method addresses the shortage of labeled data in supervised learning by learning the task from both unlabeled data and the labeled sample set, and addresses the absence of class labels by applying a clustering method from unsupervised learning. With only a small number of labels, the method greatly improves the efficiency of text classification, reduces labor cost, and improves accuracy over unsupervised classification.
Fig. 2 is a block diagram schematically illustrating a structure of a semi-supervised learning system for text classification according to embodiment 2 of the present invention, including:
an acquisition module for acquiring a sample set for a given task, the sample set comprising a labeled sample set and an unlabeled sample set;
a preprocessing module for preprocessing the sample set;
an expansion module for predicting and classifying the preprocessed unlabeled sample set and expanding the sample set;
and a training module for training a deep learning model on the expanded sample set.
Further, the expansion module comprises:
a pre-training module for pre-training the deep learning model on the labeled sample set to obtain a pre-trained model;
a prediction module for predicting a first part of the unlabeled samples with the pre-trained model and setting a confidence threshold;
and an expansion sub-module for comparing the prediction confidence for each sample in the first part with the threshold and adding the samples above the threshold to the labeled sample set to complete the expansion.
The system further comprises a clustering module for clustering a second part of the unlabeled samples with an unsupervised clustering algorithm, dividing the samples into a plurality of clusters, using the class corresponding to each cluster center as the class label of its samples, and using the labeled samples for model training.
A third aspect of the embodiments of the present invention provides a computer-readable storage medium on which a computer program is stored which, when executed by a processor, causes the processor to perform the steps of the semi-supervised learning method for text classification described above.
A fourth aspect of the present invention provides an electronic apparatus comprising:
a processor; and
a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the semi-supervised learning method for text classification described above.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
It should be noted that:
the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It will be appreciated by those skilled in the art that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the apparatus according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
For example, Fig. 3 shows a schematic structural diagram of an electronic device according to an embodiment of the invention. The electronic device conventionally comprises a processor 31 and a memory 32 arranged to store computer-executable instructions (program code). The memory 32 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk, or ROM. The memory 32 has a storage space 33 storing program code 34 for performing the method steps shown in Fig. 1 and in any of the embodiments; for example, the storage space 33 may comprise respective program codes 34 for implementing the various steps of the above method. The program code can be read from or written to one or more computer program products, which comprise a program code carrier such as a hard disk, a compact disc (CD), a memory card, or a floppy disk. Such a computer program product is typically a computer-readable storage medium as described with reference to Fig. 4, which may have memory segments and memory spaces arranged similarly to the memory 32 in the electronic device of Fig. 3. The program code may, for example, be compressed in a suitable form. In general, the storage medium stores program code 41 for performing the steps of the method according to the invention, i.e. program code that can be read by the processor 31 and that, when run by the electronic device, causes the electronic device to perform the steps of the method described above.
Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims (6)

1. A semi-supervised learning method for text classification, comprising the steps of:
acquiring a sample set for a given task, the sample set comprising a labeled sample set and an unlabeled sample set;
preprocessing the sample set;
predicting and classifying the preprocessed unlabeled sample set to expand the sample set;
training a deep learning model on the expanded sample set;
wherein the predicting and classifying of the preprocessed unlabeled sample set and the expanding of the sample set specifically comprise the steps of:
pre-training the deep learning model on the labeled sample set to obtain a pre-trained model;
predicting a first part of the unlabeled samples with the pre-trained model, and setting a confidence threshold;
comparing the prediction confidence for each sample in the first part with the threshold, and adding the samples above the threshold to the labeled sample set to complete the expansion;
the method further comprising: clustering a second part of the unlabeled samples with an unsupervised clustering algorithm, dividing the samples into a plurality of clusters, using the class corresponding to each cluster center as the class label of its samples, and using the labeled samples for model training.
2. The semi-supervised learning method for text classification of claim 1, wherein the deep learning model is a BERT model.
3. The semi-supervised learning method for text classification of claim 1, wherein the unsupervised clustering algorithm is a KNN algorithm.
4. A semi-supervised learning system for text classification, comprising:
an acquisition module for acquiring a sample set for a given task, the sample set comprising a labeled sample set and an unlabeled sample set;
a preprocessing module for preprocessing the sample set;
an expansion module for predicting and classifying the preprocessed unlabeled sample set and expanding the sample set;
and a training module for training a deep learning model on the expanded sample set;
wherein the expansion module comprises:
a pre-training module for pre-training the deep learning model on the labeled sample set to obtain a pre-trained model;
a prediction module for predicting a first part of the unlabeled samples with the pre-trained model and setting a confidence threshold;
and an expansion sub-module for comparing the prediction confidence for each sample in the first part with the threshold and adding the samples above the threshold to the labeled sample set to complete the expansion;
the system further comprising a clustering module for clustering a second part of the unlabeled samples with an unsupervised clustering algorithm, dividing the samples into a plurality of clusters, using the class corresponding to each cluster center as the class label of its samples, and using the labeled samples for model training.
5. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, causes the processor to perform the steps of the semi-supervised learning method for text classification of any one of claims 1-3.
6. An electronic device, comprising:
a processor; and
a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the semi-supervised learning method for text classification of any one of claims 1-3.
CN202110173996.6A 2021-02-09 2021-02-09 Semi-supervised learning method and system for text classification Pending CN112528030A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110173996.6A CN112528030A (en) 2021-02-09 2021-02-09 Semi-supervised learning method and system for text classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110173996.6A CN112528030A (en) 2021-02-09 2021-02-09 Semi-supervised learning method and system for text classification

Publications (1)

Publication Number Publication Date
CN112528030A true CN112528030A (en) 2021-03-19

Family

ID=74975654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110173996.6A Pending CN112528030A (en) 2021-02-09 2021-02-09 Semi-supervised learning method and system for text classification

Country Status (1)

Country Link
CN (1) CN112528030A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861842A (en) * 2021-03-22 2021-05-28 天津汇智星源信息技术有限公司 Case text recognition method based on OCR and electronic equipment
CN113807171A (en) * 2021-08-10 2021-12-17 三峡大学 Text classification method based on semi-supervised transfer learning
CN113988176A (en) * 2021-10-27 2022-01-28 支付宝(杭州)信息技术有限公司 Sample labeling method and device
CN114595333A (en) * 2022-04-27 2022-06-07 之江实验室 Semi-supervision method and device for public opinion text analysis
CN114691875A (en) * 2022-04-22 2022-07-01 光大科技有限公司 Data classification and classification processing method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657744A (en) * 2015-01-29 2015-05-27 中国科学院信息工程研究所 Multi-classifier training method and classification method based on non-deterministic active learning
US9053391B2 (en) * 2011-04-12 2015-06-09 Sharp Laboratories Of America, Inc. Supervised and semi-supervised online boosting algorithm in machine learning framework
CN107316049A (en) * 2017-05-05 2017-11-03 华南理工大学 A transfer learning classification method based on semi-supervised self-training
CN107943856A (en) * 2017-11-07 2018-04-20 南京邮电大学 A text classification method and system based on expanding labeled samples
CN111723209A (en) * 2020-06-28 2020-09-29 上海携旅信息技术有限公司 Semi-supervised text classification model training method, text classification method, system, device and medium
CN112069310A (en) * 2020-06-18 2020-12-11 中国科学院计算技术研究所 Text classification method and system based on active learning strategy


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TANG Huanling et al., "SemiBoost-CR Classification Model with Confidence-Based Resampling" (《利用置信度重样本的SemiBoost-CR分类模型》), Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》) *
ZHU Chenguang, "Machine Reading Comprehension: Algorithms and Practice" (《机器阅读理解：算法与实践》), 31 January 2020 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861842A (en) * 2021-03-22 2021-05-28 天津汇智星源信息技术有限公司 Case text recognition method based on OCR and electronic equipment
CN113807171A (en) * 2021-08-10 2021-12-17 三峡大学 Text classification method based on semi-supervised transfer learning
CN113807171B (en) * 2021-08-10 2023-09-29 三峡大学 Text classification method based on semi-supervised transfer learning
CN113988176A (en) * 2021-10-27 2022-01-28 支付宝(杭州)信息技术有限公司 Sample labeling method and device
CN114691875A (en) * 2022-04-22 2022-07-01 光大科技有限公司 Data classification and classification processing method and device
CN114595333A (en) * 2022-04-27 2022-06-07 之江实验室 Semi-supervision method and device for public opinion text analysis
CN114595333B (en) * 2022-04-27 2022-08-09 之江实验室 Semi-supervision method and device for public opinion text analysis
WO2023092961A1 (en) * 2022-04-27 2023-06-01 之江实验室 Semi-supervised method and apparatus for public opinion text analysis

Similar Documents

Publication Publication Date Title
CN112528030A (en) Semi-supervised learning method and system for text classification
CN101542531B (en) Image recognizing apparatus and image recognizing method
US20170132314A1 (en) Identifying relevant topics for recommending a resource
CN107423278B (en) Evaluation element identification method, device and system
US20170154077A1 (en) Method for comment tag extraction and electronic device
CN109558482B (en) Parallelization method of text clustering model PW-LDA based on Spark framework
CN107273883B (en) Decision tree model training method, and method and device for determining data attributes in OCR (optical character recognition) result
CN112084435A (en) Search ranking model training method and device and search ranking method and device
CN107844531B (en) Answer output method and device and computer equipment
CN110969015B (en) Automatic label identification method and equipment based on operation and maintenance script
CN113205814A (en) Voice data labeling method and device, electronic equipment and storage medium
CN116150367A (en) Emotion analysis method and system based on aspects
CN113297379A (en) Text data multi-label classification method and device
CN110852076B (en) Method and device for automatic disease code conversion
CN116070632A (en) Informal text entity tag identification method and device
CN110717013B (en) Vectorization of documents
CN111444718A (en) Insurance product demand document processing method and device and electronic equipment
CN116663536B (en) Matching method and device for clinical diagnosis standard words
CN115687576B (en) Keyword extraction method and device represented by theme constraint
CN111898378A (en) Industry classification method and device for government and enterprise clients, electronic equipment and storage medium
CN107368464B (en) Method and device for acquiring bidding product information
CN112115362B (en) Programming information recommendation method and device based on similar code recognition
WO2018220688A1 (en) Dictionary generator, dictionary generation method, and program
CN115455969A (en) Medical text named entity recognition method, device, equipment and storage medium
CN111400606B (en) Multi-label classification method based on global and local information extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210319