WO2022160449A1 - Text classification method and apparatus, electronic device, and storage medium - Google Patents

Text classification method and apparatus, electronic device, and storage medium

Info

Publication number
WO2022160449A1
WO2022160449A1 (PCT/CN2021/083560, CN2021083560W)
Authority
WO
WIPO (PCT)
Prior art keywords
classification
model
text
confidence
label
Prior art date
Application number
PCT/CN2021/083560
Other languages
French (fr)
Chinese (zh)
Inventor
谢馥芯
王磊
陈又新
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2022160449A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Definitions

  • the present application relates to the technical field of natural language processing, and in particular, to a text classification method, apparatus, electronic device, and computer-readable storage medium.
  • a text classification method provided by this application includes:
  • acquiring a multi-model structure classification voting model and a multi-task classification model, the multi-model structure classification voting model and the multi-task classification model being obtained through a pre-built classification model and a training sample set;
  • acquiring the text to be classified, and preprocessing the text to be classified to obtain the processed text;
  • inputting the processed text into the multi-model structure classification voting model, and classifying the processed text through a plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, the first confidence space including a first confidence that the processed text belongs to a first classification label;
  • inputting the processed text into the multi-task classification model, and classifying the processed text in the multi-task classification model to obtain a second confidence space, the second confidence space including a second confidence that the processed text belongs to a second classification label;
  • determining, according to the first confidence space and the second confidence space, the classification label to which the text to be classified belongs and the classification confidence corresponding to the classification label.
  • the present application also provides a text classification device, the device comprising:
  • a model acquisition module used for acquiring a multi-model structure classification voting model and a multi-task classification model, the multi-model structure classification voting model and the multi-task classification model are obtained through a pre-built classification model and a training sample set;
  • a text preprocessing module used to obtain the text to be classified, and preprocess the text to be classified to obtain the processed text
  • a first model analysis module, used to input the processed text into the multi-model structure classification voting model and classify the processed text through a plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, where the first confidence space includes a first confidence that the processed text belongs to a first classification label;
  • a second model analysis module, configured to input the processed text into the multi-task classification model and classify the processed text in the multi-task classification model to obtain a second confidence space, where the second confidence space includes a second confidence that the processed text belongs to a second classification label;
  • a result processing module configured to determine, according to the first confidence space and the second confidence space, the classification label to which the text to be classified belongs and the classification confidence corresponding to the classification label.
  • the present application also provides an electronic device, the electronic device comprising:
  • the memory stores computer program instructions executable by the at least one processor, the computer program instructions being executed by the at least one processor to enable the at least one processor to perform the steps of:
  • acquiring a multi-model structure classification voting model and a multi-task classification model, the multi-model structure classification voting model and the multi-task classification model being obtained through a pre-built classification model and a training sample set;
  • acquiring the text to be classified, and preprocessing the text to be classified to obtain the processed text;
  • inputting the processed text into the multi-model structure classification voting model, and classifying the processed text through a plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, the first confidence space including a first confidence that the processed text belongs to a first classification label;
  • inputting the processed text into the multi-task classification model, and classifying the processed text in the multi-task classification model to obtain a second confidence space, the second confidence space including a second confidence that the processed text belongs to a second classification label;
  • determining, according to the first confidence space and the second confidence space, the classification label to which the text to be classified belongs and the classification confidence corresponding to the classification label.
  • the present application also provides a computer-readable storage medium, including a storage data area and a storage program area, where the storage data area stores created data and the storage program area stores a computer program; when executed by a processor, the computer program implements the following steps:
  • acquiring a multi-model structure classification voting model and a multi-task classification model, the multi-model structure classification voting model and the multi-task classification model being obtained through a pre-built classification model and a training sample set;
  • acquiring the text to be classified, and preprocessing the text to be classified to obtain the processed text;
  • inputting the processed text into the multi-model structure classification voting model, and classifying the processed text through a plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, the first confidence space including a first confidence that the processed text belongs to a first classification label;
  • inputting the processed text into the multi-task classification model, and classifying the processed text in the multi-task classification model to obtain a second confidence space, the second confidence space including a second confidence that the processed text belongs to a second classification label;
  • determining, according to the first confidence space and the second confidence space, the classification label to which the text to be classified belongs and the classification confidence corresponding to the classification label.
  • FIG. 1 is a schematic flowchart of a text classification method provided by an embodiment of the present application.
  • FIG. 2 is a schematic block diagram of a text classification apparatus according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the internal structure of an electronic device implementing a text classification method provided by an embodiment of the present application
  • the embodiment of the present application provides a text classification method.
  • the execution subject of the text classification method includes, but is not limited to, at least one of the electronic devices, such as a server or a terminal, that can be configured to execute the method provided by the embodiments of the present application.
  • the text classification method can be executed by software or hardware installed in a terminal device or a server device, and the software can be a blockchain platform.
  • the server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
  • the text classification method includes:
  • S1. Obtain a multi-model structure classification voting model and a multi-task classification model, where the multi-model structure classification voting model and the multi-task classification model are obtained through a pre-built classification model and a training sample set.
  • the multi-model structure classification voting model is obtained by training multiple base models on a pre-built training sample set, and then performing performance ranking and weight setting on the output results of each base model.
  • before the acquisition of the multi-model structure classification voting model and the multi-task classification model, the method further includes: acquiring the training sample set; training a pre-built classification model according to the random forest algorithm and the training sample set to obtain a plurality of text classification models; and constructing the multi-model structure classification voting model using the plurality of text classification models.
  • the classification model is a BERT model.
  • before acquiring the training sample set, the method includes: acquiring a pre-built corpus set, and performing quantization and cleaning operations on the corpus set to obtain the training sample set.
  • the corpus set consists of texts that have been classified in the past, or pre-classified corpus texts obtained from the network.
  • a quantization operation is performed on the corpus set to obtain quantized data, and a cleaning operation is performed on the quantized data to obtain the training sample set.
  • the quantization operation includes converting text of the float32 data type in the corpus set into the uint8 data type suitable for training a text classification model; the cleaning includes deduplicating the quantized data and filling empty values.
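  • as an illustration only (the patent gives no code), the quantization and cleaning described above might look like the following Python sketch; the min-max scaling scheme, the column selection, and the fill value are assumptions:

```python
import pandas as pd

def quantize_and_clean(corpus: pd.DataFrame) -> pd.DataFrame:
    """Quantize float32 columns to uint8, then deduplicate and fill empty values."""
    df = corpus.copy()
    for col in df.select_dtypes(include=["float32"]).columns:
        series = df[col].fillna(0.0)                      # avoid NaN -> uint8 errors
        lo, hi = series.min(), series.max()
        scaled = (series - lo) / (hi - lo + 1e-12) * 255  # min-max scale to [0, 255]
        df[col] = scaled.round().astype("uint8")
    df = df.drop_duplicates()                             # cleaning: deduplicate
    return df.fillna("")                                  # cleaning: fill empty values
```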
  • the random forest algorithm is an ensemble learning algorithm for classification.
  • specifically, using the random forest algorithm, 25% of the data is randomly drawn from the training sample set, with replacement, Q times (Q being a preset value) to train the text classification model, yielding Q text classification models.
  • the preset value Q is 5.
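  • a minimal sketch of this bagging-style training loop follows; `train_fn` is an assumed callable that fits one text classification model (for example, a BERT-based classifier) on a list of samples:

```python
import random
from typing import Callable, List, Sequence

def train_base_models(samples: Sequence, train_fn: Callable,
                      q: int = 5, frac: float = 0.25) -> List:
    """Train Q base models, each on 25% of the training set drawn with replacement."""
    models = []
    k = max(1, int(len(samples) * frac))
    for _ in range(q):
        bootstrap = random.choices(samples, k=k)  # sampling WITH replacement
        models.append(train_fn(bootstrap))
    return models
```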
  • the construction of the multi-model structure classification voting model by using the multiple text classification models includes:
  • weights are set for the multiple text classification models according to preset weight gradient values, so as to obtain the multi-model structure classification voting model.
  • the confidence formula of the multi-model structure classification voting model is:

    y(x) = Σ_{q=1}^{Q} p_q · y_q(x)

  • where p_q is the weight of the q-th text classification model and y_q(x) is the confidence result of the q-th text classification model for input x.
  • the model testing samples are texts of a known type.
  • for example, the analysis results obtained by the 5 text classification models for a test sample are: [Model 1: negative emotion, confidence 90%; Model 2: negative emotion, confidence 86%; Model 3: negative emotion, confidence 96%; Model 4: negative emotion, confidence 82%; Model 5: negative emotion, confidence 79%]; the 5 text classification models are then ranked by confidence to obtain the base model confidence table [Model 3; Model 1; Model 2; Model 4; Model 5].
  • according to this ranking, the weights are allocated as [Model 3: weight 0.3; Model 1: weight 0.25; Model 2: weight 0.2; Model 4: weight 0.15; Model 5: weight 0.1], and the five text classification models are combined according to these weights to obtain the multi-model structure classification voting model.
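  • the weighted voting rule can be checked with a small sketch; the confidences and weights below are taken from the worked examples in this document:

```python
def ensemble_confidence(confidences, weights):
    """Weighted voting confidence: y(x) = sum over q of p_q * y_q(x)."""
    return sum(p * y for p, y in zip(weights, confidences))

# Per-model confidences and their assigned weights from the embodiment:
print(ensemble_confidence([0.8, 0.9, 0.6, 0.5, 0.7],
                          [0.25, 0.20, 0.30, 0.15, 0.10]))  # 0.705 (up to float rounding)
```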
  • before the acquisition of the multi-model structure classification voting model and the multi-task classification model, the method further includes: combining the classification loss in the classification model with a pre-built similarity loss to obtain an improved loss, and replacing the classification loss in the classification model with the improved loss to obtain an optimized classification model; performing feature extraction on the training sample set using the feature extraction neural network in the optimized classification model to obtain sentence vectors; and training the optimized classification model using the sentence vectors, where the multi-task classification model is obtained when the decrease of the improved loss of the optimized classification model within a preset number of training steps is smaller than a preset loss threshold.
  • the training sample set includes different types of standard sentences in addition to the corpus, and the pre-built similarity loss is:
  • where N is the number of standard sentences in the training sample set, each standard sentence in the training sample set representing one type; the loss is computed from the sentence vector of the specified corpus and x_j, the sentence vector of the standard sentence.
  • the obtained improvement loss is:
  • the confidence calculation formula of the classification label to which each corpus belongs is (reading it in the standard softmax form consistent with the symbols below):

    conf_j = exp(z_j) / Σ_{k=1}^{K} exp(z_k)

  • where z_j is the classification result of the j-th short sentence in the corpus, that is, the score of the classification label of the j-th short sentence, and K is the number of classification results.
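  • assuming the usual softmax form of this confidence formula (the source formula itself is not reproduced), a sketch is:

```python
import math
from typing import List

def label_confidence(z: List[float]) -> List[float]:
    """Softmax reading of the formula above: conf_j = exp(z_j) / sum_k exp(z_k)."""
    m = max(z)                            # shift by the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

print(label_confidence([2.0, 1.0, 0.5]))  # confidences over K = 3 classification results
```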
  • the embodiment of the present application continuously trains and optimizes the classification model with the sentence vectors through a twin (Siamese) network, continuously minimizing the improved loss; when the decrease of the improved loss of the optimized classification model within the preset number of training steps is smaller than the preset loss threshold, the training process is stopped and the multi-task classification model is obtained.
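  • the stopping criterion described above can be sketched as follows; the window length corresponds to the preset number of training steps and the threshold to the preset loss threshold:

```python
def should_stop(loss_history, window: int, threshold: float) -> bool:
    """Stop once the improved loss has decreased by less than `threshold`
    over the last `window` training steps."""
    if len(loss_history) <= window:
        return False
    return (loss_history[-window - 1] - loss_history[-1]) < threshold
```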
  • S2. Acquire the text to be classified, and preprocess the text to be classified to obtain the processed text.
  • a pre-built recall engine may be used to obtain the text to be classified from the Internet or a local storage space.
  • the S2 includes:
  • punctuation segmentation or sentence-length segmentation is performed on the text to be classified to obtain the processed text.
  • in detail, when the volume of the text to be classified is not greater than 512 characters, the text to be classified is segmented according to punctuation, that is, divided at punctuation marks; when the volume of the text to be classified is greater than 512 characters, the text to be classified is segmented by sentence length, for example, randomly divided into processed texts whose volume is less than 512 characters.
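  • a hedged sketch of this preprocessing follows; the punctuation set used for splitting is an assumption:

```python
import re

def preprocess(text: str, max_len: int = 512) -> list:
    """Punctuation segmentation, falling back to length segmentation for
    any piece still longer than `max_len` characters."""
    pieces = [p.strip() for p in re.split(r"[。！？；.!?;]", text) if p.strip()]
    chunks = []
    for piece in pieces:
        if len(piece) <= max_len:
            chunks.append(piece)
        else:
            chunks.extend(piece[i:i + max_len]
                          for i in range(0, len(piece), max_len))
    return chunks
```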
  • S3. Input the processed text into the multi-model structure classification voting model, and classify the processed text through the plurality of base models in the multi-model structure classification voting model to obtain the first confidence space.
  • the multi-model structure classification voting model is used to classify the processed text: each base model produces a type result and a confidence corresponding to that type result, the confidences generated by the five models are combined through the weights, and the first classification label and the first confidence of the processed text are obtained.
  • for example, if the confidences obtained by the processed text through the five models are [0.8, 0.9, 0.6, 0.5, 0.7] and the weights of the five models are [0.25, 0.2, 0.3, 0.15, 0.1], the first confidence is 0.8×0.25 + 0.9×0.2 + 0.6×0.3 + 0.5×0.15 + 0.7×0.1 = 0.705.
  • S4. Input the processed text into the multi-task classification model, and classify the processed text in the multi-task classification model to obtain the second confidence space.
  • the multi-task classification model includes a classification task and a similarity task.
  • the multi-task classification model is used to analyze the processed text to obtain a similarity set between the processed text and each type of standard sentence, together with a confidence set corresponding to the similarities; the similarity set is then filtered to obtain the type corresponding to the standard sentence with the highest similarity to the processed text, which is taken as the second classification label of the processed text, and the confidence set is queried according to the second classification label to obtain the second confidence.
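  • a minimal sketch of the similarity task's label selection follows; cosine similarity is an illustrative choice, and the per-type arrays are assumptions:

```python
import numpy as np

def second_prediction(text_vec, standard_vecs, type_labels, confidences):
    """Return the type of the most similar standard sentence as the second
    classification label, plus the confidence looked up for that label."""
    sims = [float(np.dot(text_vec, s) /
                  (np.linalg.norm(text_vec) * np.linalg.norm(s) + 1e-12))
            for s in standard_vecs]
    best = int(np.argmax(sims))        # standard sentence most similar to the text
    return type_labels[best], confidences[best]
```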
  • S5. Determine, according to the first confidence space and the second confidence space, a classification label to which the text to be classified belongs and a classification confidence corresponding to the classification label.
  • in detail, the classification label to which the text to be classified belongs can be determined according to a confidence threshold, for example, selecting, from the first confidence space and the second confidence space, the confidence greater than a confidence threshold such as 0.8, and taking the classification label corresponding to that confidence as the classification result; or selecting, from the first confidence and the second confidence, the confidence greater than a confidence threshold such as 0.5, and taking the classification label corresponding to that confidence as the classification result.
  • the S5 includes:
  • if the first classification label is the same as the second classification label, it is determined that the classification label to which the text to be classified belongs is the first classification label (equivalently, the second classification label), and that the confidence corresponding to the classification label is the average of the first confidence and the second confidence.
  • for example, if the multi-model structure classification voting model outputs the prediction result (label 1, confidence 0.8) for the processed text and the multi-task classification model prediction result is (label 1, confidence 0.7), the predicted labels are the same (label 1), so the confidences are added and averaged; the type of the text to be classified is finally determined as label 1 with confidence 0.75, and (label 1, confidence 0.75) is output.
  • the S5 further includes:
  • if the first confidence is greater than the second confidence, it is determined that the classification label to which the text to be classified belongs is the first classification label, and the confidence corresponding to the classification result is the first confidence multiplied by a first coefficient;
  • if the first confidence is not greater than the second confidence, it is determined that the classification label to which the text to be classified belongs is the second classification label, and the confidence corresponding to the classification result is the second confidence multiplied by a second coefficient.
  • the values of the first coefficient and the second coefficient may be the same or different, for example, the values of the first coefficient and the second coefficient are both 0.5.
  • for example, if the prediction result of the processed text output by the multi-model structure classification voting model is (label 1, confidence 0.8) and the prediction result of the multi-task classification model is (label 2, confidence 0.9), the type result with the larger confidence is multiplied by 0.5; the type of the text to be classified is determined as label 2 with confidence 0.9 × 0.5 = 0.45, and (label 2, confidence 0.45) is output.
  • for another example, if the prediction result of the processed text output by the multi-model structure classification voting model is (label 1, confidence 0.8) and the prediction result of the multi-task classification model is (label 2, confidence 0.8), the confidences are the same; one type result is then randomly selected and its confidence is multiplied by 0.5, the type of the text to be classified is finally determined as label 1 or label 2 with confidence 0.4, and (label 1 or label 2, confidence 0.4) is output.
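  • the fusion rules of S5 can be summarized in a short sketch, reproducing the worked examples above:

```python
import random

def fuse_predictions(label1, conf1, label2, conf2, coefficient=0.5):
    """Combine the two model outputs using the rules described above."""
    if label1 == label2:
        return label1, (conf1 + conf2) / 2        # same label: average the confidences
    if conf1 == conf2:
        pick = random.choice([(label1, conf1), (label2, conf2)])
        return pick[0], pick[1] * coefficient     # tie: pick one at random, damp it
    if conf1 > conf2:
        return label1, conf1 * coefficient        # keep the larger confidence, damped
    return label2, conf2 * coefficient

print(fuse_predictions("label 1", 0.8, "label 1", 0.7))  # ('label 1', 0.75)
print(fuse_predictions("label 1", 0.8, "label 2", 0.9))  # ('label 2', 0.45)
```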
  • the text to be classified, together with the classification label to which it belongs and the corresponding classification confidence, may also be added to the training sample set.
  • the multi-task classification model and/or the multi-model structure classification voting model can then be further optimized by continuously expanding the training sample set, which helps to improve the accuracy of the confidence results.
  • the multi-model structure classification voting model and the multi-task classification model are used to classify the text to be classified separately, yielding a first confidence that the processed text belongs to a first classification label and a second confidence that the processed text belongs to a second classification label; the classification label of the processed text is then determined according to the first confidence and the second confidence. Combining different models for category judgment improves the accuracy of text category judgment, and the text classification process requires no manual intervention, improving the efficiency of text category judgment; at the same time, since the first confidence that the processed text belongs to the first classification label is obtained by analyzing the processed text with each base model in the multi-model structure classification voting model, the multiple analyses further improve the accuracy of text category judgment. Therefore, the text classification method proposed in this application can improve both the reliability of the text classification result and the efficiency of text classification.
  • as shown in FIG. 2, it is a schematic block diagram of the text classification apparatus of the present application.
  • the text classification apparatus 100 described in this application can be installed in an electronic device.
  • the text classification apparatus may include a model acquisition module 101 , a text preprocessing module 102 , a first model analysis module 103 , a second model analysis module 104 and a result processing module 105 .
  • the modules described in this application may also be referred to as units, which refer to a series of computer program segments that can be executed by the processor of an electronic device and can perform fixed functions, and are stored in the memory of the electronic device.
  • each module/unit is as follows:
  • the model obtaining module 101 is configured to obtain a multi-model structure classification voting model and a multi-task classification model, and the multi-model structure classification voting model and the multi-task classification model are obtained through a pre-built classification model and a training sample set.
  • the multi-model structure classification voting model is obtained by training multiple base models through a pre-built training sample set, and then performing performance ranking and weight setting on the output results of each base model.
  • the device further includes a multi-model structure classification voting model building module, and the multi-model structure classification voting model building module includes:
  • a first training unit used for training a pre-built classification model according to the random forest algorithm and the training sample set, to obtain a plurality of text classification models
  • a construction unit configured to construct the multi-model structure classification voting model by using the plurality of text classification models.
  • the classification model is a BERT model.
  • before obtaining the training sample set, the apparatus obtains a pre-built corpus set and performs quantization and cleaning operations on the corpus set to obtain the training sample set.
  • the corpus set consists of texts that have been classified in the past, or pre-classified corpus texts obtained from the network.
  • a quantization operation is performed on the corpus set to obtain quantized data, and a cleaning operation is performed on the quantized data to obtain the training sample set.
  • the quantization operation includes converting text of the float32 data type in the corpus set into the uint8 data type suitable for training a text classification model; the cleaning includes deduplicating the quantized data and filling empty values.
  • the random forest algorithm is an ensemble learning algorithm for classification.
  • specifically, using the random forest algorithm, 25% of the data is randomly drawn from the training sample set, with replacement, Q times (Q being a preset value) to train the text classification model, yielding Q text classification models.
  • the preset value Q is 5.
  • the construction unit is specifically configured to:
  • weights are set for the multiple text classification models according to preset weight gradient values, so as to obtain the multi-model structure classification voting model.
  • the confidence formula of the multi-model structure classification voting model is:

    y(x) = Σ_{q=1}^{Q} p_q · y_q(x)

  • where p_q is the weight of the q-th text classification model and y_q(x) is the confidence result of the q-th text classification model for input x.
  • the model testing samples are texts of a known type.
  • for example, the analysis results obtained by the 5 text classification models for a test sample are: [Model 1: negative emotion, confidence 90%; Model 2: negative emotion, confidence 86%; Model 3: negative emotion, confidence 96%; Model 4: negative emotion, confidence 82%; Model 5: negative emotion, confidence 79%]; the 5 text classification models are then ranked by confidence to obtain the base model confidence table [Model 3; Model 1; Model 2; Model 4; Model 5].
  • according to this ranking, the weights are allocated as [Model 3: weight 0.3; Model 1: weight 0.25; Model 2: weight 0.2; Model 4: weight 0.15; Model 5: weight 0.1], and the five text classification models are combined according to these weights to obtain the multi-model structure classification voting model.
  • the device further includes a multi-task classification model acquisition module, and the multi-task classification model acquisition module includes:
  • an optimized classification model acquisition unit, configured to combine the classification loss in the classification model with the pre-built similarity loss to obtain an improved loss, and replace the classification loss in the classification model with the improved loss to obtain an optimized classification model;
  • a feature extraction unit configured to perform feature extraction on the training sample set by utilizing the feature extraction neural network in the optimized classification model to obtain a sentence vector
  • a second training unit, configured to train the optimized classification model using the sentence vectors until, within a preset number of training steps, the decrease of the improved loss of the optimized classification model is smaller than a preset loss threshold, to obtain the multi-task classification model.
  • the training sample set includes different types of standard sentences in addition to the corpus, and the pre-built similarity loss is:
  • where N is the number of standard sentences in the training sample set, each standard sentence in the training sample set representing one type; the loss is computed from the sentence vector of the specified corpus and x_j, the sentence vector of the standard sentence.
  • the obtained improvement loss is:
  • the confidence calculation formula of the classification label to which each corpus belongs is (reading it in the standard softmax form consistent with the symbols below):

    conf_j = exp(z_j) / Σ_{k=1}^{K} exp(z_k)

  • where z_j is the classification result of the j-th short sentence in the corpus, that is, the score of the classification label of the j-th short sentence, and K is the number of classification results.
  • the embodiment of the present application continuously trains and optimizes the classification model with the sentence vectors through a twin (Siamese) network, continuously minimizing the improved loss; when the decrease of the improved loss of the optimized classification model within the preset number of training steps is smaller than the preset loss threshold, the training process is stopped and the multi-task classification model is obtained.
  • the text preprocessing module 102 is configured to acquire the text to be classified, and preprocess the text to be classified to obtain the processed text.
  • a pre-built recall engine may be used to obtain the text to be classified from the Internet or a local storage space.
  • the text preprocessing module 102 is specifically used for:
  • punctuation segmentation or sentence-length segmentation is performed on the text to be classified to obtain the processed text.
  • in detail, when the volume of the text to be classified is not greater than 512 characters, the text to be classified is segmented according to punctuation, that is, divided at punctuation marks; when the volume of the text to be classified is greater than 512 characters, the text to be classified is segmented by sentence length, for example, randomly divided into processed texts whose volume is less than 512 characters.
  • the first model analysis module 103 is configured to input the processed text into the multi-model structure classification voting model, and classify the processed text through a plurality of base models in the multi-model structure classification voting model, A first confidence space is obtained, where the first confidence space includes a first confidence that the processed text belongs to a first classification label.
  • the multi-model structure classification voting model is used to classify the processed text: each base model produces a type result and a confidence corresponding to that type result, the confidences generated by the five models are combined through the weights, and the first classification label and the first confidence of the processed text are obtained.
  • for example, if the confidences obtained by the processed text through the five models are [0.8, 0.9, 0.6, 0.5, 0.7] and the weights of the five models are [0.25, 0.2, 0.3, 0.15, 0.1], the first confidence is 0.8×0.25 + 0.9×0.2 + 0.6×0.3 + 0.5×0.15 + 0.7×0.1 = 0.705.
  • the second model analysis module 104 is used for inputting the processed text into the multi-task classification model, and by classifying the processed text in the multi-task classification model, a second confidence space is obtained.
  • the second confidence space includes a second confidence that the processed text belongs to a second classification label.
  • the multi-task classification model includes a classification task and a similarity task.
  • the multi-task classification model is used to analyze the processed text to obtain a similarity set between the processed text and each type of standard sentence, together with a confidence set corresponding to the similarities; the similarity set is then filtered to obtain the type corresponding to the standard sentence with the highest similarity to the processed text, which is taken as the second classification label of the processed text, and the confidence set is queried according to the second classification label to obtain the second confidence.
  • the result processing module 105 is configured to determine, according to the first confidence space and the second confidence space, the classification label to which the text to be classified belongs and the classification confidence corresponding to the classification label.
  • in detail, the classification label to which the text to be classified belongs can be determined according to a confidence threshold, for example, selecting, from the first confidence space and the second confidence space, the confidence greater than a confidence threshold such as 0.8, and taking the classification label corresponding to that confidence as the classification result; or selecting, from the first confidence and the second confidence, the confidence greater than a confidence threshold such as 0.5, and taking the classification label corresponding to that confidence as the classification result.
  • the result processing module 105 is specifically configured to:
  • if the first classification label is the same as the second classification label, determine that the classification label to which the text to be classified belongs is the first classification label (equivalently, the second classification label), and that the confidence corresponding to the classification label is the average of the first confidence and the second confidence.
  • for example, if the multi-model structure classification voting model outputs the prediction result (label 1, confidence 0.8) for the processed text and the multi-task classification model prediction result is (label 1, confidence 0.7), the predicted labels are the same (label 1), so the confidences are added and averaged; the type of the text to be classified is finally determined as label 1 with confidence 0.75, and (label 1, confidence 0.75) is output.
  • the result processing module 105 is further configured to:
  • if the first confidence is greater than the second confidence, determine that the classification label to which the text to be classified belongs is the first classification label, and that the confidence corresponding to the classification result is the first confidence multiplied by a first coefficient;
  • if the first confidence is not greater than the second confidence, determine that the classification label to which the text to be classified belongs is the second classification label, and that the confidence corresponding to the classification result is the second confidence multiplied by a second coefficient.
  • the values of the first coefficient and the second coefficient may be the same or different, for example, the values of the first coefficient and the second coefficient are both 0.5.
  • for example, if the prediction result of the processed text output by the multi-model structure classification voting model is (label 1, confidence 0.8) and the prediction result of the multi-task classification model is (label 2, confidence 0.9), the type result with the larger confidence is multiplied by 0.5; the type of the text to be classified is determined as label 2 with confidence 0.9 × 0.5 = 0.45, and (label 2, confidence 0.45) is output.
  • for another example, if the prediction result of the processed text output by the multi-model structure classification voting model is (label 1, confidence 0.8) and the prediction result of the multi-task classification model is (label 2, confidence 0.8), the confidences are the same; one type result is then randomly selected and its confidence is multiplied by 0.5, the type of the text to be classified is finally determined as label 1 or label 2 with confidence 0.4, and (label 1 or label 2, confidence 0.4) is output.
  • the apparatus described in the present application may further include a sample adding module, which is configured to add the classification confidence corresponding to the classification label to which the text to be classified belongs to the training sample set.
  • the multi-task classification model and/or the multi-model structure classification voting model can then be further optimized by continuously expanding the training sample set, which helps to improve the accuracy of the confidence results.
  • the multi-model structure classification voting model and the multi-task classification model are used to classify the text to be classified separately, yielding a first confidence that the processed text belongs to a first classification label and a second confidence that the processed text belongs to a second classification label; the classification label of the processed text is then determined according to the first confidence and the second confidence. Combining different models for category judgment improves the accuracy of text category judgment, and the text classification process requires no manual intervention, improving the efficiency of text category judgment; at the same time, since the first confidence that the processed text belongs to the first classification label is obtained by analyzing the processed text with each base model in the multi-model structure classification voting model, the multiple analyses further improve the accuracy of text category judgment. Therefore, the text classification device proposed in the present application can improve both the reliability of the text classification result and the efficiency of text classification.
  • as shown in FIG. 3, it is a schematic structural diagram of an electronic device implementing the text classification method of the present application.
  • the electronic device 1 may include a processor 10, a memory 11 and a bus, and may also include a computer program stored in the memory 11 and executable on the processor 10, such as a text classification program 12.
  • the memory 11 includes at least one type of readable storage medium, including flash memory, mobile hard disks, multimedia cards, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disks, optical discs, etc.
  • the memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, such as a mobile hard disk of the electronic device 1 .
  • the memory 11 may also be an external storage device of the electronic device 1, such as a pluggable mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card (Flash Card) equipped on the electronic device 1.
  • the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 11 can not only be used to store application software installed in the electronic device 1 and various types of data, such as a code of a text classification program 12, etc., but also can be used to temporarily store data that has been output or will be output.
  • the processor 10 may be composed of integrated circuits, for example, a single packaged integrated circuit, or multiple integrated circuits packaged with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips, etc.
  • the processor 10 is the control core (Control Unit) of the electronic device; it connects the components of the entire electronic device through various interfaces and lines, and performs the various functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 11 (for example, executing a text classification program) and calling the data stored in the memory 11.
  • the bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
  • the bus can be divided into address bus, data bus, control bus and so on.
  • the bus is configured to implement connection communication between the memory 11 and at least one processor 10 and the like.
  • FIG. 3 only shows an electronic device with some of its components; those skilled in the art will understand that the structure shown in FIG. 3 does not constitute a limitation on the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
  • the electronic device 1 may also include a power supply (such as a battery) for powering the components; preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that the power management device implements functions such as charge management, discharge management, and power consumption management.
  • the power source may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and any other components.
  • the electronic device 1 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
  • the electronic device 1 may also include a network interface; optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), which is usually used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the electronic device 1 may further include a user interface, and the user interface may be a display (Display), an input unit (eg, a keyboard (Keyboard)), optionally, the user interface may also be a standard wired interface or a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like.
  • the display may also be appropriately called a display screen or a display unit, which is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
  • the text classification program 12 stored in the memory 11 of the electronic device 1 is a combination of multiple computer programs which, when run in the processor 10, can realize:
  • acquiring a multi-model structure classification voting model and a multi-task classification model, the multi-model structure classification voting model and the multi-task classification model being obtained through a pre-built classification model and a training sample set;
  • acquiring the text to be classified, and preprocessing the text to be classified to obtain the processed text;
  • inputting the processed text into the multi-model structure classification voting model, and classifying the processed text through a plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, the first confidence space including a first confidence that the processed text belongs to a first classification label;
  • inputting the processed text into the multi-task classification model, and classifying the processed text in the multi-task classification model to obtain a second confidence space, the second confidence space including a second confidence that the processed text belongs to a second classification label;
  • determining, according to the first confidence space and the second confidence space, the classification label to which the text to be classified belongs and the classification confidence corresponding to the classification label.
  • the modules/units integrated in the electronic device 1 may be stored in a computer-readable storage medium.
  • the computer-readable storage medium may be volatile or non-volatile.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, and a read-only memory (ROM).
  • the computer-usable storage medium may mainly include a stored program area and a stored data area, where the stored program area may store an operating system, an application program required for at least one function, and the like, and the stored data area may store data created according to use, and the like.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be volatile or non-volatile.
  • the readable storage medium stores a computer program which, when executed by the processor of an electronic device, can realize:
  • acquiring a multi-model structure classification voting model and a multi-task classification model, the multi-model structure classification voting model and the multi-task classification model being obtained through a pre-built classification model and a training sample set;
  • acquiring the text to be classified, and preprocessing the text to be classified to obtain the processed text;
  • inputting the processed text into the multi-model structure classification voting model, and classifying the processed text through a plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, the first confidence space including a first confidence that the processed text belongs to a first classification label;
  • inputting the processed text into the multi-task classification model, and classifying the processed text in the multi-task classification model to obtain a second confidence space, the second confidence space including a second confidence that the processed text belongs to a second classification label;
  • determining, according to the first confidence space and the second confidence space, the classification label to which the text to be classified belongs and the classification confidence corresponding to the classification label.
  • modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • blockchain is essentially a decentralized database, a series of data blocks associated with one another using cryptographic methods; each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text classification method, relating to the technical field of natural language processing, comprising: acquiring a multi-model structure classification voting model and a multi-task classification model; preprocessing a text to be classified to obtain a processed text; inputting the processed text into the multi-model structure classification voting model to obtain a first confidence that the processed text relates to a first classification label; inputting the processed text into the multi-task classification model to obtain a second confidence that the processed text relates to a second classification label; and determining, according to a first confidence space and a second confidence space, a classification label to which the text to be classified relates and a classification confidence corresponding to that classification label. The method also relates to blockchain technology, and the confidence spaces can be stored in a blockchain node. The method can improve both the reliability of text classification results and the efficiency of text classification.

Description

Text classification method, apparatus, electronic device and storage medium
This application claims priority to the Chinese patent application No. CN202110121141.9, titled "Text Classification Method, Apparatus, Electronic Device and Storage Medium", filed with the China Patent Office on January 28, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of natural language processing, and in particular, to a text classification method, apparatus, electronic device, and computer-readable storage medium.
Background
With the development of computer technology, electronic text information on the Internet has been growing geometrically. In order to improve the utilization of information, and to evaluate and make predictions based on information, it is often necessary to classify text. The inventors realized that, in the prior art, in order to improve the reliability of text classification results, manually assisted classification is usually also required, which often reduces classification efficiency.
Summary of the Invention
A text classification method provided by this application includes:
acquiring a multi-model structure classification voting model and a multi-task classification model, the multi-model structure classification voting model and the multi-task classification model being obtained through a pre-built classification model and a training sample set;
acquiring the text to be classified, and preprocessing the text to be classified to obtain the processed text;
inputting the processed text into the multi-model structure classification voting model, and classifying the processed text through a plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, the first confidence space including a first confidence that the processed text belongs to a first classification label;
inputting the processed text into the multi-task classification model, and classifying the processed text in the multi-task classification model to obtain a second confidence space, the second confidence space including a second confidence that the processed text belongs to a second classification label;
determining, according to the first confidence space and the second confidence space, the classification label to which the text to be classified belongs and the classification confidence corresponding to the classification label.
The present application also provides a text classification apparatus, the apparatus including:
a model acquisition module, used to acquire a multi-model structure classification voting model and a multi-task classification model, the multi-model structure classification voting model and the multi-task classification model being obtained through a pre-built classification model and a training sample set;
a text preprocessing module, used to acquire the text to be classified and preprocess the text to be classified to obtain the processed text;
a first model analysis module, used to input the processed text into the multi-model structure classification voting model and classify the processed text through a plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, the first confidence space including a first confidence that the processed text belongs to a first classification label;
a second model analysis module, used to input the processed text into the multi-task classification model and classify the processed text in the multi-task classification model to obtain a second confidence space, the second confidence space including a second confidence that the processed text belongs to a second classification label;
a result processing module, used to determine, according to the first confidence space and the second confidence space, the classification label to which the text to be classified belongs and the classification confidence corresponding to the classification label.
The present application also provides an electronic device, the electronic device including:
at least one processor; and,
a memory communicatively connected to the at least one processor; wherein,
the memory stores computer program instructions executable by the at least one processor, the computer program instructions being executed by the at least one processor to enable the at least one processor to perform the following steps:
acquiring a multi-model structure classification voting model and a multi-task classification model, the multi-model structure classification voting model and the multi-task classification model being obtained through a pre-built classification model and a training sample set;
acquiring the text to be classified, and preprocessing the text to be classified to obtain the processed text;
inputting the processed text into the multi-model structure classification voting model, and classifying the processed text through a plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, the first confidence space including a first confidence that the processed text belongs to a first classification label;
inputting the processed text into the multi-task classification model, and classifying the processed text in the multi-task classification model to obtain a second confidence space, the second confidence space including a second confidence that the processed text belongs to a second classification label;
determining, according to the first confidence space and the second confidence space, the classification label to which the text to be classified belongs and the classification confidence corresponding to the classification label.
The present application also provides a computer-readable storage medium, including a storage data area and a storage program area, the storage data area storing created data and the storage program area storing a computer program; when executed by a processor, the computer program implements the following steps:
acquiring a multi-model structure classification voting model and a multi-task classification model, the multi-model structure classification voting model and the multi-task classification model being obtained through a pre-built classification model and a training sample set;
acquiring the text to be classified, and preprocessing the text to be classified to obtain the processed text;
inputting the processed text into the multi-model structure classification voting model, and classifying the processed text through a plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, the first confidence space including a first confidence that the processed text belongs to a first classification label;
inputting the processed text into the multi-task classification model, and classifying the processed text in the multi-task classification model to obtain a second confidence space, the second confidence space including a second confidence that the processed text belongs to a second classification label;
determining, according to the first confidence space and the second confidence space, the classification label to which the text to be classified belongs and the classification confidence corresponding to the classification label.
附图说明Description of drawings
图1为本申请一实施例提供的一种文本分类方法的流程示意图;1 is a schematic flowchart of a text classification method provided by an embodiment of the present application;
图2为本申请一实施例提供的一种文本分类装置的模块示意图;FIG. 2 is a schematic block diagram of a text classification apparatus according to an embodiment of the present application;
图3为本申请一实施例提供的实现一种文本分类方法的电子设备的内部结构示意图;3 is a schematic diagram of the internal structure of an electronic device implementing a text classification method provided by an embodiment of the present application;
本申请目的的实现、功能特点及优点将结合实施例,参照附图做进一步说明。The realization, functional characteristics and advantages of the purpose of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
具体实施方式Detailed ways
应当理解,此处所描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。It should be understood that the specific embodiments described herein are only used to explain the present application, but not to limit the present application.
本申请实施例提供一种文本分类方法。所述一种文本分类方法的执行主体包括但不限于服务端、终端等能够被配置为执行本申请实施例提供的该方法的电子设备中的至少一种。换言之,所述一种文本分类方法可以由安装在终端设备或服务端设备的软件或硬件来执行,所述软件可以是区块链平台。所述服务端包括但不限于:单台服务器、服务器集群、云端服务器或云端服务器集群等。The embodiment of the present application provides a text classification method. The execution subject of the text classification method includes, but is not limited to, at least one of electronic devices that can be configured to execute the method provided by the embodiments of the present application, such as a server and a terminal. In other words, the text classification method can be executed by software or hardware installed in a terminal device or a server device, and the software can be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
Referring to FIG. 1, which is a schematic flowchart of a text classification method provided by an embodiment of the present application, in this embodiment the text classification method includes:
S1: Acquire a multi-model structure classification voting model and a multi-task classification model, where the multi-model structure classification voting model and the multi-task classification model are obtained through a pre-built classification model and a training sample set.
In the embodiment of the present application, the multi-model structure classification voting model is obtained by training a plurality of base models on a pre-built training sample set, and then ranking the output results of each base model by performance and setting their weights.
In detail, in the embodiment of the present application, before acquiring the multi-model structure classification voting model and the multi-task classification model, the method further includes:
Acquiring the training sample set;
Training the pre-built classification model according to a random forest algorithm and the training sample set to obtain a plurality of text classification models;
Constructing the multi-model structure classification voting model using the plurality of text classification models.
In an optional embodiment of the present application, the classification model is a BERT model.
Specifically, before the training sample set is acquired, the method includes: acquiring a pre-built corpus set, and performing quantization and cleaning operations on the corpus set to obtain the training sample set.
The corpus set consists of texts that have been classified in the past, or corpus texts of known categories obtained from the network.
In the embodiment of the present application, a quantization operation is performed on the corpus set to obtain quantized data, and a cleaning operation is performed on the quantized data to obtain the training sample set. The quantization operation includes converting data of the float32 type in the corpus set into the uint8 data type, which is better suited to text classification model training; the cleaning operation includes deduplicating the quantized data and filling null values.
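As an illustration only, the following Python sketch shows one way such a quantization and cleaning step could look; the min-max scaling scheme, the column names, and the pandas-based deduplication are assumptions made for this sketch and are not prescribed by the embodiment:

    import numpy as np
    import pandas as pd

    def quantize_features(features: np.ndarray) -> np.ndarray:
        # Map float32 feature values into the uint8 range [0, 255] (assumed min-max scheme).
        lo, hi = float(features.min()), float(features.max())
        scaled = (features - lo) / (hi - lo + 1e-12)  # guard against division by zero
        return (scaled * 255).astype(np.uint8)

    def clean_samples(df: pd.DataFrame) -> pd.DataFrame:
        # Deduplicate rows and fill null values, matching the described cleaning operation.
        return df.drop_duplicates().fillna({"text": "", "label": -1})

    # Usage sketch: a tiny corpus with a duplicate row and a null value.
    corpus = pd.DataFrame({"text": ["sample a", "sample a", None], "label": [0, 0, 1]})
    train_set = clean_samples(corpus)
    codes = quantize_features(np.random.rand(8).astype(np.float32))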
In the embodiment of the present application, by performing quantization and cleaning operations on the corpus set, vectorized data with a complete structure can be obtained, which makes the training process more efficient.
The random forest algorithm is an ensemble learning algorithm for classification.
Specifically, in the embodiment of the present application, the random forest algorithm is used to randomly sample, with replacement, 25% of the data from the training sample set a preset number Q of times, and to train the text classification model on each sample, thereby obtaining Q text classification models.
In an optional embodiment of the present application, the preset value Q is 5.
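A minimal sketch of this bootstrap-style training loop follows; train_classifier is a placeholder standing in for BERT fine-tuning, and the list-based dataset interface is an assumption for illustration:

    import random

    Q = 5                # preset number of base models
    SAMPLE_RATIO = 0.25  # 25% of the training sample set per draw

    def bootstrap_models(train_set, train_classifier):
        # Train Q text classification models, each on 25% of the data drawn with replacement.
        models = []
        for _ in range(Q):
            k = int(len(train_set) * SAMPLE_RATIO)
            subset = random.choices(train_set, k=k)  # sampling with replacement
            models.append(train_classifier(subset))
        return models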
Further, in the embodiment of the present application, constructing the multi-model structure classification voting model using the plurality of text classification models includes:
Classifying pre-built model test samples using the plurality of text classification models to obtain classification results and the confidences corresponding to the classification results;
Sorting the plurality of text classification models according to the magnitude of the confidences to obtain a base model ranking table;
Setting weights for the plurality of text classification models according to preset weight gradient values based on the base model ranking table, so as to obtain the multi-model structure classification voting model.
In an embodiment of the present application, the confidence formula of the multi-model structure classification voting model is:

$H(x) = \sum_{q=1}^{Q} p_q \, y_q(x)$

where $p_q$ is the weight of the q-th text classification model, and $y_q(x)$ is the confidence result of the q-th text classification model.
In detail, the model test samples are texts whose categories have already been determined.
For example, the model test samples are used to test the five constructed text classification models, and the results obtained by the five models are: [Model 1: negative emotion, confidence 90%; Model 2: negative emotion, confidence 86%; Model 3: negative emotion, confidence 96%; Model 4: negative emotion, confidence 82%; Model 5: negative emotion, confidence 79%]. The five text classification models are then arranged by confidence to obtain the base model ranking table [Model 3; Model 1; Model 2; Model 4; Model 5]. According to the ranking table, weights are assigned as [Model 3: weight 0.3; Model 1: weight 0.25; Model 2: weight 0.2; Model 4: weight 0.15; Model 5: weight 0.1], and the five text classification models are combined according to these weights to obtain the multi-model structure classification voting model.
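The ranking and weight assignment, together with the weighted vote $H(x) = \sum_q p_q \, y_q(x)$, can be sketched as follows; the preset weight gradient [0.3, 0.25, 0.2, 0.15, 0.1] is taken from the example above, and the helper names are illustrative:

    WEIGHT_GRADIENT = [0.3, 0.25, 0.2, 0.15, 0.1]  # preset weight gradient values

    def assign_weights(test_confidences):
        # Rank models by test confidence (descending) and attach the preset weight gradient.
        order = sorted(range(len(test_confidences)),
                       key=lambda q: test_confidences[q], reverse=True)
        weights = [0.0] * len(test_confidences)
        for rank, q in enumerate(order):
            weights[q] = WEIGHT_GRADIENT[rank]
        return weights

    def vote_confidence(weights, outputs):
        # Weighted voting confidence: H(x) = sum over q of p_q * y_q(x).
        return sum(p * y for p, y in zip(weights, outputs))

    # Confidences 0.90, 0.86, 0.96, 0.82, 0.79 from the example yield
    # weights [0.25, 0.2, 0.3, 0.15, 0.1] for Models 1 to 5.
    w = assign_weights([0.90, 0.86, 0.96, 0.82, 0.79])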
In detail, in the embodiment of the present application, before acquiring the multi-model structure classification voting model and the multi-task classification model, the method further includes:
Combining the classification loss in the classification model with a pre-built similarity loss to obtain an improved loss, and replacing the classification loss in the classification model with the improved loss to obtain an optimized classification model;
Performing feature extraction on the training sample set using the feature extraction neural network in the optimized classification model to obtain sentence vectors;
Training the optimized classification model with the sentence vectors until the descending gradient of the improved loss of the optimized classification model is smaller than a preset loss threshold within a preset number of training steps, thereby obtaining the multi-task classification model.
In detail, in the embodiment of the present application, the training sample set includes, in addition to the corpus, standard sentences of different types, and the pre-built similarity loss is:

$L_{sim} = \frac{1}{N} \sum_{j=1}^{N} sim(\bar{x}, x_j)$

where N is the number of standard sentences in the training sample set, each standard sentence in the training samples represents one type, $\bar{x}$ is the sentence vector of the specified corpus, $x_j$ is the sentence vector of the j-th standard sentence, and $sim(\bar{x}, x_j)$ denotes the similarity between the two sentence vectors (for example, their cosine similarity).
In the embodiment of the present application, the obtained improved loss is:

$L = -w_i \sum_{c} y_c \log(p_c) + w_j L_{sim}$

where c is the category of the standard sentence; $y_c$ is the indicator variable of category c, which is 1 if category c is the same as the classification result obtained by the optimized classification model and 0 otherwise; $p_c$ is the predicted probability of category c; and $w_i$ and $w_j$ are the respective weights of the confidence (classification) loss and the similarity loss.
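Under the reconstruction above, the improved loss could be sketched as follows; the cross-entropy form of the classification term and the cosine form of the similarity term are assumptions consistent with the symbol definitions, not a verbatim reproduction of the patent's formula images:

    import numpy as np

    def cosine(a, b):
        # Assumed cosine similarity between two sentence vectors.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def improved_loss(p, y, x_bar, standard_vecs, w_i, w_j):
        # w_i * classification (cross-entropy) loss + w_j * similarity loss.
        cls_loss = -float(np.sum(y * np.log(p + 1e-12)))  # -sum_c y_c log p_c
        sim_loss = float(np.mean([cosine(x_bar, x_j) for x_j in standard_vecs]))
        return w_i * cls_loss + w_j * sim_loss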
In the embodiment of the present application, the confidence of the classification label to which each corpus belongs is calculated as:

$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$

where $z_j$ is the classification result of the j-th short sentence in the corpus, that is, the classification label of the j-th short sentence, and K is the number of classification results.
In specific implementation, the embodiment of the present application may use a Siamese (twin) network and continuously train the optimized classification model with the sentence vectors. During training, the improved loss is continuously minimized; when the descending gradient of the improved loss remains smaller than the preset loss threshold within the preset number of training steps, the training process is stopped and the multi-task classification model is obtained.
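The stopping rule can be illustrated with the sketch below; train_step, the threshold, and the step window are placeholders, since the embodiment does not prescribe a training framework or concrete values:

    LOSS_THRESHOLD = 1e-4  # preset loss threshold (illustrative value)
    WINDOW = 100           # preset number of training steps (illustrative value)

    def train_until_converged(model, batches, train_step):
        # Stop once the descent of the improved loss stays below the threshold
        # for WINDOW consecutive training steps.
        prev_loss, small_steps = None, 0
        for batch in batches:
            loss = train_step(model, batch)  # one optimization step; returns the improved loss
            if prev_loss is not None and abs(prev_loss - loss) < LOSS_THRESHOLD:
                small_steps += 1
                if small_steps >= WINDOW:
                    break
            else:
                small_steps = 0
            prev_loss = loss
        return model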
S2: Acquire a text to be classified, and preprocess the text to be classified to obtain a processed text.
In the embodiment of the present application, a pre-built recall engine may be used to obtain the text to be classified from the Internet or from a local storage space.
In detail, in the embodiment of the present application, S2 includes:
Performing punctuation-based segmentation or sentence-length segmentation on the text to be classified to obtain the processed text.
Specifically, when the length of the text to be classified is less than 512 characters, punctuation-based segmentation is performed on the text to be classified, that is, the text is divided according to punctuation marks; when the length of the text to be classified is greater than 512 characters, sentence-length segmentation is performed, for example, the text to be classified is divided into processed texts each shorter than 512 characters.
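A sketch of this preprocessing rule follows; the exact punctuation delimiter set is an assumption for illustration:

    import re

    MAX_LEN = 512

    def preprocess(text: str):
        # Split by punctuation for short texts, by fixed length for long texts.
        if len(text) < MAX_LEN:
            parts = re.split(r"[。！？；，!?;,]", text)  # assumed delimiter set
            return [p for p in parts if p]
        # Sentence-length segmentation into chunks shorter than 512 characters.
        return [text[i:i + MAX_LEN - 1] for i in range(0, len(text), MAX_LEN - 1)]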
S3: Input the processed text into the multi-model structure classification voting model, and classify the processed text through the plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, where the first confidence space includes a first confidence that the processed text belongs to a first classification label.
In the embodiment of the present application, the multi-model structure classification voting model is used to classify the processed text. Specifically, the five models, Model 1 to Model 5 described above, classify the processed text to obtain type results and the confidences corresponding to those type results; the confidences produced by the five models are then combined by weight to obtain the first classification label and the first confidence of the processed text.
Specifically, if the confidences obtained for the processed text by the five models are [0.8, 0.9, 0.6, 0.5, 0.7] and the weights of the five models are [0.25, 0.2, 0.3, 0.15, 0.1], the first confidence is 0.8×0.25 + 0.9×0.2 + 0.6×0.3 + 0.5×0.15 + 0.7×0.1, that is, 0.705.
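Reusing the vote_confidence helper sketched earlier, this worked example reads:

    # Weighted voting over the five base-model confidences from the example.
    first_confidence = vote_confidence([0.25, 0.2, 0.3, 0.15, 0.1],
                                       [0.8, 0.9, 0.6, 0.5, 0.7])
    # 0.2 + 0.18 + 0.18 + 0.075 + 0.07 = 0.705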
S4: Input the processed text into the multi-task classification model, and classify the processed text through the multi-task classification model to obtain a second confidence space, where the second confidence space includes a second confidence that the processed text belongs to a second classification label.
In the embodiment of the present application, the multi-task classification model includes a classification task and a similarity task.
Specifically, the multi-task classification model is used to analyze the processed text to obtain a set of similarities between the processed text and the standard sentences of each type, together with a set of confidences corresponding to those similarities. The similarity set is then filtered: the type corresponding to the standard sentence with the highest similarity to the processed text is taken as the second classification label of the processed text, and the confidence set is queried according to the second classification label to obtain the second confidence.
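This similarity branch can be sketched as a nearest-standard-sentence lookup over precomputed sentence vectors; the cosine metric and the flat data layout are assumptions standing in for the trained multi-task model:

    import numpy as np

    def classify_by_similarity(x_bar, standard_vecs, labels, confidences):
        # Pick the label of the most similar standard sentence and its confidence.
        sims = [float(np.dot(x_bar, x_j) /
                      (np.linalg.norm(x_bar) * np.linalg.norm(x_j) + 1e-12))
                for x_j in standard_vecs]        # assumed cosine similarity
        best = int(np.argmax(sims))              # highest-similarity standard sentence
        return labels[best], confidences[best]   # second label, second confidence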
S5: Determine, according to the first confidence space and the second confidence space, the classification label to which the text to be classified belongs and the classification confidence corresponding to the classification label.
In the embodiment of the present application, after the classification label and confidence of the processed text are obtained from the multi-model structure classification voting model and the multi-task classification model, the classification label of the text to be classified can be determined according to a confidence threshold chosen for the business scenario.
For example, from the first confidence and the second confidence, a confidence greater than a confidence threshold (such as 0.8), together with its corresponding classification label, is selected as the classification result; alternatively, a confidence greater than a lower threshold (such as 0.5) and its corresponding classification label are selected as the classification result.
In detail, in the embodiment of the present application, S5 includes:
When the first classification label is the same as the second classification label, determining that the classification label to which the text to be classified belongs is the first classification label (equivalently, the second classification label), and determining that the confidence corresponding to that classification label is the average of the first confidence and the second confidence.
For example, if for the processed text the multi-model structure classification voting model outputs the prediction (label 1, confidence 0.8) and the multi-task classification model outputs (label 1, confidence 0.7), the predicted types are both label 1, so the confidences are averaged; the type of the text to be classified is determined to be label 1 with confidence 0.75, and (label 1, confidence 0.75) is output.
In detail, in the embodiment of the present application, S5 further includes:
When the first classification label is different from the second classification label, judging whether the first confidence is greater than the second confidence;
If the first confidence is greater than the second confidence, determining that the classification label to which the text to be classified belongs is the first classification label, and that the confidence corresponding to the category result is the first confidence multiplied by a first coefficient;
If the first confidence is not greater than the second confidence, determining that the classification label to which the text to be classified belongs is the second classification label, and that the confidence corresponding to the category result is the second confidence multiplied by a second coefficient.
The values of the first coefficient and the second coefficient may be the same or different; for example, both may be 0.5.
For example, when the multi-model structure classification voting model outputs the prediction (label 1, confidence 0.8) and the multi-task classification model outputs (label 2, confidence 0.9), the type result with the larger confidence is taken and the corresponding confidence is multiplied by 0.5; the type of the text to be classified is determined to be label 2 with confidence 0.45, and (label 2, confidence 0.45) is output.
When the multi-model structure classification voting model outputs (label 1, confidence 0.8) and the multi-task classification model outputs (label 2, confidence 0.8), the confidences are equal, so one type result is selected at random and the corresponding confidence is multiplied by 0.5; the type of the text to be classified is determined to be label 1 or label 2 with confidence 0.4, and (label 1 or 2, confidence 0.4) is output.
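The decision rules of S5 reduce to a short sketch; the 0.5 coefficients follow the examples above, and the random choice mirrors the equal-confidence case:

    import random

    def fuse(label1, conf1, label2, conf2, coef1=0.5, coef2=0.5):
        # Combine the voting-model and multi-task-model predictions per S5.
        if label1 == label2:
            return label1, (conf1 + conf2) / 2  # same label: average the confidences
        if conf1 > conf2:
            return label1, conf1 * coef1        # first label wins: scale by first coefficient
        if conf2 > conf1:
            return label2, conf2 * coef2        # second label wins: scale by second coefficient
        return random.choice([(label1, conf1 * coef1),
                              (label2, conf2 * coef2)])  # tie: pick one at random

    # Examples from the text:
    # fuse(1, 0.8, 1, 0.7) -> (1, 0.75)
    # fuse(1, 0.8, 2, 0.9) -> (2, 0.45)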
Further, in other optional embodiments of the present application, the classification label to which the text to be classified belongs, together with its corresponding classification confidence, may be added to the training sample set.
By continuously expanding the training sample set, the multi-task classification model and/or the multi-model structure classification voting model can be further optimized, which helps improve the accuracy of the confidence results.
In the embodiment of the present application, the text to be classified is classified separately by the multi-model structure classification voting model and the multi-task classification model, obtaining a first confidence that the processed text belongs to a first classification label and a second confidence that the processed text belongs to a second classification label; the classification label of the processed text is then determined from the first confidence and the second confidence. Combining the category judgments of different models improves the accuracy of text category judgment, and because no manual intervention is needed during classification, the efficiency of text category judgment is improved as well. Meanwhile, since the first confidence is obtained by having each base model in the multi-model structure classification voting model analyze the processed text, these multiple analyses further improve the accuracy of category judgment. Therefore, the text classification method proposed in the present application can achieve the purpose of improving both the reliability of the text classification results and the efficiency of text classification.
FIG. 2 is a schematic block diagram of the text classification apparatus of the present application.
The text classification apparatus 100 described in the present application can be installed in an electronic device. According to the implemented functions, the text classification apparatus may include a model acquisition module 101, a text preprocessing module 102, a first model analysis module 103, a second model analysis module 104, and a result processing module 105. The modules described in the present application may also be referred to as units, each being a series of computer program segments that can be executed by the processor of an electronic device and can perform a fixed function, and that are stored in the memory of the electronic device.
In this embodiment, the functions of each module/unit are as follows:
The model acquisition module 101 is configured to acquire a multi-model structure classification voting model and a multi-task classification model, where the multi-model structure classification voting model and the multi-task classification model are obtained through a pre-built classification model and a training sample set.
In the embodiment of the present application, the multi-model structure classification voting model is obtained by training a plurality of base models on a pre-built training sample set, and then ranking the output results of each base model by performance and setting their weights.
In detail, in the embodiment of the present application, the apparatus further includes a multi-model structure classification voting model construction module, and the construction module includes:
An acquisition unit, configured to acquire the training sample set;
A first training unit, configured to train the pre-built classification model according to the random forest algorithm and the training sample set to obtain a plurality of text classification models;
A construction unit, configured to construct the multi-model structure classification voting model using the plurality of text classification models.
In an optional embodiment of the present application, the classification model is a BERT model.
Specifically, the acquisition unit is configured to: before acquiring the training sample set, acquire a pre-built corpus set, and perform quantization and cleaning operations on the corpus set to obtain the training sample set.
The corpus set consists of texts that have been classified in the past, or corpus texts of known categories obtained from the network.
In the embodiment of the present application, a quantization operation is performed on the corpus set to obtain quantized data, and a cleaning operation is performed on the quantized data to obtain the training sample set. The quantization operation includes converting data of the float32 type in the corpus set into the uint8 data type, which is better suited to text classification model training; the cleaning operation includes deduplicating the quantized data and filling null values.
In the embodiment of the present application, by performing quantization and cleaning operations on the corpus set, vectorized data with a complete structure can be obtained, which makes the training process more efficient.
The random forest algorithm is an ensemble learning algorithm for classification.
Specifically, in the embodiment of the present application, the random forest algorithm is used to randomly sample, with replacement, 25% of the data from the training sample set a preset number Q of times, and to train the text classification model on each sample, thereby obtaining Q text classification models.
In an optional embodiment of the present application, the preset value Q is 5.
Further, in the embodiment of the present application, the construction unit is specifically configured to:
Classify pre-built model test samples using the plurality of text classification models to obtain classification results and the confidences corresponding to the classification results;
Sort the plurality of text classification models according to the magnitude of the confidences to obtain a base model ranking table;
Set weights for the plurality of text classification models according to preset weight gradient values based on the base model ranking table, so as to obtain the multi-model structure classification voting model.
In an embodiment of the present application, the confidence formula of the multi-model structure classification voting model is:

$H(x) = \sum_{q=1}^{Q} p_q \, y_q(x)$

where $p_q$ is the weight of the q-th text classification model, and $y_q(x)$ is the confidence result of the q-th text classification model.
In detail, the model test samples are texts whose categories have already been determined.
For example, the model test samples are used to test the five constructed text classification models, and the results obtained by the five models are: [Model 1: negative emotion, confidence 90%; Model 2: negative emotion, confidence 86%; Model 3: negative emotion, confidence 96%; Model 4: negative emotion, confidence 82%; Model 5: negative emotion, confidence 79%]. The five text classification models are then arranged by confidence to obtain the base model ranking table [Model 3; Model 1; Model 2; Model 4; Model 5]. According to the ranking table, weights are assigned as [Model 3: weight 0.3; Model 1: weight 0.25; Model 2: weight 0.2; Model 4: weight 0.15; Model 5: weight 0.1], and the five text classification models are combined according to these weights to obtain the multi-model structure classification voting model.
In detail, in the embodiment of the present application, the apparatus further includes a multi-task classification model acquisition module, and the multi-task classification model acquisition module includes:
An optimized classification model acquisition unit, configured to combine the classification loss in the classification model with a pre-built similarity loss to obtain an improved loss, and replace the classification loss in the classification model with the improved loss to obtain an optimized classification model;
A feature extraction unit, configured to perform feature extraction on the training sample set using the feature extraction neural network in the optimized classification model to obtain sentence vectors;
A second training unit, configured to train the optimized classification model with the sentence vectors until the descending gradient of the improved loss of the optimized classification model is smaller than a preset loss threshold within a preset number of training steps, thereby obtaining the multi-task classification model.
In detail, in the embodiment of the present application, the training sample set includes, in addition to the corpus, standard sentences of different types, and the pre-built similarity loss is:

$L_{sim} = \frac{1}{N} \sum_{j=1}^{N} sim(\bar{x}, x_j)$

where N is the number of standard sentences in the training sample set, each standard sentence in the training samples represents one type, $\bar{x}$ is the sentence vector of the specified corpus, $x_j$ is the sentence vector of the j-th standard sentence, and $sim(\bar{x}, x_j)$ denotes the similarity between the two sentence vectors (for example, their cosine similarity).
In the embodiment of the present application, the obtained improved loss is:

$L = -w_i \sum_{c} y_c \log(p_c) + w_j L_{sim}$

where c is the category of the standard sentence; $y_c$ is the indicator variable of category c, which is 1 if category c is the same as the classification result obtained by the optimized classification model and 0 otherwise; $p_c$ is the predicted probability of category c; and $w_i$ and $w_j$ are the respective weights of the confidence (classification) loss and the similarity loss.
In the embodiment of the present application, the confidence of the classification label to which each corpus belongs is calculated as:

$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$

where $z_j$ is the classification result of the j-th short sentence in the corpus, that is, the classification label of the j-th short sentence, and K is the number of classification results.
In specific implementation, the embodiment of the present application may use a Siamese (twin) network and continuously train the optimized classification model with the sentence vectors. During training, the improved loss is continuously minimized; when the descending gradient of the improved loss remains smaller than the preset loss threshold within the preset number of training steps, the training process is stopped and the multi-task classification model is obtained.
The text preprocessing module 102 is configured to acquire a text to be classified and preprocess the text to be classified to obtain a processed text.
In the embodiment of the present application, a pre-built recall engine may be used to obtain the text to be classified from the Internet or from a local storage space.
In detail, the text preprocessing module 102 is specifically configured to:
Perform punctuation-based segmentation or sentence-length segmentation on the text to be classified to obtain the processed text.
Specifically, when the length of the text to be classified is less than 512 characters, punctuation-based segmentation is performed on the text to be classified, that is, the text is divided according to punctuation marks; when the length of the text to be classified is greater than 512 characters, sentence-length segmentation is performed, for example, the text to be classified is divided into processed texts each shorter than 512 characters.
The first model analysis module 103 is configured to input the processed text into the multi-model structure classification voting model and classify the processed text through the plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, where the first confidence space includes a first confidence that the processed text belongs to a first classification label.
In the embodiment of the present application, the multi-model structure classification voting model is used to classify the processed text. Specifically, the five models, Model 1 to Model 5 described above, classify the processed text to obtain type results and the confidences corresponding to those type results; the confidences produced by the five models are then combined by weight to obtain the first classification label and the first confidence of the processed text.
Specifically, if the confidences obtained for the processed text by the five models are [0.8, 0.9, 0.6, 0.5, 0.7] and the weights of the five models are [0.25, 0.2, 0.3, 0.15, 0.1], the first confidence is 0.8×0.25 + 0.9×0.2 + 0.6×0.3 + 0.5×0.15 + 0.7×0.1, that is, 0.705.
The second model analysis module 104 is configured to input the processed text into the multi-task classification model and classify the processed text through the multi-task classification model to obtain a second confidence space, where the second confidence space includes a second confidence that the processed text belongs to a second classification label.
In the embodiment of the present application, the multi-task classification model includes a classification task and a similarity task.
Specifically, the multi-task classification model is used to analyze the processed text to obtain a set of similarities between the processed text and the standard sentences of each type, together with a set of confidences corresponding to those similarities. The similarity set is then filtered: the type corresponding to the standard sentence with the highest similarity to the processed text is taken as the second classification label of the processed text, and the confidence set is queried according to the second classification label to obtain the second confidence.
The result processing module 105 is configured to determine, according to the first confidence space and the second confidence space, the classification label to which the text to be classified belongs and the classification confidence corresponding to the classification label.
In the embodiment of the present application, after the classification label and confidence of the processed text are obtained from the multi-model structure classification voting model and the multi-task classification model, the classification label of the text to be classified can be determined according to a confidence threshold chosen for the business scenario.
For example, from the first confidence and the second confidence, a confidence greater than a confidence threshold (such as 0.8), together with its corresponding classification label, is selected as the classification result; alternatively, a confidence greater than a lower threshold (such as 0.5) and its corresponding classification label are selected as the classification result.
In detail, in the embodiment of the present application, the result processing module 105 is specifically configured to:
When the first classification label is the same as the second classification label, determine that the classification label to which the text to be classified belongs is the first classification label (equivalently, the second classification label), and determine that the confidence corresponding to that classification label is the average of the first confidence and the second confidence.
For example, if for the processed text the multi-model structure classification voting model outputs the prediction (label 1, confidence 0.8) and the multi-task classification model outputs (label 1, confidence 0.7), the predicted types are both label 1, so the confidences are averaged; the type of the text to be classified is determined to be label 1 with confidence 0.75, and (label 1, confidence 0.75) is output.
In detail, in the embodiment of the present application, the result processing module 105 is further specifically configured to:
When the first classification label is different from the second classification label, judge whether the first confidence is greater than the second confidence;
If the first confidence is greater than the second confidence, determine that the classification label to which the text to be classified belongs is the first classification label, and that the confidence corresponding to the category result is the first confidence multiplied by a first coefficient;
If the first confidence is not greater than the second confidence, determine that the classification label to which the text to be classified belongs is the second classification label, and that the confidence corresponding to the category result is the second confidence multiplied by a second coefficient.
The values of the first coefficient and the second coefficient may be the same or different; for example, both may be 0.5.
For example, when the multi-model structure classification voting model outputs the prediction (label 1, confidence 0.8) and the multi-task classification model outputs (label 2, confidence 0.9), the type result with the larger confidence is taken and the corresponding confidence is multiplied by 0.5; the type of the text to be classified is determined to be label 2 with confidence 0.45, and (label 2, confidence 0.45) is output.
When the multi-model structure classification voting model outputs (label 1, confidence 0.8) and the multi-task classification model outputs (label 2, confidence 0.8), the confidences are equal, so one type result is selected at random and the corresponding confidence is multiplied by 0.5; the type of the text to be classified is determined to be label 1 or label 2 with confidence 0.4, and (label 1 or 2, confidence 0.4) is output.
The apparatus described in the present application may further include a sample adding module, configured to add the classification label to which the text to be classified belongs, together with its corresponding classification confidence, to the training sample set.
By continuously expanding the training sample set, the multi-task classification model and/or the multi-model structure classification voting model can be further optimized, which helps improve the accuracy of the confidence results.
In the embodiment of the present application, the text to be classified is classified separately by the multi-model structure classification voting model and the multi-task classification model, obtaining a first confidence that the processed text belongs to a first classification label and a second confidence that the processed text belongs to a second classification label; the classification label of the processed text is then determined from the first confidence and the second confidence. Combining the category judgments of different models improves the accuracy of text category judgment, and because no manual intervention is needed during classification, the efficiency of text category judgment is improved as well. Meanwhile, since the first confidence is obtained by having each base model in the multi-model structure classification voting model analyze the processed text, these multiple analyses further improve the accuracy of category judgment. Therefore, the text classification apparatus proposed in the present application can achieve the purpose of improving both the reliability of the text classification results and the efficiency of text classification.
FIG. 3 is a schematic structural diagram of an electronic device implementing the text classification method of the present application.
The electronic device 1 may include a processor 10, a memory 11, and a bus, and may further include a computer program, such as a text classification program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, including a flash memory, a mobile hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), a magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, for example a mobile hard disk of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a pluggable mobile hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the electronic device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 can be used not only to store application software installed in the electronic device 1 and various types of data, such as the code of the text classification program 12, but also to temporarily store data that has been output or is to be output.
In some embodiments, the processor 10 may be composed of integrated circuits, for example a single packaged integrated circuit, or multiple integrated circuits packaged with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is the control unit of the electronic device; it connects the components of the entire electronic device through various interfaces and lines, and executes the various functions of the electronic device 1 and processes data by running or executing programs or modules stored in the memory 11 (for example, executing the text classification program) and calling data stored in the memory 11.
The bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. The bus is configured to implement connection and communication between the memory 11, the at least one processor 10, and the like.
FIG. 3 only shows an electronic device with certain components. Those skilled in the art will understand that the structure shown in FIG. 3 does not constitute a limitation on the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
For example, although not shown, the electronic device 1 may also include a power supply (such as a battery) for powering the various components. Preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device. The power supply may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and other components. The electronic device 1 may further include various sensors, a Bluetooth module, a Wi-Fi module, and the like, which will not be repeated here.
Further, the electronic device 1 may also include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), which is usually used to establish a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further include a user interface. The user interface may be a display or an input unit (such as a keyboard); optionally, the user interface may also be a standard wired interface or a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (organic light-emitting diode) touch device, or the like. The display may also appropriately be called a display screen or a display unit, and is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
It should be understood that the embodiments are for illustration only, and the scope of the patent application is not limited by this structure.
The text classification program 12 stored in the memory 11 of the electronic device 1 is a combination of multiple computer programs which, when run on the processor 10, can implement:
Acquiring a multi-model structure classification voting model and a multi-task classification model, where the multi-model structure classification voting model and the multi-task classification model are obtained through a pre-built classification model and a training sample set;
Acquiring a text to be classified, and preprocessing the text to be classified to obtain a processed text;
Inputting the processed text into the multi-model structure classification voting model, and classifying the processed text through a plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, where the first confidence space includes a first confidence that the processed text belongs to a first classification label;
Inputting the processed text into the multi-task classification model, and classifying the processed text through the multi-task classification model to obtain a second confidence space, where the second confidence space includes a second confidence that the processed text belongs to a second classification label;
Determining, according to the first confidence space and the second confidence space, the classification label to which the text to be classified belongs and the classification confidence corresponding to the classification label.
Further, if the modules/units integrated in the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. The computer-readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, or a read-only memory (ROM).
Further, the computer-usable storage medium may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required for at least one function, and the like, and the data storage area may store data created according to the use of blockchain nodes, and the like.
The present application also provides a computer-readable storage medium, which may be volatile or non-volatile. The readable storage medium stores a computer program which, when executed by the processor of an electronic device, can implement:
Acquiring a multi-model structure classification voting model and a multi-task classification model, where the multi-model structure classification voting model and the multi-task classification model are obtained through a pre-built classification model and a training sample set;
Acquiring a text to be classified, and preprocessing the text to be classified to obtain a processed text;
Inputting the processed text into the multi-model structure classification voting model, and classifying the processed text through a plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, where the first confidence space includes a first confidence that the processed text belongs to a first classification label;
Inputting the processed text into the multi-task classification model, and classifying the processed text through the multi-task classification model to obtain a second confidence space, where the second confidence space includes a second confidence that the processed text belongs to a second classification label;
Determining, according to the first confidence space and the second confidence space, the classification label to which the text to be classified belongs and the classification confidence corresponding to the classification label.
In the several embodiments provided in this application, it should be understood that the disclosed device, apparatus, and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the modules is only a division by logical function, and other division manners may be used in actual implementation.
The modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical units; that is, they may be located in one place, or they may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It is apparent to those skilled in the art that this application is not limited to the details of the above exemplary embodiments, and that this application can be implemented in other specific forms without departing from the spirit or essential characteristics of this application.
Therefore, the embodiments should be regarded in all respects as exemplary and non-restrictive, and the scope of this application is defined by the appended claims rather than by the foregoing description; all changes falling within the meaning and scope of the equivalents of the claims are therefore intended to be embraced in this application. Any reference signs in the claims should not be construed as limiting the claims involved.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association with one another using cryptographic methods; each data block contains a batch of network transaction information, used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may comprise a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or apparatuses recited in the system claims may also be implemented by one unit or apparatus through software or hardware. Terms such as "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of this application and not to limit them. Although this application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of this application without departing from the spirit and scope of the technical solutions of this application.

Claims (20)

  1. A text classification method, wherein the method comprises:
    Obtaining a multi-model structure classification voting model and a multi-task classification model, the multi-model structure classification voting model and the multi-task classification model being obtained through a pre-built classification model and a training sample set;
    Obtaining text to be classified, and preprocessing the text to be classified to obtain processed text;
    Inputting the processed text into the multi-model structure classification voting model, and classifying the processed text through a plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, the first confidence space including a first confidence that the processed text belongs to a first classification label;
    Inputting the processed text into the multi-task classification model, and classifying the processed text through the multi-task classification model to obtain a second confidence space, the second confidence space including a second confidence that the processed text belongs to a second classification label;
    Determining, according to the first confidence space and the second confidence space, the classification label to which the text to be classified belongs and the classification confidence corresponding to the classification label.
  2. The text classification method according to claim 1, wherein before the obtaining of the multi-model structure classification voting model and the multi-task classification model, the method further comprises:
    Obtaining the training sample set;
    Training the pre-built classification model according to a random forest algorithm and the training sample set to obtain a plurality of text classification models;
    Constructing the multi-model structure classification voting model using the plurality of text classification models.
  3. The text classification method according to claim 2, wherein the constructing of the multi-model structure classification voting model using the plurality of text classification models comprises:
    Classifying a pre-built model test sample using the plurality of text classification models to obtain classification results and confidences corresponding to the classification results;
    Sorting the plurality of text classification models according to the magnitude of the confidences to obtain a base-model ranking table;
    Setting weights for the plurality of text classification models according to preset weight gradient values based on the base-model ranking table, to obtain the multi-model structure classification voting model.
  4. The text classification method according to claim 1, wherein before the obtaining of the multi-model structure classification voting model and the multi-task classification model, the method further comprises:
    Combining the classification loss in the classification model with a pre-built similarity loss to obtain an improved loss, and replacing the classification loss in the classification model with the improved loss to obtain an optimized classification model;
    Performing feature extraction on the training sample set using the feature extraction neural network in the optimized classification model to obtain sentence vectors;
    Training the optimized classification model through the sentence vectors until the descending gradient of the improved loss of the optimized classification model is smaller than a preset loss threshold within preset training steps, to obtain the multi-task classification model.
  5. The text classification method according to claim 1, wherein the determining, according to the first confidence space and the second confidence space, of the classification label to which the text to be classified belongs comprises:
    When the first classification label is the same as the second classification label, determining that the classification label to which the text to be classified belongs is the first classification label and/or the second classification label, and determining that the confidence corresponding to the classification label of the text to be classified is the average of the first confidence and the second confidence.
  6. The text classification method according to claim 1, wherein the determining, according to the first confidence space and the second confidence space, of the classification label to which the text to be classified belongs comprises:
    When the first classification label is different from the second classification label, judging whether the first confidence is greater than the second confidence;
    If the first confidence is greater than the second confidence, determining that the classification label to which the text to be classified belongs is the first classification label, and that the confidence corresponding to the classification result is the first confidence multiplied by a first coefficient;
    If the first confidence is not greater than the second confidence, determining that the classification label to which the text to be classified belongs is the second classification label, and that the confidence corresponding to the classification result is the second confidence multiplied by a second coefficient.
  7. The text classification method according to any one of claims 1 to 6, wherein the preprocessing of the text to be classified to obtain the processed text comprises:
    Performing punctuation-mark segmentation or sentence-length segmentation on the text to be classified to obtain the processed text.
  8. A text classification apparatus, wherein the apparatus comprises:
    a model acquisition module, configured to obtain a multi-model structure classification voting model and a multi-task classification model, the multi-model structure classification voting model and the multi-task classification model being obtained through a pre-built classification model and a training sample set;
    a text preprocessing module, configured to obtain text to be classified and preprocess the text to be classified to obtain processed text;
    a first model analysis module, configured to input the processed text into the multi-model structure classification voting model, and classify the processed text through a plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, the first confidence space including a first confidence that the processed text belongs to a first classification label;
    a second model analysis module, configured to input the processed text into the multi-task classification model, and classify the processed text through the multi-task classification model to obtain a second confidence space, the second confidence space including a second confidence that the processed text belongs to a second classification label;
    a result processing module, configured to determine, according to the first confidence space and the second confidence space, the classification label to which the text to be classified belongs and the classification confidence corresponding to the classification label.
  9. An electronic device, wherein the electronic device comprises:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores computer program instructions executable by the at least one processor, the computer program instructions being executed by the at least one processor to enable the at least one processor to perform the following steps:
    Obtaining a multi-model structure classification voting model and a multi-task classification model, the multi-model structure classification voting model and the multi-task classification model being obtained through a pre-built classification model and a training sample set;
    Obtaining text to be classified, and preprocessing the text to be classified to obtain processed text;
    Inputting the processed text into the multi-model structure classification voting model, and classifying the processed text through a plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, the first confidence space including a first confidence that the processed text belongs to a first classification label;
    Inputting the processed text into the multi-task classification model, and classifying the processed text through the multi-task classification model to obtain a second confidence space, the second confidence space including a second confidence that the processed text belongs to a second classification label;
    Determining, according to the first confidence space and the second confidence space, the classification label to which the text to be classified belongs and the classification confidence corresponding to the classification label.
  10. The electronic device according to claim 9, wherein before the obtaining of the multi-model structure classification voting model and the multi-task classification model, the computer program instructions, when executed by the at least one processor, further implement the following steps:
    Obtaining the training sample set;
    Training the pre-built classification model according to a random forest algorithm and the training sample set to obtain a plurality of text classification models;
    Constructing the multi-model structure classification voting model using the plurality of text classification models.
  11. The electronic device according to claim 10, wherein the constructing of the multi-model structure classification voting model using the plurality of text classification models comprises:
    Classifying a pre-built model test sample using the plurality of text classification models to obtain classification results and confidences corresponding to the classification results;
    Sorting the plurality of text classification models according to the magnitude of the confidences to obtain a base-model ranking table;
    Setting weights for the plurality of text classification models according to preset weight gradient values based on the base-model ranking table, to obtain the multi-model structure classification voting model.
  12. The electronic device according to claim 9, wherein before the obtaining of the multi-model structure classification voting model and the multi-task classification model, the computer program instructions, when executed by the at least one processor, further implement the following steps:
    Combining the classification loss in the classification model with a pre-built similarity loss to obtain an improved loss, and replacing the classification loss in the classification model with the improved loss to obtain an optimized classification model;
    Performing feature extraction on the training sample set using the feature extraction neural network in the optimized classification model to obtain sentence vectors;
    Training the optimized classification model through the sentence vectors until the descending gradient of the improved loss of the optimized classification model is smaller than a preset loss threshold within preset training steps, to obtain the multi-task classification model.
  13. The electronic device according to claim 9, wherein the determining, according to the first confidence space and the second confidence space, of the classification label to which the text to be classified belongs comprises:
    When the first classification label is the same as the second classification label, determining that the classification label to which the text to be classified belongs is the first classification label and/or the second classification label, and determining that the confidence corresponding to the classification label of the text to be classified is the average of the first confidence and the second confidence.
  14. The electronic device according to claim 9, wherein the determining, according to the first confidence space and the second confidence space, of the classification label to which the text to be classified belongs comprises:
    When the first classification label is different from the second classification label, judging whether the first confidence is greater than the second confidence;
    If the first confidence is greater than the second confidence, determining that the classification label to which the text to be classified belongs is the first classification label, and that the confidence corresponding to the classification result is the first confidence multiplied by a first coefficient;
    If the first confidence is not greater than the second confidence, determining that the classification label to which the text to be classified belongs is the second classification label, and that the confidence corresponding to the classification result is the second confidence multiplied by a second coefficient.
  15. The electronic device according to any one of claims 9 to 14, wherein the preprocessing of the text to be classified to obtain the processed text comprises:
    Performing punctuation-mark segmentation or sentence-length segmentation on the text to be classified to obtain the processed text.
  16. A computer-readable storage medium, comprising a data storage area and a program storage area, the data storage area storing created data and the program storage area storing a computer program; wherein the computer program, when executed by a processor, implements the following steps:
    Obtaining a multi-model structure classification voting model and a multi-task classification model, the multi-model structure classification voting model and the multi-task classification model being obtained through a pre-built classification model and a training sample set;
    Obtaining text to be classified, and preprocessing the text to be classified to obtain processed text;
    Inputting the processed text into the multi-model structure classification voting model, and classifying the processed text through a plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, the first confidence space including a first confidence that the processed text belongs to a first classification label;
    Inputting the processed text into the multi-task classification model, and classifying the processed text through the multi-task classification model to obtain a second confidence space, the second confidence space including a second confidence that the processed text belongs to a second classification label;
    Determining, according to the first confidence space and the second confidence space, the classification label to which the text to be classified belongs and the classification confidence corresponding to the classification label.
  17. The computer-readable storage medium according to claim 16, wherein before the obtaining of the multi-model structure classification voting model and the multi-task classification model, the computer program, when executed by the processor, further implements the following steps:
    Obtaining the training sample set;
    Training the pre-built classification model according to a random forest algorithm and the training sample set to obtain a plurality of text classification models;
    Constructing the multi-model structure classification voting model using the plurality of text classification models.
  18. The computer-readable storage medium according to claim 17, wherein the constructing of the multi-model structure classification voting model using the plurality of text classification models comprises:
    Classifying a pre-built model test sample using the plurality of text classification models to obtain classification results and confidences corresponding to the classification results;
    Sorting the plurality of text classification models according to the magnitude of the confidences to obtain a base-model ranking table;
    Setting weights for the plurality of text classification models according to preset weight gradient values based on the base-model ranking table, to obtain the multi-model structure classification voting model.
  19. The computer-readable storage medium according to claim 16, wherein before the obtaining of the multi-model structure classification voting model and the multi-task classification model, the computer program, when executed by the processor, further implements the following steps:
    Combining the classification loss in the classification model with a pre-built similarity loss to obtain an improved loss, and replacing the classification loss in the classification model with the improved loss to obtain an optimized classification model;
    Performing feature extraction on the training sample set using the feature extraction neural network in the optimized classification model to obtain sentence vectors;
    Training the optimized classification model through the sentence vectors until the descending gradient of the improved loss of the optimized classification model is smaller than a preset loss threshold within preset training steps, to obtain the multi-task classification model.
  20. The computer-readable storage medium according to claim 16, wherein the determining, according to the first confidence space and the second confidence space, of the classification label to which the text to be classified belongs comprises:
    When the first classification label is the same as the second classification label, determining that the classification label to which the text to be classified belongs is the first classification label and/or the second classification label, and determining that the confidence corresponding to the classification label of the text to be classified is the average of the first confidence and the second confidence.
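Read together, claims 2 to 4 above (mirrored in claims 10 to 12 and 17 to 19) describe training a plurality of base models with a random forest algorithm, ranking them by confidence on a pre-built test sample, weighting them along a preset gradient, and forming an improved loss from the classification loss and a similarity loss. The following minimal sketch illustrates only the ranking-and-weighting step; the weight step of 0.1, the (label, confidence) model interface, and all names are hypothetical assumptions rather than the claimed embodiments.

    # Minimal illustrative sketch of the ranking-and-weighting step of
    # claims 3, 11, and 18. The weight step of 0.1, the (label, confidence)
    # model interface, and all names are hypothetical assumptions.
    def build_voting_weights(base_models, test_sample, weight_step=0.1):
        # Each base model is a callable returning (label, confidence);
        # sorting by confidence yields the base-model ranking table.
        ranking = sorted(base_models,
                         key=lambda model: model(test_sample)[1],
                         reverse=True)
        # Assign weights along the preset gradient: the most confident
        # model receives the largest weight.
        return [(model, max(0.0, 1.0 - rank * weight_step))
                for rank, model in enumerate(ranking)]

A weighted vote over the resulting (model, weight) pairs can then produce the first confidence space. The improved loss of claims 4, 12, and 19 could analogously be an additive combination, for example improved_loss = classification_loss + λ · similarity_loss, although the application does not fix the combination form or the weighting λ.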
PCT/CN2021/083560 2021-01-28 2021-03-29 Text classification method and apparatus, electronic device, and storage medium WO2022160449A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110121141.9 2021-01-28
CN202110121141.9A CN112883190A (en) 2021-01-28 2021-01-28 Text classification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022160449A1 true WO2022160449A1 (en) 2022-08-04

Family ID=76053277

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083560 WO2022160449A1 (en) 2021-01-28 2021-03-29 Text classification method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN112883190A (en)
WO (1) WO2022160449A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378826B (en) * 2021-08-11 2021-12-07 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN115470292B (en) * 2022-08-22 2023-10-10 深圳市沃享科技有限公司 Block chain consensus method, device, electronic equipment and readable storage medium
CN116383724B (en) * 2023-02-16 2023-12-05 北京数美时代科技有限公司 Single-domain label vector extraction method and device, electronic equipment and medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389270B (en) * 2017-08-09 2022-11-04 菜鸟智能物流控股有限公司 Logistics object determination method and device and machine readable medium
CN108108766B (en) * 2017-12-28 2021-10-29 东南大学 Driving behavior identification method and system based on multi-sensor data fusion
US10832003B2 (en) * 2018-08-26 2020-11-10 CloudMinds Technology, Inc. Method and system for intent classification
CN110377727B (en) * 2019-06-06 2022-06-17 深思考人工智能机器人科技(北京)有限公司 Multi-label text classification method and device based on multi-task learning
CN110765267A (en) * 2019-10-12 2020-02-07 大连理工大学 Dynamic incomplete data classification method based on multi-task learning
CN111444952B (en) * 2020-03-24 2024-02-20 腾讯科技(深圳)有限公司 Sample recognition model generation method, device, computer equipment and storage medium
CN112256880A (en) * 2020-11-11 2021-01-22 腾讯科技(深圳)有限公司 Text recognition method and device, storage medium and electronic equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10460257B2 (en) * 2016-09-08 2019-10-29 Conduent Business Services, Llc Method and system for training a target domain classifier to label text segments
CN110019794A (en) * 2017-11-07 2019-07-16 腾讯科技(北京)有限公司 Classification method, device, storage medium and the electronic device of textual resources
CN107992887A (en) * 2017-11-28 2018-05-04 东软集团股份有限公司 Classifier generation method, sorting technique, device, electronic equipment and storage medium
CN110309302A (en) * 2019-05-17 2019-10-08 江苏大学 A kind of uneven file classification method and system of combination SVM and semi-supervised clustering

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115049836A (en) * 2022-08-16 2022-09-13 平安科技(深圳)有限公司 Image segmentation method, device, equipment and storage medium
CN115409104A (en) * 2022-08-25 2022-11-29 贝壳找房(北京)科技有限公司 Method, apparatus, device, medium and program product for identifying object type
CN115168594A (en) * 2022-09-08 2022-10-11 北京星天地信息科技有限公司 Alarm information processing method and device, electronic equipment and storage medium
CN115827875A (en) * 2023-01-09 2023-03-21 无锡容智技术有限公司 Text data processing terminal searching method
CN115827875B (en) * 2023-01-09 2023-04-25 无锡容智技术有限公司 Text data processing terminal searching method
CN117235270A (en) * 2023-11-16 2023-12-15 中国人民解放军国防科技大学 Text classification method and device based on belief confusion matrix and computer equipment
CN117235270B (en) * 2023-11-16 2024-02-02 中国人民解放军国防科技大学 Text classification method and device based on belief confusion matrix and computer equipment
CN117473339A (en) * 2023-12-28 2024-01-30 智者四海(北京)技术有限公司 Content auditing method and device, electronic equipment and storage medium
CN117473339B (en) * 2023-12-28 2024-04-30 智者四海(北京)技术有限公司 Content auditing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112883190A (en) 2021-06-01

Similar Documents

Publication Publication Date Title
WO2022160449A1 (en) Text classification method and apparatus, electronic device, and storage medium
WO2022121171A1 (en) Similar text matching method and apparatus, and electronic device and computer storage medium
WO2022141861A1 (en) Emotion classification method and apparatus, electronic device, and storage medium
CN111460797B (en) Keyword extraction method and device, electronic equipment and readable storage medium
CN113449187A (en) Product recommendation method, device and equipment based on double portraits and storage medium
CN112883730B (en) Similar text matching method and device, electronic equipment and storage medium
CN112906377A (en) Question answering method and device based on entity limitation, electronic equipment and storage medium
CN114491047A (en) Multi-label text classification method and device, electronic equipment and storage medium
CN113887941A (en) Business process generation method and device, electronic equipment and medium
CN111522782A (en) File data writing method and device and computer readable storage medium
CN113313211A (en) Text classification method and device, electronic equipment and storage medium
CN116578696A (en) Text abstract generation method, device, equipment and storage medium
WO2022141838A1 (en) Model confidence analysis method and apparatus, electronic device and computer storage medium
CN116226315A (en) Sensitive information detection method and device based on artificial intelligence and related equipment
WO2022141860A1 (en) Text deduplication method and apparatus, electronic device, and computer readable storage medium
WO2022222228A1 (en) Method and apparatus for recognizing bad textual information, and electronic device and storage medium
CN115146064A (en) Intention recognition model optimization method, device, equipment and storage medium
WO2022141867A1 (en) Speech recognition method and apparatus, and electronic device and readable storage medium
CN113888265A (en) Product recommendation method, device, equipment and computer-readable storage medium
CN115221274A (en) Text emotion classification method and device, electronic equipment and storage medium
CN113343102A (en) Data recommendation method and device based on feature screening, electronic equipment and medium
CN113011164A (en) Data quality detection method, device, electronic equipment and medium
CN116991364B (en) Software development system management method based on big data
CN113486266B (en) Page label adding method, device, equipment and storage medium
WO2022227170A1 (en) Method and apparatus for generating cross-language word vector, electronic device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21922048

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21922048

Country of ref document: EP

Kind code of ref document: A1