WO2018166499A1 - Text classification method, device and storage medium - Google Patents

Text classification method, device and storage medium

Info

Publication number
WO2018166499A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
classification
classification result
classifier
category
Prior art date
Application number
PCT/CN2018/079136
Other languages
English (en)
French (fr)
Inventor
李探
温旭
常卓
闫清岭
张智敏
王树伟
花少勇
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Publication of WO2018166499A1 publication Critical patent/WO2018166499A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present application relates to text classification technology in the field of computers, and in particular, to a text classification method, device and storage medium.
  • Text classifiers can be divided into two main categories: text classifiers based on a priori rules and text classifiers based on models.
  • The classification rules of text classifiers based on prior rules rely on manual mining or accumulation of prior knowledge.
  • Model-based text classifiers mainly use data mining and machine learning models.
  • In practice, classification errors often occur, which reduces classification accuracy and recall.
  • In multi-level classification, if an upper-level category is in error, it directly affects the accuracy of all sub-categories below it. Therefore, how to accurately classify text is the key to solving the above problems.
  • The embodiments of the present application provide a text classification method, device, and storage medium, which solve the problem of classification errors in existing text classification schemes, improve the accuracy of text classification, and enhance maintainability and scalability.
  • An embodiment of the present application provides a text classification method, which is applied to a computing device, and the method includes:
  • the first classifier is used to classify the text to be classified to obtain a first classification result
  • the second classifier is used to classify the incorrectly classified text in the first classification result to obtain a second classification result; wherein the classification parameter of the second classifier has an association relationship with the classification parameter of the first classifier;
  • Embodiments of the present application also provide a text classification device, which is applied to a computing device, the text classification device including a memory and a processor, wherein:
  • the memory stores instructions executable by the processor, and when the instructions are executed, the processor is configured to:
  • the first classifier is used to classify the text to be classified to obtain a first classification result
  • the second classifier is used to classify the incorrectly classified text in the first classification result to obtain a second classification result; wherein the classification parameter of the second classifier has an association relationship with the classification parameter of the first classifier;
  • the embodiment of the present application further provides a non-transitory computer readable storage medium storing computer readable instructions, which can cause at least one processor to perform the method as described above.
  • FIG. 1 is a schematic flowchart diagram of a text classification method according to an embodiment of the present application
  • FIG. 2 is a schematic flowchart diagram of another text classification method according to an embodiment of the present application.
  • FIG. 3 is a schematic flowchart diagram of still another text classification method according to an embodiment of the present application.
  • FIG. 4 is a schematic flowchart diagram of a text classification method according to another embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a text classification apparatus according to an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of another text classification apparatus according to an embodiment of the present application.
  • FIG. 6B is a schematic structural diagram of a system for applying a text classification method according to an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a text classification device according to an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of hardware of a text classification device according to an embodiment of the present application.
  • one solution is to add a series of manual rules to modify the classification of classification errors, but the rules usually cannot cover all cases, and may also cause mis-modification.
  • Another option is to modify the classifier model, including adjusting the characteristics of each category, or modifying the parameters of the classifier model.
  • Both of the above solutions still have problems: the classification cannot be accurately corrected, or the classification accuracy is lowered, and maintainability and scalability deteriorate.
  • a text classification method of an example of the present application may be applied to a computing device, where the computing device may include a terminal device or a server; as shown in FIG. 1, the method includes the following steps:
  • Step 101 Obtain a text to be classified.
  • the step 101 of acquiring the text to be classified may be implemented by the text classification device;
  • the text classification device may be a device capable of classifying the text information, and may be, for example, a mobile terminal or a server capable of classifying the text information.
  • the text to be classified may be text information that has been previously stored in the mobile terminal or server and needs to be classified.
  • A feasible implementation of obtaining the text to be classified is as follows.
  • A user may send a text information acquisition instruction to the mobile terminal or the server, where the instruction carries identification information. After receiving the acquisition instruction sent by the user, the mobile terminal or the server may retrieve the text it stores that corresponds to the identification information.
  • Alternatively, when the user needs to classify certain text information, the user sends the text information acquisition instruction to the mobile terminal,
  • where the instruction carries the identification information; after receiving the acquisition instruction sent by the user, the mobile terminal may forward it to the server, obtain the text information corresponding to the identification information from the server, and finally obtain the text to be classified.
  • The text may be a news item, a post, an article, a product description (for example, an introduction to an application), and the like; any text that needs to be classified in an implementation can serve as the text in this embodiment.
  • Step 102 Classify the text to be classified by using the first classifier to obtain a first classification result.
  • Classifying the text to be classified by using the first classifier to obtain the first classification result may be implemented by the text classification device.
  • The first classification result may be information of the classification result obtained by classifying the text to be classified, and may include at least two pieces of classification information.
  • For example, suppose the texts to be classified, collectively denoted article R, actually belong to two categories (category A and category B).
  • When article R is classified by the first classifier,
  • two groups of classifications can be obtained.
  • The texts in the first group all belong to category A.
  • The texts in the second group include texts of both category A and category B; that is, some texts of category A are incorrectly divided into the second group together with the texts of category B, so an incorrect classification problem exists.
  • Text classification in the above embodiment divides texts of the same category into one group, and the category of a group can be determined from the categories of the texts in it. When a group includes texts of at least two categories, the classification is considered incorrect; and because the group includes texts of at least two categories, the group's category is incorrect regardless of which text category in the group is chosen.
  • Step 103 The second classifier is used to classify the incorrectly classified text in the first classification result to obtain a second classification result.
  • the classification parameter of the second classifier has an association relationship with the classification parameter of the first classifier.
  • the parameters of the first classifier are generated based on the feature information of the text in the text to be classified.
  • The parameters of the second classifier are set based on the feature information of the texts that were incorrectly classified after the first classification.
  • Classifying the incorrectly classified text in the first classification result by using the second classifier to obtain the second classification result may be implemented by the text classification device.
  • When the classification parameters of the first classifier and the second classifier are set, there is a certain relationship between the setting principle of the classification parameter of the first classifier and that of the second classifier.
  • The second classification result may include at least two pieces of classification information, and one of them is the same as one piece of classification information in the first classification result.
  • For example, the second classification result may be obtained by reclassifying the texts of the second group, obtained when article R was classified, whose categories include category A and category B.
  • the second classifier is used to classify the texts in the second group of classifications.
  • two groups of classifications can be obtained.
  • the category of the text in the third group classification is A
  • the category in the fourth group classification is B
  • The category of the texts in one of these groups is the same as that of the first group, namely category A; the category of the texts in the remaining group is B, and no group any longer contains texts of two categories.
  • The second classification result thus subdivides the misclassified texts in the first classification result, and the classification information of the finally formed texts is correct.
  • Step 104 Process, according to the first classification result and the second classification result, the text corresponding to the first classification result and the text corresponding to the second classification result to obtain the target text.
  • Step 104, processing the text corresponding to the first classification result and the text corresponding to the second classification result based on the first classification result and the second classification result to obtain the target text, may be implemented by the text classification device.
  • Specifically, the categories common to the first classification result and the second classification result may be found, and the texts of each common category merged into one text, finally obtaining the target text.
  • each text in the target text belongs to the same category.
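The merging described in step 104 can be sketched in code. The following is a minimal illustrative sketch, assuming (as a representation the application does not specify) that each classification result is a mapping from category to a list of texts:

```python
# Hypothetical sketch of step 104: merge the texts that the first and second
# classification results place under the same category into one target text.
# The data representation and sample values are illustrative assumptions.

def merge_by_category(first_result, second_result):
    """first_result / second_result map category -> list of texts."""
    target = {}
    for category in set(first_result) | set(second_result):
        merged = first_result.get(category, []) + second_result.get(category, [])
        if merged:
            target[category] = merged
    return target

first = {"A": ["text1", "text2"]}          # correctly classified in round one
second = {"A": ["text3"], "B": ["text4"]}  # reclassified from the mixed group
print(merge_by_category(first, second))
```

Every text under a given key of the returned mapping belongs to the same category, matching the statement above that each text in the target text belongs to the same category.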
  • With the text classification method of the embodiment of the present application, the text to be classified is obtained; the first classifier is used to classify the text to be classified to obtain a first classification result; the second classifier is used to classify the incorrectly classified text in the first classification result to obtain a second classification result, where the classification parameter of the second classifier has an association relationship with the classification parameter of the first classifier; and based on the first classification result and the second classification result, the text corresponding to the first classification result and the text corresponding to the second classification result are processed to obtain the target text. In this way, after the text to be classified is classified, the misclassified texts can be classified again, and the classifications of the reclassified texts are all correct, which solves the problem of classification errors in existing text classification schemes, improves the accuracy of text classification, and enhances maintainability and scalability.
  • an embodiment of the present application provides a text classification method, where the method includes the following steps:
  • Step 201 The text classification device acquires the text to be classified.
  • Step 202 The text classification device uses the first classifier to classify the text to be classified to obtain a first classification result.
  • The first classifier may classify the text to be classified based on a preset classification parameter.
  • The preset classification parameter may be generated according to the feature information of the text to be classified. The feature information may be parameters capable of characterizing the attribute information of the text to be classified and may include, for example, features such as "tool" or "musical instrument".
  • The first classifier may be a text classifier based on a priori rules, whose classification rules are obtained by manual mining or accumulation of prior knowledge; or it may be a model-based text classifier using data mining and machine learning models, such as nearest neighbor classifiers, logistic regression classifiers, decision tree classifiers, naive Bayes classifiers, support vector machine classifiers, artificial neural network classifiers, and the like.
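As one hedged illustration of the feature information mentioned in step 202, a simple bag-of-words count could serve as the input from which classification parameters are generated. The application does not prescribe a concrete feature scheme, so the sketch below is an assumption:

```python
# Illustrative only: the application does not specify how feature information
# is computed. A word-count (bag-of-words) representation is one common choice.
from collections import Counter

def extract_features(text):
    """Return a word-count feature mapping for one text."""
    words = text.lower().split()
    return Counter(words)

features = extract_features("The tool and the musical instrument")
print(features["the"])  # the word "the" appears twice
```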
  • Step 203 The text classification device determines whether there is a text with an incorrect classification in the first classification result.
  • Determining whether there is incorrectly classified text in the first classification result may be implemented by checking whether all the texts classified into one group actually have the same category; if the texts classified into one group have at least two categories, there is incorrectly classified text in that group.
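The check just described, namely that a group is incorrectly classified when it contains texts of at least two categories, can be sketched as follows (the list-of-labels representation of a group is an assumption):

```python
def has_misclassified_text(group_labels):
    """A group is considered incorrectly classified when the texts assigned
    to it carry at least two distinct categories."""
    return len(set(group_labels)) >= 2

# First group: all texts are category A -> classification is correct.
print(has_misclassified_text(["A", "A", "A"]))  # False
# Second group: texts of categories A and B are mixed -> incorrect.
print(has_misclassified_text(["A", "B", "A"]))  # True
```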
  • Step 204 If there is an incorrectly classified text in the first classification result, the text classification device acquires the incorrectly classified text in the first classification result.
  • For example, the first classifier classifies article R into two groups: the texts in the first group all belong to category A, while the second group includes texts of both category A and category B. There is therefore incorrectly classified text in the second group, so it is necessary to obtain the text corresponding to the second group in the first classification result, that is, the texts of the group containing both category A and category B.
  • Step 205 The text classification device acquires feature information of the incorrectly classified text in the first classification result.
  • the feature information of the corresponding text may be acquired.
  • The definition of the feature information here is the same as that in step 202, except that here the feature information is a parameter of the attribute information of the incorrectly classified text in the first classification result; for example, it may be a parameter of the attribute information of the text corresponding to the second group classification.
  • Step 206 The text classification device sets the classification parameter based on the feature information of the incorrectly classified text in the first classification result.
  • Specifically, the classification parameter of the second classifier may be set according to the acquired feature information of the incorrectly classified text in the first classification result, which finally enables classification of the incorrectly classified text in the first classification result.
  • Step 207 The text classification device classifies the incorrectly classified text in the first classification result based on the classification parameter and uses the second classifier to obtain the second classification result.
  • the classification parameter of the first classifier is generated according to the feature information of the text in the text to be classified.
  • the classification parameter of the first classifier is different from the classification parameter of the second classifier.
  • The classification method used by the first classifier may be the same as that used by the second classifier, or the two classifiers may use different classification methods.
  • The second classifier may likewise be a text classifier based on a priori rules, whose classification rules are obtained by manual mining or accumulation of prior knowledge; or it may be a model-based text classifier using data mining and machine learning models, such as nearest neighbor classifiers, logistic regression classifiers, decision tree classifiers, naive Bayes classifiers, support vector machine classifiers, artificial neural network classifiers, and the like.
  • For example, suppose both the first classifier and the second classifier adopt a logistic regression classifier. As shown in FIG. 3, to classify the article R (a news article), a logistic regression classifier based on the set first classification parameter (the original classification model) may first be used
  • to classify article R, obtaining two groups of classification results.
  • The texts in the first group all belong to category A (classified correctly), while the texts in the second group include both category A and category B (classification error): texts that should have been assigned to the first group are incorrectly divided into the second group and classified as category B. Misclassification clearly exists in this classification result.
  • Thereafter, the texts corresponding to the second group are obtained, and a logistic regression classifier based on the set second classification parameter (a new classification model) is used to reclassify the second group, which includes texts of category A and category B,
  • obtaining two further groups of classification results.
  • the category of the text in the third group classification is A (the classification is correct)
  • the category of the text in the fourth group classification is B (the classification is correct).
  • At this point, the categories of all texts in the classification result are correct.
  • the first classification parameter is set according to the feature information of the article R
  • the second classification parameter is set according to the feature information of the text corresponding to the second group classification.
  • Because the first classification parameter is set according to the feature information of all the texts, that is, article R, text misclassification exists in the first classification result. In the second classification, a logistic regression
  • classifier is again used,
  • but the second classification parameter is set based on the feature information of only the texts that were misclassified the first time (that is, the group including category A and category B). Because the setting of the second classification parameter is more precise, the classification results of the texts obtained after the second classification are correct.
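The two-stage scheme of steps 201 to 207 can be sketched as follows. A trivial keyword-based classifier stands in for the logistic regression model, and all texts, keywords, and labels are illustrative assumptions, not data from the application:

```python
# Minimal sketch of the two-stage scheme: stage-one parameters come from ALL
# texts; stage-two parameters come only from the misclassified subset.
# The keyword classifier is a stand-in for logistic regression.

def fit_keyword_classifier(texts_with_labels):
    """'Train' by remembering words that occur under exactly one label."""
    word_labels = {}
    for text, label in texts_with_labels:
        for word in text.split():
            word_labels.setdefault(word, set()).add(label)
    return {w: next(iter(ls)) for w, ls in word_labels.items() if len(ls) == 1}

def classify(model, text, default="B"):
    for word in text.split():
        if word in model:
            return model[word]
    return default

# Stage one: parameters set from all of article R -> may misclassify.
stage_one = fit_keyword_classifier([("piano violin", "A"), ("hammer wrench", "B")])

# Texts the first stage put into the mixed second group:
mixed_group = ["violin solo", "wrench set"]

# Stage two: parameters set only from the misclassified subset, so the
# model is more precise on exactly those texts.
stage_two = fit_keyword_classifier([("violin solo", "A"), ("wrench set", "B")])
second_result = {t: classify(stage_two, t) for t in mixed_group}
print(second_result)
```

The design point mirrors the paragraph above: the second classifier is not a different algorithm here, only the same method refit on the narrower, previously misclassified set of texts.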
  • Step 208 The text classification device processes the text corresponding to the first classification result and the text corresponding to the second classification result according to the first classification result and the second classification result to obtain the target text.
  • Specifically, the first classification result and the second classification result may be compared, and the texts corresponding to the first classification result and the second classification result filtered and combined according to the comparison result, finally obtaining the target text.
  • With the text classification method of the embodiment of the present application, the text to be classified is obtained; the first classifier is used to classify the text to be classified to obtain a first classification result; the second classifier is used to classify the incorrectly classified text in the first classification result to obtain a second classification result, where the classification parameter of the second classifier has an association relationship with the classification parameter of the first classifier; and based on the first classification result and the second classification result, the text corresponding to the first classification result and the text corresponding to the second classification result are processed to obtain the target text. In this way, after the text to be classified is classified, the misclassified texts can be classified again, and the classifications of the reclassified texts are all correct, which solves the problem of classification errors in existing text classification schemes, improves the accuracy of text classification, and enhances maintainability and scalability.
  • an embodiment of the present application provides a text classification method. Referring to FIG. 4, the method includes the following steps:
  • Step 301 The text classification device acquires the text to be classified.
  • Step 302 The text classification device uses the first classifier to classify the text to be classified to obtain a first classification result.
  • Step 303 The text classification device determines whether there is a text with an incorrect classification in the first classification result.
  • Step 304 If there is an incorrectly classified text in the first classification result, the text classification device acquires the incorrectly classified text in the first classification result.
  • Step 305 The text classification device acquires feature information of the incorrectly classified text in the first classification result.
  • Step 306 The text classification device sets the classification parameter based on the feature information of the incorrectly classified text in the first classification result.
  • Step 307 The text classification device classifies the incorrectly classified text in the first classification result based on the classification parameter and uses the second classifier to obtain the second classification result.
  • the classification parameter of the first classifier is generated according to the feature information of the text in the text to be classified.
  • The classification parameter of the first classifier is different from the classification parameter of the second classifier.
  • The classification method used by the first classifier may be the same as that used by the second classifier, or the two classifiers may use different classification methods.
  • the classification method adopted by the first classifier is a logistic regression classifier
  • the classification method adopted by the second classifier is a decision tree classifier.
  • The logistic regression classifier may be adopted first to classify article R based on the set first
  • classification parameter, obtaining three groups of classification results.
  • The texts in the first group all belong to category A.
  • The texts in the second group include categories A, B, and C.
  • The texts in the remaining group all belong to category C; misclassification clearly exists in the second group of the classification result. Thereafter, the texts corresponding to the second group are obtained, and the decision tree classifier, based on the set second classification parameter, is used to classify
  • the texts including category A, category B, and category C, obtaining three further groups of classification results.
  • the category of the text in the third group classification is A
  • the category of the text in the fourth group classification is B
  • the category of the text in the fifth group classification is C.
  • the first classification parameter is set according to the feature information of the article R
  • The second classification parameter is set according to the feature information of the texts corresponding to the second group classification (that is, the group including category A, category B, and category C). Because the first classification parameter is set according to the feature information of all the texts, that is, article R, text misclassification exists in the first classification result when the logistic regression classifier is used the first time. The second classification uses a decision tree classifier,
  • and the second classification parameter is set according to only the texts that were misclassified the first time (that is, the texts including category A, category B, and category C). Because the setting of the second classification parameter is more precise, the classification results of the texts obtained after the second classification are correct.
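The data flow of this three-category example (steps 301 to 307) can be summarized as follows; the text names and group contents are illustrative assumptions standing in for article R:

```python
# Data-flow sketch of the worked example: the first pass yields one mixed
# group, whose texts are then reclassified into pure groups by the second
# classifier (a decision tree in the example). Contents are assumptions.

first_result = {
    "group1": {"texts": ["a1", "a2"], "labels": ["A", "A"]},            # correct
    "group2": {"texts": ["a3", "b1", "c1"], "labels": ["A", "B", "C"]},  # mixed
    "group3": {"texts": ["c2"], "labels": ["C"]},                        # correct
}

# Detect the group that mixes categories; only its texts are reclassified.
mixed = {g: v for g, v in first_result.items() if len(set(v["labels"])) >= 2}
assert list(mixed) == ["group2"]

# Assume the second classifier assigns each text of the mixed group to its
# true category, as the example describes:
second_result = {"A": ["a3"], "B": ["b1"], "C": ["c1"]}
print(sorted(second_result))  # every reclassified group is now pure
```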
  • Step 308 The text classification device acquires a category of the correctly classified text in the first classification result, and obtains the first category.
  • the first category includes at least one category.
  • the category of the correctly classified text in the first classification result may be obtained as category A and category C, that is, the first category may be category A and category C.
  • Step 309 The text classification device processes the text corresponding to the first classification result and the text corresponding to the second classification result according to the first category and the second classification result to obtain the target text.
  • Step 309, processing the text corresponding to the first classification result and the text corresponding to the second classification result based on the first category and the second classification result to obtain the target text, may be implemented as follows:
  • Step 309a The text classification device acquires, according to the second classification result, the text of the category corresponding to the first category in the text corresponding to the second classification result, to obtain the first text collection.
  • From the texts corresponding to the second classification result, that is, the texts corresponding to the third, fourth, and fifth group classifications,
  • the texts whose categories belong to the first category, namely categories A and C, are obtained; the texts corresponding to the third group classification and the fifth group classification are thus obtained as the first text collection.
  • the first text set includes at least two texts.
  • For example, texts of the two categories A and C are included in the first text collection.
  • Step 309b The text classification device combines the first text set with the texts belonging to the same category among the correctly classified texts in the first classification result to obtain the first target text.
  • For example, the texts of category A in the first text set are combined with the texts corresponding to the first group classification, and the texts of category C in the first text set are combined with the texts corresponding to the third group classification, finally obtaining the first target text.
  • the first target text includes at least one category of text.
  • the target text includes a first target text and a second target text.
  • Step 309c The text classification device acquires the texts, among those corresponding to the second classification result, whose categories are not in the first category, obtaining the second target text.
  • For example, such text is the text corresponding to the fourth group classification (that is, the group whose category is B in the second classification result), and this text is the second target text.
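Steps 308 through 309c can be sketched end to end as follows; the category names follow the example above, while the concrete text values are illustrative assumptions:

```python
# Sketch of steps 308-309c: combine the correctly classified first-round
# texts with the matching second-round texts (first target text), and keep
# the remaining second-round texts separately (second target text).

first_correct = {"A": ["a1", "a2"], "C": ["c2"]}         # correct in round one
second_result = {"A": ["a3"], "B": ["b1"], "C": ["c1"]}  # from second classifier

# Step 308: the first category is the set of correctly classified categories.
first_category = set(first_correct)                      # {"A", "C"}

# Step 309a: texts in the second result whose category is in the first category.
first_text_set = {c: ts for c, ts in second_result.items() if c in first_category}

# Step 309b: merge with the correctly classified first-round texts.
first_target = {c: first_correct[c] + first_text_set[c] for c in first_category}

# Step 309c: the remaining second-round texts form the second target text.
second_target = {c: ts for c, ts in second_result.items() if c not in first_category}

print(first_target["A"], second_target)
```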
  • With the text classification method in the present application, the classification information finally obtained for the texts is correct. Even if the text to be classified involves multiple levels of classification, because the classification information after the first-level classification can be ensured to be accurate, no matter how many further levels of classification follow, as long as they are performed according to the text classification method in this application, the accuracy of the final classification result can be guaranteed.
  • With the text classification method of the embodiment of the present application, the text to be classified is obtained; the first classifier is used to classify the text to be classified to obtain a first classification result; the second classifier is used to classify the incorrectly classified text in the first classification result to obtain a second classification result, where the classification parameter of the second classifier has an association relationship with the classification parameter of the first classifier; and based on the first classification result and the second classification result, the text corresponding to the first classification result and the text corresponding to the second classification result are processed to obtain the target text. In this way, after the text to be classified is classified, the misclassified texts can be classified again, and the classifications of the reclassified texts are all correct, which solves the problem of classification errors in existing text classification schemes, improves the accuracy of text classification, and enhances maintainability and scalability.
  • An embodiment of the present application provides a text classification device 4, which is applied to the text classification method provided by the embodiments corresponding to FIG. 1 to FIG. 2. The device includes: a first obtaining unit 41, a first classification unit 42, a second classification unit 43, and a processing unit 44, wherein:
  • the first obtaining unit 41 is configured to acquire text to be classified.
  • the first classifying unit 42 is configured to classify the text to be classified using the first classifier to obtain a first classification result.
  • the second classifying unit 43 is configured to classify the incorrectly classified text in the first classification result by using the second classifier to obtain a second classification result.
  • the classification parameter of the second classifier has an association relationship with the classification parameter of the first classifier.
  • the processing unit 44 is configured to process, according to the first classification result and the second classification result, the text corresponding to the first classification result and the text corresponding to the second classification result to obtain the target text.
  • the text classification apparatus acquires text to be classified, classifies the text to be classified with a first classifier to obtain a first classification result, and classifies the incorrectly classified text in the first classification result with a second classifier to obtain a second classification result, where the classification parameter of the second classifier is associated with the classification parameter of the first classifier; based on the first classification result and the second classification result, the text corresponding to the first classification result and the text corresponding to the second classification result are processed to obtain target text. In this way, after the text to be classified has been classified, the text that was misclassified can be classified again, and the classifications obtained after this reclassification are all correct; this solves the misclassification problem in existing text classification schemes, improves the accuracy of text classification, and enhances maintainability and scalability.
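The two-pass scheme summarized above can be sketched in a few lines. This is a minimal toy, not the application's implementation: the keyword rules stand in for the first and second classifiers (the application's examples use e.g. logistic regression models with different parameters), and the sample texts, category names, and the assumption that group B is the mixed group are all hypothetical.

```python
# Minimal sketch of the two-pass classification scheme.
def first_classifier(text):
    # First pass: a coarse rule whose "parameter" was tuned on all texts.
    return "A" if "movie" in text else "B"

def second_classifier(text):
    # Second pass: a finer rule whose "parameter" reflects only the texts
    # the first pass got wrong (here, celebrity stories mislabeled B).
    return "A" if "movie" in text or "celebrity" in text else "B"

def classify(texts):
    first = {t: first_classifier(t) for t in texts}
    # Suppose the apparatus judges that group B contains misclassified text,
    # so only that group is re-classified by the second classifier.
    second = {t: second_classifier(t) for t, c in first.items() if c == "B"}
    # Merge: keep the first pass's correct group, adopt the second pass's
    # verdicts for the re-classified texts.
    target = {t: c for t, c in first.items() if c == "A"}
    target.update(second)
    return target

texts = ["new movie released", "celebrity wedding photos", "trade summit held"]
target = classify(texts)
```

The point of the structure is that the second classifier only ever sees the suspect group, which is why its parameters can be set more precisely than the first classifier's.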
  • the apparatus further includes: a determining unit 45 and a second obtaining unit 46, wherein:
  • the determining unit 45 is configured to determine whether there is text in the first classification result that is incorrectly classified.
  • the second obtaining unit 46 is configured to obtain the incorrectly classified text in the first classification result if there is a text with an incorrect classification in the first classification result.
  • the second classification unit 43 includes: a first acquisition module 431, a setting module 432, and a classification module 433, where:
  • the first obtaining module 431 is configured to acquire feature information of the incorrectly classified text in the first classification result.
  • the setting module 432 is configured to set the classification parameter based on the feature information of the incorrectly classified text in the first classification result.
  • the classification module 433 is configured to classify, based on the classification parameter and using the second classifier, the incorrectly classified text in the first classification result to obtain the second classification result.
  • the classification parameter of the first classifier is generated according to the feature information of the text in the text to be classified.
  • the processing unit 44 includes: a second obtaining module 441 and a processing module 442, where:
  • the second obtaining module 441 is configured to obtain a category of the correctly classified text in the first classification result, to obtain a first category.
  • the first category includes at least one category.
  • the processing module 442 is configured to process, according to the first category and the second classification result, the text corresponding to the first classification result and the text corresponding to the second classification result to obtain the target text.
  • processing module 442 is specifically configured to perform the following steps:
  • based on the second classification result, obtain, from the text corresponding to the second classification result, the texts whose category belongs to the first category, to obtain a first text set.
  • combine the first text set with the texts of the same category among the correctly classified texts in the first classification result, to obtain a first target text.
  • obtain, from the text corresponding to the second classification result, the texts whose category is outside the first category set, to obtain a second target text.
  • the target text includes the first target text and the second target text.
  • the classification parameter of the first classifier is different from the classification parameter of the second classifier.
  • the classification method employed by the first classifier may be the same as that employed by the second classifier, or the two classifiers may employ different classification methods.
  • the text classification apparatus acquires text to be classified, classifies the text to be classified with a first classifier to obtain a first classification result, and classifies the incorrectly classified text in the first classification result with a second classifier to obtain a second classification result, where the classification parameter of the second classifier is associated with the classification parameter of the first classifier; based on the first classification result and the second classification result, the text corresponding to the first classification result and the text corresponding to the second classification result are processed to obtain target text. In this way, the text that was misclassified after the initial classification can be classified again, and the classifications obtained after this reclassification are all correct; this solves the misclassification problem in existing text classification schemes, improves the accuracy of text classification, and enhances maintainability and scalability.
  • An embodiment of the present application provides a text classification device 5, which can be applied to the text classification methods provided by the embodiments corresponding to FIGS. 1-2 and 4. The device includes: a memory 51 and a processor 52, wherein:
  • the memory 51 stores instructions executable by the processor 52, and when the instructions are executed, the processor 52 is configured to: acquire text to be classified; classify the text to be classified with the first classifier to obtain a first classification result; classify the incorrectly classified text in the first classification result with the second classifier to obtain a second classification result, where the classification parameter of the second classifier is associated with the classification parameter of the first classifier; and, based on the first classification result and the second classification result, process the text corresponding to the first classification result and the text corresponding to the second classification result to obtain target text.
  • FIG. 6B is a schematic structural diagram of a system 600B to which a text classification method according to an embodiment of the present application is applicable.
  • the system 600B includes at least a terminal device 601, a text server 602, and a network 603, which may also include a resource server 604.
  • the terminal device 601 refers to a terminal device having a data calculation and processing function, including but not limited to smart phones, handheld computers, tablet computers, and the like (with a communication module installed). Operating systems are installed on these terminal devices 601, including but not limited to: the Android operating system, the Symbian operating system, the Windows mobile operating system, and the Apple iPhone OS operating system.
  • the terminal device 601 is installed with an application (for example, a news APP or a reading PC application client), and the application performs information interaction via the network 603 with the application server software installed in the text server 602 (e.g., the application server software corresponding to the news APP or the reading PC application client), such as sending a text acquisition request to the text server 602 and receiving text information sent by the text server 602.
  • a text application server software is installed in the text server 602, and the text application server software provides a corresponding text resource (e.g., text information, etc.) for the application installed in the terminal device 601 via the network 603.
  • the terminal device 601 can receive an acquisition instruction or a viewing instruction sent by the user (e.g., an instruction to obtain or view a news text), the instruction carrying identification information (e.g., text identification information). In response to the instruction, the terminal device 601 sends an acquisition request or a viewing request to the text server 602; the request carries the identification information and may further carry user identification information (e.g., a user ID). In response to the request, the text server 602 returns the text to be classified (e.g., multiple news texts of different kinds) to the terminal device 601, which stores the text to be classified in the local terminal device 601.
  • the terminal device 601 classifies the text to be classified using a first classifier (e.g., a logistic regression classifier) stored in the local terminal device 601 to obtain a first classification result (e.g., category A and category B, where category A may be entertainment news and category B may be international news). The terminal device 601 then uses a second classifier (e.g., a logistic regression classifier) stored in the local terminal device 601 to classify the incorrectly classified text in the first classification result (e.g., after the terminal device 601 judges the first classification result and determines that the category-B text contains misclassified text, the second classifier in the terminal device 601 classifies the category-B text), obtaining a second classification result (e.g., category A and category B), where the classification parameter of the second classifier is associated with the classification parameter of the first classifier.
  • based on the first classification result and the second classification result, the terminal device 601 processes the text corresponding to the first classification result and the text corresponding to the second classification result to obtain the target text (e.g., the terminal device 601 merges the category-A text of the first classification result with the category-A text of the second classification result, obtaining the final category-A text together with the category-B text of the second classification result). After obtaining the target text (i.e., the correctly classified text), the terminal device 601, in response to the acquisition or viewing instruction, displays information of the target text (e.g., news titles) according to the final classification result (e.g., titles of category-A entertainment news are displayed under category-A entertainment news, and titles of category-B international news under category-B international news), so that users can conveniently view or obtain text according to their own needs.
  • the text server 602 can send an acquisition request to the resource server 604 (e.g., a server for storing various kinds of text) through the network 603, the acquisition request carrying identification information (e.g., text identification information), and the resource server 604 returns the text to be classified in response to the acquisition request.
  • the text server 602 stores the acquired texts to be classified (e.g., multiple entertainment news texts, multiple international news texts, and multiple military news texts) in the local text server 602, and classifies the texts to be classified using a first classifier (e.g., a logistic regression classifier) stored in the local text server 602, obtaining a first classification result (e.g., categories A, B, and C, where category A may be entertainment news, category B international news, and category C military news). The text server 602 then uses a second classifier (e.g., a logistic regression classifier) stored in the local text server 602 to classify the incorrectly classified text in the first classification result (e.g., after the text server 602 judges the first classification result and determines that the category-B text contains misclassified text, the second classifier in the text server 602 classifies the category-B text), obtaining a second classification result (e.g., categories A, B, and C), where the classification parameter of the second classifier is associated with the classification parameter of the first classifier.
  • based on the first classification result and the second classification result, the text server 602 processes the text corresponding to the first classification result and the text corresponding to the second classification result (e.g., the text server 602 merges the category-A and category-C texts of the first classification result with the category-A and category-C texts of the second classification result), obtaining the final category-A text and category-C text, together with the category-B text of the second classification result, i.e., the target text (the correctly classified text).
  • when the terminal device 601 sends an acquisition request carrying identification information (e.g., text identification information), the text server 602, in response to the acquisition request, returns information of the target text (i.e., the correctly classified text) to the terminal device 601. After receiving the information of the target text, the terminal device 601, in response to the user's acquisition instruction, displays the text information according to the category of the target text (e.g., displaying entertainment news headlines under the entertainment news category, international news headlines under the international news category, and military news headlines under the military news category), so that users can view them as needed.
  • the resource server 604 is installed with resource application server software, and the resource application server software interacts with the text application server software in the text server 602 through the network 603 to provide corresponding text resources (eg, text information, etc.).
  • the network 603 can be a wired network or a wireless network.
  • the text classification device acquires text to be classified, classifies the text to be classified with a first classifier to obtain a first classification result, and classifies the incorrectly classified text in the first classification result with a second classifier to obtain a second classification result, where the classification parameter of the second classifier is associated with the classification parameter of the first classifier; based on the first classification result and the second classification result, the text corresponding to the first classification result and the text corresponding to the second classification result are processed to obtain target text. In this way, the text that was misclassified after the initial classification can be classified again, and the classifications obtained after this reclassification are all correct; this solves the misclassification problem in existing text classification schemes, improves the accuracy of text classification, and enhances maintainability and scalability.
  • in practical applications, the first acquiring unit 41, the first classifying unit 42, the second classifying unit 43, the processing unit 44, the determining unit 45, the second acquiring unit 46, the first acquiring module 431, the setting module 432, the classification module 433, the second acquiring module 441, and the processing module 442 may each be implemented by a Central Processing Unit (CPU), a Micro Processor Unit (MPU), a Digital Signal Processor (DSP), or a Field Programmable Gate Array (FPGA) located in the text classification device.
  • FIG. 8 shows a compositional structure diagram 800 of specific hardware of the text classification device 5.
  • in addition to one or more processors (CPUs) 802 and a memory 806, the text classification device 5 includes a communication module 804, a user interface 810, and a communication bus 808 for interconnecting these components.
  • the processor 802 can receive and transmit data through the communication module 804 to effect network communication and/or local communication.
  • User interface 810 includes one or more output devices 812 that include one or more speakers and/or one or more visual displays.
  • User interface 810 also includes one or more input devices 814 including, for example, a keyboard, a mouse, a voice command input unit or microphone, a touch screen display, a touch sensitive tablet, a gesture capture camera, or other input buttons or controls.
  • the memory 806 can be a high speed random access memory such as DRAM, SRAM, DDR RAM, or other random access solid state storage device; or a non-volatile memory such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, Or other non-volatile solid-state storage devices.
  • the memory 806 stores a set of instructions executable by the processor 802, including:
  • Operating system 816 including programs for processing various basic system services and for performing hardware related tasks
  • the application 818 includes various applications; these applications can implement the processing flows in each of the above examples and may include, for example, some or all of the modules of the text classification apparatus 4 shown in FIG. 5. At least one of the modules 41 to 44 may store machine-executable instructions, and by executing the machine-executable instructions of at least one of the modules 41 to 44 in the memory 806, the processor 802 can implement the functionality of at least one of the modules 41 to 44.
  • the hardware modules in the embodiments may be implemented in a hardware manner or a hardware platform plus software.
  • the above software includes machine readable instructions stored in a non-volatile storage medium.
  • embodiments can also be embodied as software products.
  • the hardware may be implemented by specialized hardware or hardware that executes machine readable instructions.
  • the hardware can be a specially designed permanent circuit or logic device (such as a dedicated processor such as an FPGA or ASIC) for performing a particular operation.
  • the hardware may also include programmable logic devices or circuits (such as including general purpose processors or other programmable processors) that are temporarily configured by software for performing particular operations.
  • each instance of the present application can be implemented by a data processing program executed by a data processing device such as a computer.
  • the data processing program constitutes the present application.
  • a data processing program is usually stored in a storage medium and is executed by directly reading the program out of the storage medium or by installing or copying the program to a storage device (such as a hard disk and/or memory) of the data processing device.
  • therefore, such a storage medium also constitutes the present application.
  • the present application also provides a non-volatile storage medium in which a data processing program is stored, which can be used to execute any of the above-described method examples of the present application.
  • the machine readable instructions corresponding to the modules in FIG. 5 may cause an operating system or the like operating on a computer to perform some or all of the operations described herein.
  • the non-transitory computer readable storage medium may be inserted into a memory provided in an expansion board within the computer or written to a memory provided in an expansion unit connected to the computer.
  • the CPU or the like installed on the expansion board or the expansion unit can perform part or all of the actual operations according to the instructions.
  • embodiments of the present application can be provided as a method, system, or computer program product. Accordingly, the application can take the form of a hardware embodiment, a software embodiment, or an embodiment in combination with software and hardware. Moreover, the application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage and optical storage, etc.) including computer usable program code.
  • the computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising the instruction device.
  • the instruction device implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing.
  • the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text classification method and device, the method comprising: acquiring text to be classified (101); classifying the text to be classified with a first classifier to obtain a first classification result (102); classifying the incorrectly classified text in the first classification result with a second classifier to obtain a second classification result, wherein a classification parameter of the second classifier is associated with a classification parameter of the first classifier (103); and, based on the first classification result and the second classification result, processing the text corresponding to the first classification result and the text corresponding to the second classification result to obtain target text (104).

Description

Text classification method, device and storage medium
This application claims priority to Chinese Patent Application No. 201710159632.6, entitled "Text classification method, apparatus and device", filed with the China Patent Office on March 17, 2017, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to text classification technology in the field of computers, and in particular to a text classification method, device, and storage medium.
Background of the Invention
Commonly used text classifiers can be divided into two main categories: text classifiers based on prior rules and model-based text classifiers. The classification rules of prior-rule-based text classifiers depend on manual mining or the accumulation of prior knowledge, while model-based text classifiers mainly use method models of data mining and machine learning. In practical text classification applications, whichever classifier model is used, misclassification frequently occurs, reducing classification accuracy and recall; moreover, in multi-level classification, an error at an upper-level category directly affects the accuracy of all sub-categories below it. Therefore, how to classify text accurately is the key to solving the above problems.
Summary of the Invention
To solve the above technical problems, embodiments of the present application provide a text classification method, device, and storage medium, which solve the misclassification problem in existing text classification schemes, improve the accuracy of text classification, and enhance maintainability and scalability.
An embodiment of the present application provides a text classification method, applied to a computing device, the method including:
acquiring text to be classified;
classifying the text to be classified with a first classifier to obtain a first classification result;
classifying the incorrectly classified text in the first classification result with a second classifier to obtain a second classification result, wherein a classification parameter of the second classifier is associated with a classification parameter of the first classifier; and
based on the first classification result and the second classification result, processing the text corresponding to the first classification result and the text corresponding to the second classification result to obtain target text.
An embodiment of the present application further provides a text classification device, applied to a computing device, the text classification device including a memory and a processor, wherein:
the memory stores instructions executable by the processor, and when the instructions are executed, the processor is configured to:
acquire text to be classified;
classify the text to be classified with a first classifier to obtain a first classification result;
classify the incorrectly classified text in the first classification result with a second classifier to obtain a second classification result, wherein a classification parameter of the second classifier is associated with a classification parameter of the first classifier; and
based on the first classification result and the second classification result, process the text corresponding to the first classification result and the text corresponding to the second classification result to obtain target text.
An embodiment of the present application further provides a non-volatile computer-readable storage medium storing computer-readable instructions that can cause at least one processor to perform the method described above.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a text classification method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of another text classification method provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of yet another text classification method provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of a text classification method provided by another embodiment of the present application;
FIG. 5 is a schematic structural diagram of a text classification apparatus provided by an embodiment of the present application;
FIG. 6A is a schematic structural diagram of another text classification apparatus provided by an embodiment of the present application;
FIG. 6B is a schematic structural diagram of a system to which a text classification method provided by an embodiment of the present application is applicable;
FIG. 7 is a schematic structural diagram of a text classification device provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of hardware of a text classification device provided by an embodiment of the present application.
Implementation
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application.
Two schemes may be adopted to solve the above problem of how to classify text accurately. One scheme is to add a series of manual rules to correct erroneous classifications; however, rules usually cannot cover all cases and may also cause erroneous modifications. The other scheme is to modify the classifier model, including adjusting the features of each category or modifying the parameters of the classifier model. Both of these solutions still fail to correct classifications accurately, still reduce classification accuracy, and worsen maintainability and scalability.
An example of the present application provides a text classification method, which can be applied to a computing device, where the computing device may include a terminal device or a server. Referring to FIG. 1, the method includes the following steps:
Step 101: Acquire text to be classified.
Specifically, step 101 of acquiring the text to be classified may be implemented by a text classification apparatus; the text classification apparatus may be a device capable of classifying text information, for example a mobile terminal or a server capable of classifying text information. The text to be classified may be text information that has been stored in advance in the mobile terminal or server and needs to be classified. One feasible implementation of acquiring the text to be classified is as follows: when a user needs to classify certain text information, the user may send a text information acquisition instruction carrying identification information to the mobile terminal or server; upon receiving the acquisition instruction sent by the user, the mobile terminal or server may retrieve, from the text information it stores, the text information corresponding to the identification information to obtain the text to be classified. Alternatively, in another feasible implementation, upon receiving the acquisition instruction sent by the user, the mobile terminal or server may forward the acquisition instruction to a server and obtain the text information corresponding to the identification information from that server, finally obtaining the text to be classified.
Here, the text may be news, posts, articles, product descriptions (for example, introductions of applications), and the like; in practice, anything that needs to be classified can serve as the text in this embodiment.
Step 102: Classify the text to be classified with a first classifier to obtain a first classification result.
Specifically, step 102 may be implemented by the text classification apparatus. The first classification result may be information about the classification result obtained after classifying the text to be classified, and the first classification result may include at least two kinds of classification information.
Take as an example the classification of an article set R containing multiple texts or articles, in which two categories (category A and category B) are included. After R is acquired, classifying R with the first classifier may yield two groups: the texts of the first group all belong to category A, while the second group includes texts of both category A and category B. The category-A texts in the second group, denoted A_b, should have been assigned to category A but were erroneously grouped into the second group together with the category-B texts; thus an incorrect classification has occurred.
It should be understood that the text classification in the above embodiment assigns texts of the same category to one group, and the category of a group can be determined from the categories of the texts in it. When a group contains texts of at least two categories, the group is considered incorrectly classified; and because the group contains texts of at least two categories, its category is incorrect no matter which text's category it is determined from.
Step 103: Classify the incorrectly classified text in the first classification result with a second classifier to obtain a second classification result.
The classification parameter of the second classifier is associated with the classification parameter of the first classifier.
The parameter of the first classifier (i.e., the first classification parameter) is generated according to feature information of the texts in the text to be classified.
The parameter of the second classifier (i.e., the second classification parameter) is set according to feature information of the texts that were misclassified in the first classification.
Specifically, step 103 may be implemented by the text classification apparatus; when the classification parameters of the first classifier and the second classifier are set, there is a certain association between the principle for setting the first classifier's classification parameter and the principle for setting the second classifier's classification parameter.
The second classification result may include at least two kinds of classification information, one category of which is the same as a category in the first classification result. For example, the second classification result may be obtained by reclassifying the texts of the second group (the group including categories A and B) obtained from classifying the article set R. Classifying the texts of the second group with the second classifier may yield two groups: the texts of the third group belong to category A, and those of the fourth group to category B. The category of the texts in the third group is the same as that of the first group, namely category A; the texts of the remaining group all belong to category B, and no group contains texts of multiple categories. Moreover, this second classification separates out the texts misclassified in the first classification result, so the classification information of the finally formed texts is all correct.
Step 104: Based on the first classification result and the second classification result, process the text corresponding to the first classification result and the text corresponding to the second classification result to obtain target text.
Specifically, step 104 may be implemented by the text classification apparatus. After the first and second classification results are obtained, the categories common to both may be found, and the texts of each common category merged into one text set, finally obtaining the target text, where every text in a target text set belongs to the same category.
The text classification method provided by the embodiments of the present application acquires text to be classified, classifies the text to be classified with a first classifier to obtain a first classification result, and classifies the incorrectly classified text in the first classification result with a second classifier to obtain a second classification result, where the classification parameter of the second classifier is associated with that of the first classifier; based on the first and second classification results, the text corresponding to the first classification result and the text corresponding to the second classification result are processed to obtain target text. In this way, after the text to be classified has been classified, the text that was misclassified can be classified again, and the classifications obtained after this reclassification are all correct; this solves the misclassification problem in existing text classification schemes, improves the accuracy of text classification, and enhances maintainability and scalability.
Based on the foregoing embodiment, an embodiment of the present application provides a text classification method, including the following steps:
Step 201: The text classification apparatus acquires text to be classified.
Step 202: The text classification apparatus classifies the text to be classified with a first classifier to obtain a first classification result.
Specifically, classifying the text to be classified with the first classifier may be performed based on a preset classification parameter. The preset classification parameter may be generated according to feature information of the text to be classified, where the feature information may be parameters capable of characterizing attribute information of the text to be classified, for example tools, musical instruments, and the like.
The first classifier may be a text classifier based on prior rules, whose classification rules are obtained by manual mining or the accumulation of prior knowledge; it may also be a model-based text classifier, including various method models of data mining and machine learning, such as nearest-neighbor classifiers, logistic regression classifiers, decision tree classifiers, naive Bayes classifiers, support vector machine classifiers, and artificial neural network classifiers.
Step 203: The text classification apparatus judges whether incorrectly classified text exists in the first classification result.
Specifically, judging whether incorrectly classified text exists in the first classification result may be implemented by comparing whether the categories of all texts assigned to one group are the same; if the texts assigned to one group involve at least two categories, incorrectly classified text exists in that group.
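The judgment in step 203 can be sketched as follows. How the apparatus knows each text's actual category is not fully specified above, so the `labels` map here is a hypothetical stand-in for that judgment; the group and text names are illustrative.

```python
# Sketch of step 203: a first-pass group is "incorrectly classified"
# when the texts assigned to it actually span more than one category.
def group_is_incorrect(group, labels):
    # A group is incorrect if it contains texts of at least two categories.
    return len({labels[t] for t in group}) >= 2

def find_incorrect_groups(groups, labels):
    # Return every group that needs a second-pass classification.
    return [g for g in groups if group_is_incorrect(g, labels)]

labels = {"t1": "A", "t2": "A", "t3": "B"}   # hypothetical category judgments
groups = [["t1"], ["t2", "t3"]]              # first-pass result: group 2 is mixed
bad = find_incorrect_groups(groups, labels)
```

Only the groups returned by `find_incorrect_groups` are passed to steps 204-207.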
Step 204: If incorrectly classified text exists in the first classification result, the text classification apparatus acquires the incorrectly classified text in the first classification result.
Specifically, if the text to be classified is the article set R, classifying R with the first classifier may yield two groups: the texts of the first group all belong to category A, and the categories of the texts in the second group include both category A and category B. Incorrectly classified text therefore exists in the second group, so the text corresponding to the second group in the first classification result needs to be acquired, i.e., the text of the group that includes both category-A and category-B texts.
Step 205: The text classification apparatus acquires feature information of the incorrectly classified text in the first classification result.
Specifically, after the incorrectly classified text in the first classification result is acquired, the feature information of the corresponding text may be acquired. The definition of feature information here is the same as in step 202, except that here it refers to parameters of attribute information of the incorrectly classified text in the first classification result, for example parameters of attribute information of the text corresponding to the second group.
Step 206: The text classification apparatus sets a classification parameter based on the feature information of the incorrectly classified text in the first classification result.
Specifically, the classification parameter used by the second classifier may be set according to the acquired feature information of the incorrectly classified text in the first classification result, finally enabling classification of the incorrectly classified text in the first classification result.
Step 207: Based on the classification parameter and using the second classifier, the text classification apparatus classifies the incorrectly classified text in the first classification result to obtain a second classification result.
The classification parameter of the first classifier is generated according to feature information of the texts in the text to be classified.
Specifically, the classification parameter of the first classifier is different from that of the second classifier. The classification method employed by the first classifier may be the same as that employed by the second classifier, or the two classifiers may employ different classification methods.
The second classifier may likewise be a text classifier based on prior rules, whose classification rules are obtained by manual mining or the accumulation of prior knowledge, or a model-based text classifier, including various method models of data mining and machine learning, such as nearest-neighbor classifiers, logistic regression classifiers, decision tree classifiers, naive Bayes classifiers, support vector machine classifiers, and artificial neural network classifiers.
For example, suppose the classification methods of both the first and second classifiers are logistic regression classifiers. As shown in FIG. 3, when classifying the article set R (express-news articles), a logistic regression classifier may first be applied based on a set first classification parameter (the original classification model), yielding two groups: the texts of the first group all belong to category A (correctly classified), while the second group includes texts of categories A and B (incorrectly classified); texts that should have been placed in the first group were erroneously placed in the second group and labeled category B. The second group is clearly the one containing misclassifications. The text corresponding to the second group is then acquired, and a logistic regression classifier based on a set second classification parameter (a newly added classification model) reclassifies the second group comprising category-A and category-B texts, yielding two groups: the texts of the third group belong to category A (correct) and those of the fourth group to category B (correct). At this point the categories of all texts in the classification result are correct. The first classification parameter is set according to the feature information of the article set R, while the second classification parameter is set according to the feature information of the text corresponding to the second group. Because the first classification parameter is set according to the feature information of all the texts, i.e., the article set R, the first classification result contains misclassified text; the second classification parameter is set according to the feature information of only the texts misclassified in the first pass (i.e., the texts comprising categories A and B). Because the second classification parameter is set more precisely, the classification results obtained after the second pass are all correct.
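The key point of the example above, that the second classification parameter is derived from the feature information of only the misclassified texts, can be sketched as follows. The word-overlap scoring is a toy stand-in for a logistic regression model, and the seed texts and categories are hypothetical.

```python
# Sketch: deriving a second-pass "classification parameter" from only the
# texts the first pass misclassified, then classifying with it.
from collections import Counter

def build_params(labeled_texts):
    # "Classification parameter": per-category word counts from the given texts.
    params = {}
    for text, cat in labeled_texts:
        params.setdefault(cat, Counter()).update(text.split())
    return params

def classify_with(params, text):
    # Score each category by word overlap with its parameter profile.
    words = text.split()
    return max(params, key=lambda c: sum(params[c][w] for w in words))

# The first-pass parameters came from ALL texts; the refined second-pass
# parameters are built only from the misclassified group's texts.
misclassified = [("star signs new movie deal", "A"),
                 ("leaders meet at border summit", "B")]
second_params = build_params(misclassified)
verdict = classify_with(second_params, "movie premiere draws stars")
```

Restricting `build_params` to the misclassified subset is what makes the second parameter "more precise" in the sense described above: it models exactly the texts the first pass confused.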
Step 208: Based on the first classification result and the second classification result, the text classification apparatus processes the text corresponding to the first classification result and the text corresponding to the second classification result to obtain target text.
Specifically, after the first and second classification results are obtained, they may be compared, and based on the comparison result the text corresponding to the first classification result and the text corresponding to the second classification result are filtered and combined, finally obtaining the target text.
It should be noted that, for explanations of steps or concepts in this embodiment that are the same as in other embodiments, reference may be made to the descriptions in the other embodiments.
The text classification method provided by the embodiments of the present application acquires text to be classified, classifies the text to be classified with a first classifier to obtain a first classification result, and classifies the incorrectly classified text in the first classification result with a second classifier to obtain a second classification result, where the classification parameter of the second classifier is associated with that of the first classifier; based on the first and second classification results, the corresponding texts are processed to obtain target text. In this way, the text that was misclassified can be classified again, and the classifications obtained after reclassification are all correct; this solves the misclassification problem in existing text classification schemes, improves the accuracy of text classification, and enhances maintainability and scalability.
Based on the foregoing embodiments, an embodiment of the present application provides a text classification method. Referring to FIG. 4, the method includes the following steps:
Step 301: The text classification apparatus acquires text to be classified.
Step 302: The text classification apparatus classifies the text to be classified with a first classifier to obtain a first classification result.
Step 303: The text classification apparatus judges whether incorrectly classified text exists in the first classification result.
Step 304: If incorrectly classified text exists in the first classification result, the text classification apparatus acquires the incorrectly classified text in the first classification result.
Step 305: The text classification apparatus acquires feature information of the incorrectly classified text in the first classification result.
Step 306: The text classification apparatus sets a classification parameter based on the feature information of the incorrectly classified text in the first classification result.
Step 307: Based on the classification parameter and using the second classifier, the text classification apparatus classifies the incorrectly classified text in the first classification result to obtain a second classification result.
The classification parameter of the first classifier is generated according to feature information of the texts in the text to be classified.
It should be noted that the classification parameter of the first classifier is different from that of the second classifier. The classification method employed by the first classifier may be the same as that employed by the second classifier, or the two classifiers may employ different classification methods.
For example, suppose the classification method of the first classifier is a logistic regression classifier and that of the second classifier is a decision tree classifier. When classifying the article set R, the logistic regression classifier may first be applied based on a set first classification parameter, yielding three groups: the texts of the first group all belong to category A, the second group includes texts of categories A, B, and C, and the texts of the third group all belong to category C. The second group is clearly the one containing misclassifications. The text corresponding to the second group is then acquired, and the decision tree classifier, based on a set second classification parameter, classifies the texts comprising categories A, B, and C, yielding three groups: the texts of the third group belong to category A, those of the fourth group to category B, and those of the fifth group to category C. At this point the categories of all texts in the classification result are correct. The first classification parameter is set according to the feature information of the article set R, and the second classification parameter according to the feature information of the text corresponding to the second group (i.e., the group comprising categories A, B, and C). Because the first classification parameter is set according to the feature information of all the texts, i.e., the article set R, the first classification result contains misclassified text; the second classification parameter is set according to only the texts misclassified in the first pass (i.e., the texts comprising categories A, B, and C). Because the second classification parameter is set more precisely, the classification results obtained after the second pass are all correct.
Step 308: The text classification apparatus acquires the categories of the correctly classified texts in the first classification result to obtain a first category.
The first category includes at least one category.
Specifically, the categories of the correctly classified texts in the first classification result may be acquired as category A and category C, i.e., the first category may be categories A and C.
Step 309: Based on the first category and the second classification result, the text classification apparatus processes the text corresponding to the first classification result and the text corresponding to the second classification result to obtain the target text.
Specifically, step 309 may be implemented in the following manner:
Step 309a: Based on the second classification result, the text classification apparatus acquires, from the text corresponding to the second classification result, the texts whose category belongs to the first category, obtaining a first text set.
Specifically, the second classification result is analyzed and the texts whose category belongs to the first category are acquired from the text corresponding to the second classification result; that is, texts of categories A and C may be acquired from the texts corresponding to the third, fourth, and fifth groups, so that the texts corresponding to the third and fifth groups are finally acquired as the first text set. The first text set includes at least two texts; in this embodiment the first text set includes texts of two categories (e.g., A and C).
Step 309b: The text classification apparatus combines the first text set with the texts of the same category among the correctly classified texts of a classification result, obtaining a first target text.
Specifically, the category-A texts of the first text set are combined with the texts corresponding to the first group, and at the same time the category-C texts of the first text set are combined with the texts corresponding to the third group, finally obtaining the first target text. It should be noted that the first target text includes texts of at least one category.
The target text includes the first target text and a second target text.
Step 309c: The text classification apparatus acquires, from the text corresponding to the second classification result, the texts whose category is outside the first category set, obtaining the second target text.
Specifically, among the texts corresponding to the second classification result, the texts whose category is outside the first category set are the texts corresponding to the fourth group (i.e., the category-B group of the second classification result); this text is the second target text.
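Steps 308 to 309c can be sketched as a merge of the first pass's correct groups with the second pass's results. The category and text names below are illustrative, mirroring the A/B/C example above.

```python
# Sketch of steps 308-309c: form the target text from both passes.
def merge_results(first_correct, second_result):
    # first_correct: category -> texts correctly classified in pass 1.
    # second_result: category -> texts from reclassifying the mixed group.
    first_category = set(first_correct)              # step 308
    target = {c: list(ts) for c, ts in first_correct.items()}
    for cat, texts in second_result.items():
        if cat in first_category:                    # steps 309a-309b
            target[cat].extend(texts)                # first target text
        else:                                        # step 309c
            target[cat] = list(texts)                # second target text
    return target

first_correct = {"A": ["a1"], "C": ["c1"]}           # groups 1 and 3
second_result = {"A": ["a2"], "B": ["b1"], "C": ["c2"]}
target = merge_results(first_correct, second_result)
```

Categories shared by both passes contribute to the first target text; categories appearing only in the second pass (here, B) form the second target text.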
The classification information of the text finally obtained by the text classification method of the present application is all correct, even if the text to be classified involves multiple levels of classification: because the classification information after the first-level classification can already be guaranteed to be accurate, even if there are more levels of classification afterwards, the accuracy of the final classification result can be guaranteed as long as classification is performed according to the text classification method of the present application.
It should be noted that, for explanations of steps or concepts in this embodiment that are the same as in other embodiments, reference may be made to the descriptions in the other embodiments.
The text classification method provided by the embodiments of the present application acquires text to be classified, classifies the text to be classified with a first classifier to obtain a first classification result, and classifies the incorrectly classified text in the first classification result with a second classifier to obtain a second classification result, where the classification parameter of the second classifier is associated with that of the first classifier; based on the first and second classification results, the corresponding texts are processed to obtain target text. In this way, the text that was misclassified can be classified again, and the classifications obtained after reclassification are all correct; this solves the misclassification problem in existing text classification schemes, improves the accuracy of text classification, and enhances maintainability and scalability.
本申请的实施例提供一种文本分类装置4,所述装置应用于图1~2、4对应的实施例提供的一种文本分类方法中,参照图5所示,该装置包括:第一获取单元41、第一分类单元42、第二分类单元43和处理单元44,其中:
第一获取单元41,用于获取待分类文本。
第一分类单元42,用于采用第一分类器对待分类文本进行分类,得到第一分类结果。
第二分类单元43,用于采用第二分类器对第一分类结果中分类不正确的文本进行分类,得到第二分类结果。
其中,第二分类器的分类参数与第一分类器的分类参数具有关联关系。
处理单元44,用于基于第一分类结果和第二分类结果,对第一分类结果对应的文本和第二分类结果对应的文本进行处理得到目标文本。
本申请的实施例所提供的文本分类装置,获取待分类文本,采用第一分类器对待分类文本进行分类,得到第一分类结果,采用第二分类器对第一分类结果中分类不正确的文本进行分类,得到第二分类结果,第二分类器的分类参数与第一分类器的分类参数具有关联关系,基于第一分类结果和第二分类结果,对第一分类结果对应的文本和第二分类结果对应的文本进行处理得到目标文本;这样,在对待分类文本进行分类之后,可以对分类之后存在错误分类的文本继续进行分类,经过对存在错误分类的文本的再次分类之后得到的文本的分类都是正确的,从而解决 了现有的文本分类方案中存在分类错误的问题,提高了文本分类的准确率,增强了可维护性和可扩展性。
Further, referring to FIG. 6A, the apparatus also includes a judgment unit 45 and a second acquisition unit 46, wherein:

The judgment unit 45 is configured to judge whether the first classification result contains incorrectly classified texts.

The second acquisition unit 46 is configured to obtain the incorrectly classified texts in the first classification result if such texts exist.
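The patent leaves open here how the judgment unit decides that a first classification result contains incorrectly classified texts. One plausible heuristic, offered purely as an assumption of this sketch, is a confidence threshold on the first classifier's scores:

```python
# Hypothetical judgment heuristic: treat any text whose top-class
# probability falls below a threshold as possibly misclassified.
# The threshold value and the score format are assumptions, not from
# the patent.

def suspect_texts(scored, threshold=0.7):
    """scored maps text id -> (label, top-class probability)."""
    return [t for t, (_, p) in scored.items() if p < threshold]

scored = {"t1": ("A", 0.95), "t2": ("B", 0.55), "t3": ("C", 0.88)}
print(suspect_texts(scored))
# → ['t2']
```

Texts flagged this way would then be handed to the second acquisition unit and on to the second classifier; raising the threshold trades a larger second pass for fewer missed errors.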
Specifically, referring to FIG. 6A, the second classification unit 43 includes a first acquisition module 431, a setting module 432 and a classification module 433, wherein:

The first acquisition module 431 is configured to obtain feature information of the incorrectly classified texts in the first classification result.

The setting module 432 is configured to set classification parameters based on the feature information of the incorrectly classified texts in the first classification result.

The classification module 433 is configured to classify, based on the classification parameters and using the second classifier, the incorrectly classified texts in the first classification result to obtain the second classification result.

The classification parameters of the first classifier are generated from feature information of the texts in the text to be classified.
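The setting module's idea — deriving the second pass's parameters only from the misclassified subset — can be sketched as follows. "Classification parameters" are modelled here simply as the most frequent terms of that subset; this is an assumption for illustration, since the patent leaves the concrete parameter form open:

```python
from collections import Counter

# Hypothetical setting module: derive the second classifier's
# "classification parameters" (here, a small term vocabulary) from the
# feature information of the misclassified texts only, so the second
# pass is tuned to exactly the texts the first pass got wrong.

def set_parameters(misclassified_texts, top_k=5):
    counts = Counter(w for t in misclassified_texts for w in t.split())
    return [w for w, _ in counts.most_common(top_k)]

params = set_parameters(["vote result out", "election vote held"])
print(params)  # 'vote' ranks first as the most frequent term
```

Because the vocabulary is fitted on the misclassified subset rather than on the whole corpus, it concentrates on the features that actually distinguish the confusable categories — the precision gain the patent attributes to the second classification parameters.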
Specifically, referring to FIG. 6A, the processing unit 44 includes a second acquisition module 441 and a processing module 442, wherein:

The second acquisition module 441 is configured to obtain the categories of the correctly classified texts in the first classification result, obtaining a first category.

The first category includes at least one category.

The processing module 442 is configured to process, based on the first category and the second classification result, the texts corresponding to the first classification result and the texts corresponding to the second classification result to obtain the target text.

Further, the processing module 442 is specifically configured to perform the following steps:

obtaining, based on the second classification result, the texts corresponding to the second classification result whose category belongs to the first category, to obtain a first text set;

combining the first text set with the texts of the same category among the correctly classified texts in the first classification result, to obtain a first target text;

obtaining the texts corresponding to the second classification result whose category falls outside the first category, to obtain a second target text.

The target text includes the first target text and the second target text.
Specifically, the classification parameters of the first classifier differ from the classification parameters of the second classifier.

The classification method adopted by the first classifier is the same as the classification method adopted by the second classifier.

Alternatively, the classification method adopted by the first classifier differs from the classification method adopted by the second classifier.

It should be noted that, for the interaction among the units and modules of this embodiment, reference may be made to the interaction in the text classification method provided by the embodiments corresponding to FIGs. 1-2 and 4, which is not repeated here.

The text classification apparatus provided by the embodiments of the present application obtains a text to be classified; classifies it with a first classifier to obtain a first classification result; classifies the incorrectly classified texts in the first classification result with a second classifier to obtain a second classification result, the classification parameters of the second classifier being associated with those of the first classifier; and, based on the first and second classification results, processes the texts corresponding to the first classification result and the texts corresponding to the second classification result to obtain the target text. In this way, after the text to be classified has been classified, the texts misclassified in that pass can be classified again, and the classifications obtained after this second pass are all correct. This resolves the classification errors present in existing text classification schemes, improves the accuracy of text classification, and enhances maintainability and extensibility.
Based on the foregoing embodiments, an embodiment of the present application provides a text classification device 5, applicable to the text classification method provided by the embodiments corresponding to FIGs. 1-2 and 4. Referring to FIG. 7, the device includes a memory 51 and a processor 52, wherein:

The memory 51 stores instructions executable by the processor 52; when the instructions are executed, the processor 52 is configured to:

store a text to be classified; classify the text to be classified with a first classifier to obtain a first classification result; classify the incorrectly classified texts in the first classification result with a second classifier to obtain a second classification result, wherein the classification parameters of the second classifier are associated with the classification parameters of the first classifier; and process, based on the first classification result and the second classification result, the texts corresponding to the first classification result and the texts corresponding to the second classification result to obtain a target text.

It should be noted that, for the interaction between the memory and the processor in this embodiment, reference may be made to the interaction in the text classification method provided by the embodiments corresponding to FIGs. 1-2 and 4, which is not repeated here.
FIG. 6B is a schematic structural diagram of a system 600B to which the text classification method provided by the embodiments of the present application is applicable. The system 600B includes at least a terminal device 601, a text server 602 and a network 603, and may further include a resource server 604.

The terminal device 601 is a terminal device with data computation and processing capabilities, including but not limited to smartphones (fitted with a communication module), palmtop computers, tablet computers, and the like. An operating system is installed on each terminal device 601, including but not limited to the Android, Symbian, Windows Mobile and Apple iPhone OS operating systems. An application (e.g., a news app or a reading PC client) is installed on the terminal device 601 and, through the network 603, exchanges information with the application server software installed on the text server 602 (e.g., the application server software corresponding to the news app or reading PC client), for example sending text acquisition requests to the text server 602 and receiving text information sent by the text server 602.

Text application server software is installed on the text server 602 and, through the network 603, provides the corresponding text resources (e.g., text information) to the application installed on the terminal device 601.
The terminal device 601 may receive an acquisition instruction or viewing instruction sent by a user (e.g., an instruction to acquire or view a news text), the instruction carrying identification information (e.g., text identification information). In response to the instruction, the terminal device 601 sends an acquisition request or viewing request to the text server 602; the request carries the identification information (e.g., text identification information) and may also carry user identification information (e.g., a user ID). In response to the request, the text server 602 returns texts to be classified (e.g., multiple news texts of different kinds) to the terminal device 601, which receives them and stores them locally. The terminal device 601 classifies the texts to be classified with a first classifier stored locally (e.g., a logistic regression classifier), obtaining a first classification result (e.g., category A and category B, where category A may be entertainment news and category B may be international news). The terminal device 601 then classifies the incorrectly classified texts in the first classification result with a second classifier stored locally (e.g., a logistic regression classifier) — for instance, after the terminal device 601 judges the first classification result and determines that the category-B texts contain misclassified texts, its second classifier classifies the category-B texts — obtaining a second classification result (e.g., category A and category B, where category A may be entertainment news and category B may be international news), wherein the classification parameters of the second classifier are associated with the classification parameters of the first classifier. Based on the first and second classification results, the terminal device 601 processes the texts corresponding to the first classification result and the texts corresponding to the second classification result to obtain the target text (e.g., the terminal device 601 merges the category-A texts of the first classification result with the category-A texts of the second classification result, obtaining the final category-A texts together with the category-B texts of the second classification result). Having obtained the target text (i.e., the correctly classified texts), the terminal device 601, in response to the acquisition or viewing instruction, displays the information of the target text (e.g., news headlines) according to the final classification result (e.g., showing the headlines of category-A entertainment news under category-A entertainment news, and the headlines of category-B international news under category-B international news), so that users can conveniently view or acquire texts according to their own needs.
The text server 602 may send an acquisition request carrying identification information (e.g., text identification information) through the network 603 to the resource server 604 (e.g., a server storing texts of various kinds). In response to the request, the resource server 604 returns multiple texts to be classified (e.g., multiple entertainment news texts, multiple international news texts and multiple military news texts), which the text server 602 stores locally. The text server 602 classifies the texts to be classified with a first classifier stored locally (e.g., a logistic regression classifier), obtaining a first classification result (e.g., categories A, B and C, where category A may be entertainment news, category B may be international news and category C may be military news). The text server 602 then classifies the incorrectly classified texts in the first classification result with a second classifier stored locally (e.g., a logistic regression classifier) — for instance, after the text server 602 judges the first classification result and determines that the category-B texts contain misclassified texts, its second classifier classifies the category-B texts — obtaining a second classification result (e.g., categories A, B and C, as above), wherein the classification parameters of the second classifier are associated with the classification parameters of the first classifier. Based on the first and second classification results, the text server 602 processes the texts corresponding to the first classification result and the texts corresponding to the second classification result to obtain the target text (e.g., the text server 602 merges the category-A and category-C texts of the first classification result with the category-A and category-C texts of the second classification result, obtaining the final category-A texts and category-C texts together with the category-B texts of the second classification result). Having obtained the target text (i.e., the correctly classified texts), the text server 602 receives an acquisition request carrying identification information (e.g., text identification information) sent by the terminal device 601, and in response returns the information of the target text (e.g., the titles and categories of the correctly classified texts) to the terminal device 601. After receiving this information, the terminal device 601, in response to the user's acquisition instruction, displays the text information according to the categories of the target text (e.g., entertainment news headlines under the entertainment news category, international news headlines under the international news category, and military news headlines under the military news category), so that users can view texts as needed.
Resource application server software is installed on the resource server 604 and, through the network 603, exchanges information with the text application server software on the text server 602 to provide the corresponding text resources (e.g., text information).

The network 603 may be a wired network or a wireless network.

The text classification device provided by the embodiments of the present application obtains a text to be classified; classifies it with a first classifier to obtain a first classification result; classifies the incorrectly classified texts in the first classification result with a second classifier to obtain a second classification result, the classification parameters of the second classifier being associated with those of the first classifier; and, based on the first and second classification results, processes the texts corresponding to the first classification result and the texts corresponding to the second classification result to obtain the target text. In this way, after the text to be classified has been classified, the texts misclassified in that pass can be classified again, and the classifications obtained after this second pass are all correct. This resolves the classification errors present in existing text classification schemes, improves the accuracy of text classification, and enhances maintainability and extensibility.
In practical applications, the first acquisition unit 41, the first classification unit 42, the second classification unit 43, the processing unit 44, the judgment unit 45, the second acquisition unit 46, the first acquisition module 431, the setting module 432, the classification module 433, the second acquisition module 441 and the processing module 442 may each be implemented by a central processing unit (CPU), a micro processor unit (MPU), a digital signal processor (DSP) or a field programmable gate array (FPGA) located in the wireless data transmission device.
FIG. 8 shows a structural diagram 800 of the specific hardware of the text classification device 5. As shown in FIG. 8, besides one or more processors (CPUs) 802 and a memory 806, the text classification device 5 may further include a communication module 804, a user interface 810, and a communication bus 808 interconnecting these components.

The processor 802 can receive and send data through the communication module 804 to realize network communication and/or local communication.

The user interface 810 includes one or more output devices 812, including one or more loudspeakers and/or one or more visual displays. The user interface 810 also includes one or more input devices 814, including, for example, a keyboard, a mouse, a voice command input unit or microphone, a touch-screen display, a touch-sensitive input panel, a gesture-capture camera, or other input buttons or controls.

The memory 806 may be a high-speed random access memory, such as DRAM, SRAM, DDR RAM or another random-access solid-state storage device; or a non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices or other non-volatile solid-state storage devices.

The memory 806 stores an instruction set executable by the processor 802, including:

an operating system 816, including programs for handling various basic system services and for performing hardware-related tasks;

applications 818, including various application programs capable of realizing the processing flows in the above examples, which may, for example, include some or all of the modules of the text classification apparatus 4 shown in FIG. 5. At least one of the modules 41-44 may store machine-executable instructions, and by executing the machine-executable instructions of at least one of the modules 41-44 in the memory 806, the processor 802 can realize the function of at least one of the modules 41-44.
It should be noted that not all steps and modules in the above flows and structural diagrams are necessary; some steps or modules may be omitted according to actual needs. The execution order of the steps is not fixed and may be adjusted as required. The division into modules is merely a functional division adopted for convenience of description; in actual implementation, one module may be realized by several modules, the functions of several modules may be realized by a single module, and these modules may be located in the same device or in different devices.

The hardware modules in the embodiments may be implemented in hardware, or as a hardware platform plus software. The above software includes machine-readable instructions stored in a non-volatile storage medium; therefore, the embodiments may also be embodied as software products.

In the examples, the hardware may be implemented by dedicated hardware or by hardware executing machine-readable instructions. For example, the hardware may be a specially designed permanent circuit or logic device (e.g., a dedicated processor such as an FPGA or ASIC) for performing particular operations. The hardware may also include programmable logic devices or circuits temporarily configured by software (e.g., including a general-purpose processor or another programmable processor) for performing particular operations.

In addition, each example of the present application may be realized by a data processing program executed by a data processing device such as a computer. Evidently, the data processing programs constitute the present application. Moreover, a data processing program usually stored in a storage medium is executed by reading the program directly out of the storage medium, or by installing or copying the program into a storage device (such as a hard disk and/or memory) of the data processing device. Therefore, such storage media also constitute the present application. The present application further provides a non-volatile storage medium storing a data processing program that can be used to execute any of the above method examples of the present application.

The machine-readable instructions corresponding to the modules in FIG. 5 can cause the operating system operating on the computer, and the like, to complete some or all of the operations described here. The non-volatile computer-readable storage medium may be a memory set in an expansion board inserted in the computer, or may be written to a memory set in an expansion unit connected to the computer. A CPU or the like installed on the expansion board or expansion unit can perform some or all of the actual operations according to the instructions.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) containing computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

The above are merely preferred embodiments of the present application and are not intended to limit the protection scope of the present application.

Claims (15)

  1. A text classification method, applied to a computing device, the method comprising:
    obtaining a text to be classified;
    classifying the text to be classified with a first classifier to obtain a first classification result;
    classifying the incorrectly classified texts in the first classification result with a second classifier to obtain a second classification result, wherein the classification parameters of the second classifier are associated with the classification parameters of the first classifier; and
    processing, based on the first classification result and the second classification result, the texts corresponding to the first classification result and the texts corresponding to the second classification result to obtain a target text.
  2. The method according to claim 1, further comprising:
    judging whether the first classification result contains incorrectly classified texts; and
    if the first classification result contains incorrectly classified texts, obtaining the incorrectly classified texts in the first classification result.
  3. The method according to claim 1, wherein classifying the incorrectly classified texts in the first classification result with a second classifier to obtain a second classification result comprises:
    obtaining feature information of the incorrectly classified texts in the first classification result;
    setting classification parameters based on the feature information of the incorrectly classified texts in the first classification result; and
    classifying, based on the classification parameters and using the second classifier, the incorrectly classified texts in the first classification result to obtain the second classification result; wherein the classification parameters of the first classifier are generated from feature information of the texts in the text to be classified.
  4. The method according to claim 1, wherein processing, based on the first classification result and the second classification result, the texts corresponding to the first classification result and the texts corresponding to the second classification result to obtain a target text comprises:
    obtaining the categories of the correctly classified texts in the first classification result to obtain a first category, wherein the first category includes at least one category; and
    processing, based on the first category and the second classification result, the texts corresponding to the first classification result and the texts corresponding to the second classification result to obtain the target text.
  5. The method according to claim 4, wherein processing, based on the first category and the second classification result, the texts corresponding to the first classification result and the texts corresponding to the second classification result to obtain the target text comprises:
    obtaining, based on the second classification result, the texts corresponding to the second classification result whose category belongs to the first category, to obtain a first text set;
    combining the first text set with the texts of the same category among the correctly classified texts in the first classification result, to obtain a first target text; and
    obtaining the texts corresponding to the second classification result whose category falls outside the first category, to obtain a second target text; wherein the target text includes the first target text and the second target text.
  6. The method according to claim 1, wherein:
    the classification parameters of the first classifier differ from the classification parameters of the second classifier; and
    the classification parameters of the second classifier are set according to feature information of the texts that were misclassified after the first classification.
  7. The method according to claim 1, wherein the classification method adopted by the first classifier is the same as the classification method adopted by the second classifier;
    or, the classification method adopted by the first classifier differs from the classification method adopted by the second classifier.
  8. A text classification device, applied to a computing device, the text classification device comprising a memory and a processor, wherein:
    the memory stores instructions executable by the processor, and when the instructions are executed, the processor is configured to:
    obtain a text to be classified;
    classify the text to be classified with a first classifier to obtain a first classification result;
    classify the incorrectly classified texts in the first classification result with a second classifier to obtain a second classification result, wherein the classification parameters of the second classifier are associated with the classification parameters of the first classifier; and
    process, based on the first classification result and the second classification result, the texts corresponding to the first classification result and the texts corresponding to the second classification result to obtain a target text.
  9. The device according to claim 8, wherein the processor is further configured to:
    judge whether the first classification result contains incorrectly classified texts; and
    if the first classification result contains incorrectly classified texts, obtain the incorrectly classified texts in the first classification result.
  10. The device according to claim 9, wherein the processor is further configured to:
    obtain feature information of the incorrectly classified texts in the first classification result;
    set classification parameters based on the feature information of the incorrectly classified texts in the first classification result; and
    classify, based on the classification parameters and using the second classifier, the incorrectly classified texts in the first classification result to obtain the second classification result; wherein the classification parameters of the first classifier are generated from feature information of the texts in the text to be classified.
  11. The device according to claim 8, wherein the processor is further configured to:
    obtain the categories of the correctly classified texts in the first classification result to obtain a first category, wherein the first category includes at least one category; and
    process, based on the first category and the second classification result, the texts corresponding to the first classification result and the texts corresponding to the second classification result to obtain the target text.
  12. The device according to claim 11, wherein the processor is further configured to:
    obtain, based on the second classification result, the texts corresponding to the second classification result whose category belongs to the first category, to obtain a first text set;
    combine the first text set with the texts of the same category among the correctly classified texts in the first classification result, to obtain a first target text; and
    obtain the texts corresponding to the second classification result whose category falls outside the first category, to obtain a second target text; wherein the target text includes the first target text and the second target text.
  13. The device according to claim 8, wherein the classification parameters of the first classifier differ from the classification parameters of the second classifier; and
    the classification parameters of the second classifier are set according to feature information of the texts that were misclassified after the first classification.
  14. The device according to claim 9, wherein the classification method adopted by the first classifier is the same as the classification method adopted by the second classifier;
    or, the classification method adopted by the first classifier differs from the classification method adopted by the second classifier.
  15. A non-volatile computer-readable storage medium storing a computer program for executing the method according to any one of claims 1 to 7.
PCT/CN2018/079136 2017-03-17 2018-03-15 Text classification method, device and storage medium WO2018166499A1 (zh)

Applications Claiming Priority (2)

- CN201710159632.6 — priority date 2017-03-17
- CN201710159632.6A — filed 2017-03-17, published as CN108628873B: "一种文本分类方法、装置和设备" (Text classification method, apparatus and device)

Publications (1)

- WO2018166499A1

Family ID: 63522764

Family Applications (1)

- PCT/CN2018/079136 — filed 2018-03-15: "文本分类方法、设备和存储介质" (Text classification method, device and storage medium), published as WO2018166499A1

Country Status (2)

- CN (1): CN108628873B
- WO (1): WO2018166499A1


Also Published As

- CN108628873A (zh)
- CN108628873B (zh) — 2022-09-27
