CN115186057A

CN115186057A - Method and device for obtaining text classification model

Info

Publication number: CN115186057A
Application number: CN202210794143.9A
Authority: CN
Inventors: 方科
Original assignee: Bank of China Ltd
Current assignee: Bank of China Ltd
Priority date: 2022-07-07
Filing date: 2022-07-07
Publication date: 2022-10-14

Abstract

The application discloses a method and a device for obtaining a text classification model, which can be applied to the field of artificial intelligence or the field of finance. The method comprises the following steps: selecting target keywords contained in a target text from a keyword library, wherein the keyword library contains a plurality of keywords, and each keyword corresponds to one text category; determining a target category identification corresponding to the target text according to the target keywords, wherein the target category identification is used for indicating the text category of the target text; and training an artificial intelligence model through the target text containing the target category identification to obtain a text classification model, wherein the text classification model is used for classifying text. The target category identification of the target text can thus be determined through keywords, without relying on manual labeling of the sample data. On the one hand, this reduces the cost of training the model and increases the training speed. On the other hand, a large amount of sample data for training can be determined through the keywords, so that the quality of the trained text classification model is guaranteed.

Description

Method and device for obtaining text classification model

Technical Field

The application relates to the field of artificial intelligence, in particular to a method and a device for obtaining a text classification model.

Background

Classifying text is a common means of organizing textual information. In many scenarios, the volume of text is large and timeliness requirements are high, so manual analysis is almost impossible; examples include customer service conversation text, product reviews published by customers, and the massive amount of financial information generated every moment. The text therefore needs to be automatically classified and labeled by a computer. However, in a specific task scenario, a common and important challenge is that the business has already determined the specific categories into which the text is to be divided, but the collected samples carry no category identification. In this case, the categories of the collected samples need to be labeled manually to obtain training samples for the artificial intelligence model. However, manually labeling a large amount of text is both costly and time consuming.

Disclosure of Invention

In order to solve the technical problem, the application provides a method and a device for obtaining a text classification model, which are used for obtaining a trained text classification model relatively quickly at a relatively low cost.

In order to achieve the above purpose, the technical solutions provided in the embodiments of the present application are as follows:

the embodiment of the application provides a method for obtaining a text classification model, which comprises the following steps:

selecting target keywords contained in a target text from a keyword library, wherein the keyword library contains a plurality of keywords, and each keyword corresponds to one text category;

determining a target category identification corresponding to the target text according to the target keyword; the target category identification is used for indicating a text category of the target text;

and training an artificial intelligence model through a target text containing the target category identification to obtain a text classification model, wherein the text classification model is used for classifying the text.

As a possible implementation manner, the training an artificial intelligence model through a target text containing the target class identifier to obtain a text classification model includes:

dividing the target texts containing the target category identifications into a training set and a test set;

training the artificial intelligence model through the training set to obtain a trained artificial intelligence model;

and testing the artificial intelligence model through the test set, and determining the artificial intelligence model passing the test as the text classification model.

As a possible implementation, the testing the artificial intelligence model through the text of the test set, and determining the artificial intelligence model passing the test as a text classification model includes:

inputting a first text in the test set into the trained artificial intelligence model to obtain a first category identifier corresponding to the first text;

judging whether a target category identification corresponding to the first text is consistent with the first category identification;

when the target category identification corresponding to the first text is consistent with the first category identification, transferring the first text from the test set to the training set;

when the test set does not contain text, determining the artificial intelligence model as a text classification model.

As a possible implementation manner, the inputting a first text in the test set into the trained artificial intelligence model, and obtaining a first category identifier corresponding to the first text includes:

inputting a first text in the test set into the trained artificial intelligence model to obtain a first class identifier corresponding to the first text and a confidence coefficient corresponding to the first class identifier;

the determining whether the target category identifier corresponding to the first text is consistent with the first category identifier includes:

and when the confidence corresponding to the first category identification is larger than a preset threshold value, judging whether the target category identification corresponding to the first text is consistent with the first category identification.

As a possible implementation, the method further includes:

inputting a second text in the test set into the trained artificial intelligence model to obtain a second category identification corresponding to the second text;

judging whether the target category identification corresponding to the second text is consistent with the second category identification;

and when the target category identification corresponding to the second text is inconsistent with the second category identification, training the artificial intelligence model through a training set containing the first text.

As a possible implementation manner, the inputting a second text in the test set into the trained artificial intelligence model to obtain a second category identification corresponding to the second text includes:

inputting a second text in the test set into the trained artificial intelligence model to obtain a second category identification corresponding to the second text and a confidence coefficient corresponding to the second category identification;

the determining whether the target category identifier corresponding to the second text is consistent with the second category identifier includes:

and when the confidence corresponding to the second category identification is larger than a preset threshold, judging whether the target category identification corresponding to the second text is consistent with the second category identification.

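As a possible implementation manner, the training an artificial intelligence model through a target text containing the target category identification to obtain a text classification model includes:

training the artificial intelligence model by adopting a supervised learning algorithm through the target text containing the target category identification to obtain the text classification model.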

The embodiment of the present application further provides an obtaining apparatus of a text classification model, including:

a selection module, used for selecting target keywords contained in a target text from a keyword library, wherein the keyword library contains a plurality of keywords, and each keyword corresponds to one text category;

the determining module is used for determining a target category identifier corresponding to the target text according to the target keyword; the target category identification is used for indicating a text category of the target text;

and the training module is used for training an artificial intelligent model through a target text containing the target category identification to obtain a text classification model, and the text classification model is used for classifying the text.

As a possible implementation, the training module comprises:

a classification unit, used for dividing the target texts containing the target category identification into a training set and a test set;

a training set training unit, used for training the artificial intelligence model through the training set to obtain a trained artificial intelligence model;

and a test unit, used for testing the artificial intelligence model through the test set and determining the artificial intelligence model passing the test as the text classification model.

As a possible implementation, the test unit is specifically configured to:

inputting a first text in the test set into the trained artificial intelligence model to obtain a first class identifier corresponding to the first text;

judging whether the target category identification corresponding to the first text is consistent with the first category identification;

According to the technical scheme, the method has the following beneficial effects:

the embodiment of the application provides a method for obtaining a text classification model, which comprises the following steps: selecting target keywords contained in a target text from a keyword library, wherein the keyword library contains a plurality of keywords, and each keyword corresponds to one text category; determining a target category identification corresponding to the target text according to the target keyword; the target category identification is used for indicating a text category of the target text; training the artificial intelligence model through a target text containing a target category identifier to obtain a text classification model, wherein the text classification model is used for classifying the text.

Therefore, according to the method for obtaining the text classification model, the target category identification corresponding to the target text is determined by matching the target text with the keywords in the keyword library, and the artificial intelligence model is trained through the target text containing the target category identification to obtain the text classification model. The method provided by the embodiment of the application can thus determine the target category identification of the target text through the keywords, without manual labeling of the sample data. On the one hand, this reduces the cost of training the model and increases the training speed. On the other hand, a large amount of sample data for training can be determined through the keywords, so that the quality of the trained text classification model is guaranteed.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

Fig. 1 is a flowchart of a method for obtaining a text classification model according to an embodiment of the present disclosure;

Fig. 2 is a flowchart of another method for obtaining a text classification model according to an embodiment of the present disclosure;

Fig. 3 is a schematic diagram of an apparatus for obtaining a text classification model according to an embodiment of the present application.

Detailed Description

In order to help better understand the scheme provided in the embodiment of the present application, before the method provided in the embodiment of the present application is introduced, a scenario of an application of the scheme in the embodiment of the present application is introduced.

In order to solve the foregoing technical problem, an embodiment of the present application provides a method for obtaining a text classification model, including: selecting target keywords contained in a target text from a keyword library, wherein the keyword library contains a plurality of keywords, and each keyword corresponds to one text category; determining a target category identification corresponding to the target text according to the target keywords, wherein the target category identification is used for indicating the text category of the target text; and training an artificial intelligence model through the target text containing the target category identification to obtain a text classification model, wherein the text classification model is used for classifying the text.

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, embodiments accompanying the drawings are described in detail below.

Referring to fig. 1, this figure is a flowchart of a method for obtaining a text classification model according to an embodiment of the present application.

As shown in fig. 1, the method for obtaining a text classification model provided in the embodiment of the present application includes:

S101: selecting a target keyword contained in the target text from a keyword library, wherein the keyword library contains a plurality of keywords, and each keyword corresponds to one text category.

S102: determining a target category identification corresponding to the target text according to the target keyword; the target category identification is used for indicating the text category of the target text.

S103: training the artificial intelligence model through a target text containing a target category identifier to obtain a text classification model, wherein the text classification model is used for classifying the text.
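Steps S101 and S102 can be illustrated with a minimal Python sketch. The keyword library contents, the function names, the case-insensitive substring matching, and the majority vote used when keywords from several categories match are all assumptions made for illustration; the embodiment does not specify how matching or conflict resolution is performed.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical keyword library: each keyword corresponds to one text category.
KEYWORD_LIBRARY: Dict[str, str] = {
    "loan": "credit",
    "mortgage": "credit",
    "interest rate": "credit",
    "transfer": "payments",
    "remittance": "payments",
}


def select_target_keywords(text: str, library: Dict[str, str]) -> List[str]:
    """S101: return the library keywords contained in the text.

    Case-insensitive substring matching is an assumption of this sketch.
    """
    lowered = text.lower()
    return [keyword for keyword in library if keyword in lowered]


def determine_category(keywords: Iterable[str], library: Dict[str, str]) -> Optional[str]:
    """S102: derive one category identification from the matched keywords.

    A majority vote is assumed here; ties go to the category matched first.
    """
    votes = Counter(library[keyword] for keyword in keywords)
    if not votes:
        return None  # no keyword matched, so the text stays unlabeled
    return votes.most_common(1)[0][0]


def label_texts(texts: Iterable[str], library: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (text, category identification) pairs for texts that matched a keyword."""
    labeled = []
    for text in texts:
        category = determine_category(select_target_keywords(text, library), library)
        if category is not None:
            labeled.append((text, category))
    return labeled
```

The labeled pairs returned by `label_texts` would then serve as the training samples used in S103.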

It should be noted that the target text in the embodiment of the present application may be one text or a plurality of texts, which is not limited herein. A supervised learning algorithm, such as a Support Vector Machine (SVM), can be adopted to train the artificial intelligence model to obtain the text classification model. In practical applications, the name of each text category may be obtained first, and synonyms of the category name may then be obtained through a synonym model, for example word2vec (word to vector). The category name and its synonyms are then stored in the keyword library. The category name and its synonyms correspond to the same category, and therefore share the same category identification.
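The construction of the keyword library can be sketched as follows. The function `build_keyword_library` and the lookup table `TOY_SYNONYMS` are hypothetical; the table merely stands in for a trained word2vec model, whose nearest-neighbor query (for example `KeyedVectors.most_similar` in gensim) could supply the synonyms in practice. Using the category name itself as the category identification is also an assumption.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical synonym table standing in for a word2vec nearest-neighbor query.
TOY_SYNONYMS: Dict[str, List[str]] = {
    "credit": ["loan", "mortgage"],
    "payments": ["transfer", "remittance"],
}


def toy_synonyms(category_name: str) -> List[str]:
    """Return synonyms for a category name (empty list if none are known)."""
    return TOY_SYNONYMS.get(category_name, [])


def build_keyword_library(
    category_names: Iterable[str],
    find_synonyms: Callable[[str], List[str]],
) -> Dict[str, str]:
    """Map each category name and each of its synonyms to the same category identification."""
    library: Dict[str, str] = {}
    for name in category_names:
        for keyword in [name, *find_synonyms(name)]:
            # setdefault keeps the first category if a keyword appears twice,
            # so every keyword corresponds to exactly one category.
            library.setdefault(keyword, name)
    return library
```

Passing the synonym lookup in as a function keeps the sketch independent of any particular embedding library.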

As a possible implementation manner, in the present application, training an artificial intelligence model through a target text including a target category identifier to obtain a text classification model may include: dividing target texts containing target category identifications into a training set and a test set; training the artificial intelligence model through a training set to obtain a trained artificial intelligence model; and testing the artificial intelligence model through the test set, and determining the artificial intelligence model passing the test as a text classification model. As an example, 70% of the target text may be used as a training set and 30% as a test set.
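The 70%/30% division can be sketched as below. The function name, the shuffling before the split, and the fixed random seed are assumptions; the embodiment only states the proportions as an example.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(
    samples: Sequence[T], train_ratio: float = 0.7, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the labeled samples and divide them into a training set and a test set."""
    shuffled = list(samples)
    # A seeded generator makes the split reproducible (an assumption of this sketch).
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```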

Specifically, the first text in the test set can be input into the trained artificial intelligence model, and the first category identification corresponding to the first text is obtained; judging whether the target category identification corresponding to the first text is consistent with the first category identification; and when the target class identification corresponding to the first text is consistent with the first class identification, transferring the first text from the test set to the training set. And when the target category identification corresponding to the first text is not consistent with the first category identification, continuously storing the first text in the test set.

Similarly, a second text in the test set can be input into the trained artificial intelligence model to obtain a second category identification corresponding to the second text, and whether the target category identification corresponding to the second text is consistent with the second category identification is judged. When the target category identification corresponding to the second text is inconsistent with the second category identification, the second text remains in the test set. If the test set contains a third text, the third text can be processed in the same way, until every text in the test set has been processed. It should be noted that, after all texts in the test set have been processed according to the above flow, the number of texts in the test set decreases and the number of texts in the training set increases. At this point, the artificial intelligence model can be trained again through the training set containing the first text, that is, the enlarged training set. The retrained artificial intelligence model is then tested again through the test set, and this cycle repeats until all texts in the test set have been transferred to the training set. When the test set contains no text, that is, the number of texts in the test set has been reduced to 0, the artificial intelligence model is determined as the text classification model.
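The train, test, and transfer cycle described above can be sketched in Python. `WordVoteClassifier` is a hypothetical toy model that stands in for the artificial intelligence model (the embodiment mentions an SVM as one option). The `max_rounds` limit and the early stop when no text can be transferred are assumptions added so that the loop always terminates; the embodiment does not say what happens if the test set never becomes empty.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # (text, target category identification)


class WordVoteClassifier:
    """Toy stand-in for the artificial intelligence model (an assumption)."""

    def fit(self, samples: List[Sample]) -> None:
        # Count, for every word, how often it appears under each category.
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.default = Counter(label for _, label in samples).most_common(1)[0][0]
        for text, label in samples:
            for word in text.lower().split():
                self.word_counts[word][label] += 1

    def predict(self, text: str) -> str:
        votes: Counter = Counter()
        for word in text.lower().split():
            votes.update(self.word_counts.get(word, {}))
        # Fall back to the most frequent training category for unseen words.
        return votes.most_common(1)[0][0] if votes else self.default


def train_until_test_set_empty(
    train: List[Sample], test: List[Sample], max_rounds: int = 10
) -> Tuple[WordVoteClassifier, List[Sample], List[Sample]]:
    """Retrain while moving correctly predicted test texts into the training set."""
    train, test = list(train), list(test)
    model = WordVoteClassifier()
    model.fit(train)
    for _ in range(max_rounds):  # max_rounds is an assumption, not in the source
        if not test:
            break  # test set is empty: the model is taken as the classification model
        moved, kept = [], []
        for sample in test:
            text, target = sample
            (moved if model.predict(text) == target else kept).append(sample)
        if not moved:
            break  # no progress possible; stopping here is an assumption
        train.extend(moved)
        test = kept
        model.fit(train)  # train again on the enlarged training set
    return model, train, test
```

With a real model, `fit` and `predict` would be replaced by the training and inference calls of the chosen supervised learning algorithm.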

It should be noted that, because the cost of labeling by keywords is low, the embodiment of the application can obtain a large amount of target texts to train the artificial intelligence model. Because a large amount of target texts are used for training, the accuracy of the text classification model obtained by training can be superior to that of the original method for labeling through keywords.

In order to improve the training efficiency of the artificial intelligence model, whether the category identifications need to be compared for consistency can be decided according to the confidence of the prediction result of the artificial intelligence model. Specifically, in the present application, inputting a first text in the test set into the trained artificial intelligence model to obtain a first category identification corresponding to the first text includes: inputting the first text in the test set into the trained artificial intelligence model to obtain the first category identification corresponding to the first text and a confidence corresponding to the first category identification. Then, when the confidence corresponding to the first category identification is greater than a preset threshold, whether the target category identification corresponding to the first text is consistent with the first category identification is judged. When the confidence corresponding to the first category identification is smaller than the preset threshold, the judgment may be skipped, and the first text remains in the test set.

Correspondingly, inputting a second text in the test set into the trained artificial intelligence model to obtain a second category identification corresponding to the second text includes: inputting the second text in the test set into the trained artificial intelligence model to obtain the second category identification corresponding to the second text and a confidence corresponding to the second category identification. Judging whether the target category identification corresponding to the second text is consistent with the second category identification includes: when the confidence corresponding to the second category identification is greater than the preset threshold, judging whether the target category identification corresponding to the second text is consistent with the second category identification.
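The confidence gate can be sketched as a small helper. The function name, the representation of a prediction as a (category identification, confidence) pair, and the threshold value 0.8 are assumptions; the embodiment only refers to a preset threshold.

```python
from typing import List, Tuple

Sample = Tuple[str, str]        # (text, target category identification)
Prediction = Tuple[str, float]  # (predicted category identification, confidence)


def partition_by_confidence(
    test: List[Sample], predictions: List[Prediction], threshold: float = 0.8
) -> Tuple[List[Sample], List[Sample]]:
    """Split test samples into those moved to the training set and those kept.

    A sample is moved only if the confidence exceeds the threshold AND the
    predicted identification equals the target identification. Low-confidence
    predictions are not compared at all, so those samples stay in the test set.
    """
    moved: List[Sample] = []
    kept: List[Sample] = []
    for sample, (predicted, confidence) in zip(test, predictions):
        _, target = sample
        if confidence > threshold and predicted == target:
            moved.append(sample)
        else:
            kept.append(sample)
    return moved, kept
```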

Referring to fig. 2, this figure is a flowchart of another method for obtaining a text classification model according to an embodiment of the present application.

As shown in Fig. 2, the method for obtaining a text classification model provided in the embodiment of the present application includes the following. First, synonyms of each category name are obtained, and the texts are matched against these keywords to obtain a text label (classification label) for each text. The labeled texts are then divided into a test set and a labeled training set, and the texts in the training set are used to train the text classification model. The trained model is used to predict the categories of the texts in the test set. When some prediction results are consistent with the labels in the test set and others are inconsistent, the texts whose prediction results are consistent are selected, added to the training set, and deleted from the test set. The text classification model is then trained again with the new training set, until the prediction results are completely consistent with the labels in the test set, at which point the final classification model is obtained.

In summary, according to the method for obtaining the text classification model provided by the application, the target category identification corresponding to the target text is determined by matching the target text with the keywords in the keyword library, and the artificial intelligence model is trained through the target text containing the target category identification to obtain the text classification model. The method provided by the embodiment of the application can thus determine the target category identification of the target text through the keywords, without manual labeling of the sample data. On the one hand, this reduces the cost of training the model and increases the training speed. On the other hand, a large amount of sample data for training can be determined through the keywords, so that the quality of the trained text classification model is guaranteed.

According to the method for obtaining the text classification model provided by the embodiment, the embodiment of the application provides a device for obtaining the text classification model.

Referring to fig. 3, this figure is a schematic diagram of an apparatus for obtaining a text classification model according to an embodiment of the present application.

As shown in Fig. 3, the apparatus for obtaining a text classification model provided in the embodiment of the present application includes:

a selection module 100, configured to select a target keyword included in a target text from a keyword library, where the keyword library includes a plurality of keywords, and each keyword corresponds to a text category;

a determining module 200, configured to determine a target category identifier corresponding to the target text according to the target keyword; the target category identification is used for indicating a text category of the target text;

the training module 300 is configured to train the artificial intelligence model through a target text including a target category identifier to obtain a text classification model, where the text classification model is used to classify the text.

As a possible implementation, the training module comprises: the classification unit is used for classifying the target texts containing the target category identifications into a training set and a test set; the training set training unit is used for training the artificial intelligence model through a training set to obtain a trained artificial intelligence model; and the test unit tests the artificial intelligence model through the test set and determines the artificial intelligence model passing the test as a text classification model.

As a possible implementation, the test unit is specifically configured to: inputting a first text in the test set into the trained artificial intelligence model to obtain a first class identifier corresponding to the first text; judging whether the target category identification corresponding to the first text is consistent with the first category identification; when the target category identification corresponding to the first text is consistent with the first category identification, transferring the first text from the test set to the training set; when the test set does not contain text, the artificial intelligence model is determined as a text classification model.

In summary, the device for obtaining the text classification model determines the target category identification corresponding to the target text by matching the target text with the keywords in the keyword library, and trains the artificial intelligence model through the target text containing the target category identification to obtain the text classification model. The device provided by the embodiment of the application can thus determine the target category identification of the target text through the keywords, without manual labeling of the sample data. On the one hand, this reduces the cost of training the model and increases the training speed. On the other hand, a large amount of sample data for training can be determined through the keywords, so that the quality of the trained text classification model is guaranteed.

From the above description of the embodiments, it is clear to those skilled in the art that all or part of the steps in the method of the above embodiments may be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solutions of the present application or portions contributing to the prior art may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway, etc.) to execute the method described in the embodiments or some portions of the embodiments of the present application.

It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. The method disclosed by the embodiment corresponds to the system disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the system part for description.

It should also be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.

The foregoing description of the disclosed embodiments enables those skilled in the art to make or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

It should be noted that the method and the device for obtaining the text classification model provided by the invention can be used in the fields of artificial intelligence, blockchain, distributed computing, cloud computing, big data, internet of things, mobile internet, network security, chips, virtual reality, augmented reality, holography, quantum computing, quantum communication, quantum measurement, digital twins, and finance. The above is only an example, and does not limit the application field of the method and apparatus for obtaining the text classification model provided by the present invention.

Claims

1. A method for obtaining a text classification model is characterized by comprising the following steps:

2. The method of claim 1, wherein training an artificial intelligence model through a target text containing the target class identifier to obtain a text classification model comprises:

dividing the target text containing the target category identification into a training set and a test set;

3. The method of claim 2, wherein the testing the artificial intelligence model through the text of the test set, and determining the artificial intelligence model passing the test as a text classification model comprises:

4. The method of claim 3, wherein the inputting the first text in the test set into the trained artificial intelligence model, and obtaining the first category identifier corresponding to the first text comprises:

5. The method of claim 3, further comprising:

inputting a second text in the test set into the trained artificial intelligence model to obtain a second category identifier corresponding to the second text;

judging whether the target class identification corresponding to the second text is consistent with the second class identification;

6. The method of claim 5, wherein the second text in the test set is input into the trained artificial intelligence model to obtain a second category identifier corresponding to the second text;

and when the confidence corresponding to the second category identification is greater than a preset threshold value, judging whether the target category identification corresponding to the second text is consistent with the second category identification.

7. The method of claim 1, wherein training an artificial intelligence model through a target text containing the target class identifier to obtain a text classification model comprises:

8. An apparatus for obtaining a text classification model, comprising:

9. The apparatus of claim 8, wherein the training module comprises:

10. The apparatus of claim 9, wherein the test unit is specifically configured to: