CN111831823B - Corpus generation and model training method - Google Patents

Corpus generation and model training method

Info

Publication number
CN111831823B
CN111831823B (application CN202010664773.5A)
Authority
CN
China
Prior art keywords
corpus
sample
existing
target
corpora
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010664773.5A
Other languages
Chinese (zh)
Other versions
CN111831823A (en)
Inventor
李林峰
黄海荣
孔晓泉
董泽朝
宋寒风
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ecarx Hubei Tech Co Ltd
Original Assignee
Ecarx Hubei Tech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ecarx Hubei Tech Co Ltd
Priority to CN202010664773.5A
Publication of CN111831823A
Application granted
Publication of CN111831823B
Legal status: Active

Classifications

    • G06F 16/355 — Class or cluster creation or modification
    • G10L 15/063 — Training of speech recognition systems
    • G10L 15/16 — Speech classification or search using artificial neural networks
    • G10L 15/1822 — Parsing for meaning understanding
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 25/30 — Speech or voice analysis using neural networks
    • G10L 25/63 — Speech or voice analysis for estimating an emotional state
    • G10L 2015/0631 — Creating reference templates; clustering
    • G10L 2015/223 — Execution procedure of a spoken command
    • G10L 2015/225 — Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention provide a corpus generation and model training method in the technical field of artificial intelligence, comprising the following steps: obtaining the existing number of existing sample corpora belonging to each target corpus category and, for each target corpus category, taking a baseline sample number as a reference, adjusting the existing sample corpora according to the word slots they contain to generate new sample corpora, so that the sum of the generated new sample corpora and the existing sample corpora corresponding to the category reaches the baseline sample number. Generating corpora with this scheme balances the total number of existing and new sample corpora across target corpus categories, so that a classification model trained on them shows only a small difference in classification accuracy between corpus categories.

Description

Corpus generation and model training method
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a corpus generation and model training method.
Background
Human-computer interaction is increasingly widespread: a user may interact with an electronic device through corpora such as text and voice in order to obtain the services the device provides. The device therefore needs to classify the user's corpus so that it can provide services according to the classification result.
For example, classifying a corpus may mean identifying the user intent it represents, such as a desire to listen to music or to shop; it may also mean determining the mood it represents, such as excitement or anger.
In the prior art, a neural-network-based classification model is generally used to classify corpora, so sample corpora must be collected in advance to train the classification model. For the model to classify corpora of all categories accurately, sample corpora must be collected for each corpus category. However, the number of sample corpora that can be collected often differs between categories, so the accuracy with which the trained classification model classifies corpora differs greatly between categories.
Disclosure of Invention
Embodiments of the invention aim to provide a corpus generation and model training method such that the accuracy with which the trained classification model classifies corpora differs only slightly between corpus categories.
The specific technical scheme is as follows:
in a first aspect, an embodiment of the present invention provides a corpus generation method, the method including:
obtaining the existing number of existing sample corpora belonging to each target corpus category, where the existing sample corpora are: existing corpora used to train a classification model that is adapted to a target application scenario and classifies corpora; and the target corpus categories are: corpus categories set for the target application scenario and matched with the classification results of the classification model;
and for each target corpus category, taking a baseline sample number as a reference, adjusting the existing sample corpora according to the word slots contained in the existing sample corpora belonging to that category to generate new sample corpora, so that the sum of the generated new sample corpora and the existing sample corpora corresponding to the category reaches the baseline sample number.
In an embodiment of the present invention, adjusting the existing sample corpora according to the word slots they contain to generate new sample corpora includes:
generating new sample corpora according to at least one of a first, second, and third generation mode:
the first generation mode: replacing a word slot contained in an existing sample corpus belonging to the target corpus category with first information to generate a new sample corpus, where the first information is: a preset word slot of the same word-slot category as the word slot contained in the existing sample corpus and not itself contained in the existing sample corpus;
the second generation mode: deleting second information from an existing sample corpus belonging to the target corpus category to generate a new sample corpus, where the second information is: non-key characters other than the word slots and existing vocabulary contained in the existing sample corpus, the existing vocabulary being: the vocabulary in the existing sample corpus other than the word slots it contains;
the third generation mode: adding third information at positions outside the word slots and existing vocabulary contained in an existing sample corpus belonging to the target corpus category to generate a new sample corpus, where the third information is: preset characters that do not affect the semantics expressed by the corpus.
In an embodiment of the present invention, generating new sample corpora according to at least one of the first, second, and third generation modes includes:
generating new sample corpora according to the first generation mode;
judging whether the sum of the generated new sample corpora and the existing sample corpora corresponding to the target corpus category reaches the baseline sample number;
and if not, generating new sample corpora according to at least one of the second and third generation modes.
In an embodiment of the present invention, generating new sample corpora according to at least one of the second and third generation modes includes:
generating new sample corpora according to the second generation mode;
judging whether the sum of the generated new sample corpora and the existing sample corpora corresponding to the target corpus category reaches the baseline sample number;
and if not, generating new sample corpora according to the third generation mode.
In an embodiment of the present invention, adjusting the existing sample corpora according to the word slots they contain so that the sum of the generated new sample corpora and the existing sample corpora corresponding to the target corpus category reaches the baseline sample number includes:
adjusting the existing sample corpora belonging to the target corpus category according to the word slots they contain, to generate new sample corpora;
judging whether the sum of the generated new sample corpora and the existing sample corpora corresponding to the target corpus category reaches the baseline sample number;
and if not, copying sample corpora belonging to the target corpus category, so that the sum of the copied corpora, the generated new sample corpora, and the existing sample corpora belonging to the category reaches the baseline sample number.
In one embodiment of the present invention, the baseline sample number is obtained by:
for each of several different preset numbers and for each target corpus category, training an initial model of the classification model using that preset number of sample corpora belonging to the category, and obtaining the accuracy with which the trained classification model classifies corpora belonging to the category;
obtaining the preset number for which the accuracy of every target corpus category reaches a preset accuracy, and taking it as a reference sample corpus number;
and taking the minimum of the reference sample corpus numbers, or any number larger than that minimum, as the baseline sample number.
In a second aspect, an embodiment of the present invention provides a model training method, the method including:
obtaining sample corpora, the sample corpora including: existing sample corpora and new sample corpora generated from the existing sample corpora, where the new sample corpora are: new sample corpora generated according to the method of any one of the first aspect;
and training a classification model for classifying corpora using the sample corpora.
In an embodiment of the present invention, the initial network parameters of the classification model are: network parameters obtained by training a neural network model for classification on corpora in a general corpus, where the general corpus is: a corpus that does not contain the sample corpora.
In a third aspect, an embodiment of the present invention provides an electronic device comprising a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with each other through the communication bus;
the memory is used for storing a computer program;
the processor is used for implementing the method steps of any one of the first aspect when executing the program stored in the memory.
In a fourth aspect, an embodiment of the present invention provides an electronic device comprising a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with each other through the communication bus;
the memory is used for storing a computer program;
the processor is used for implementing the method steps of the second aspect when executing the program stored in the memory.
In a fifth aspect, the present invention provides a computer-readable storage medium having a computer program stored therein which, when executed by a processor, implements the method steps of any one of the first aspect.
In a sixth aspect, the present invention provides a computer-readable storage medium having a computer program stored therein which, when executed by a processor, implements the method steps of the second aspect.
In a seventh aspect, an embodiment of the present invention further provides a computer program product including instructions which, when run on a computer, cause the computer to perform the method steps of any one of the first aspect.
In an eighth aspect, an embodiment of the present invention further provides a computer program product including instructions which, when run on a computer, cause the computer to perform the method steps of the second aspect.
Embodiments of the invention have the following beneficial effects:
in the scheme provided by the embodiments of the invention, the existing number of existing sample corpora belonging to each target corpus category is obtained, and for each category, taking the baseline sample number as a reference, the existing sample corpora are adjusted according to the word slots they contain to generate new sample corpora. After the new sample corpora are generated, the sum of the existing and new sample corpora belonging to each target corpus category reaches the baseline sample number; that is, the per-category totals are close to one another. Training the classification model on the existing and new sample corpora together therefore yields a model whose classification accuracy differs little between corpus categories.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic flowchart of a first corpus generation method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a second corpus generation method according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of a baseline sample number obtaining method according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of a model training method according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of another electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
In the prior art, the number of sample corpora that can be collected often differs between corpus categories. After a classification model is trained on such a collection, its classification accuracy therefore differs greatly between categories. To solve this problem, embodiments of the invention provide a corpus generation and model training method.
In an embodiment of the present invention, a corpus generation method is provided, the method including:
obtaining the existing number of existing sample corpora belonging to each target corpus category, where the existing sample corpora are: existing corpora used to train a classification model that is adapted to a target application scenario and classifies corpora; and the target corpus categories are: corpus categories set for the target application scenario and matched with the classification results of the classification model;
and for each target corpus category, taking the baseline sample number as a reference, adjusting the existing sample corpora according to the word slots they contain to generate new sample corpora, so that the sum of the generated new sample corpora and the existing sample corpora corresponding to the category reaches the baseline sample number.
After the new sample corpora are generated, the sum of the existing and new sample corpora belonging to each target corpus category reaches the baseline sample number; that is, the per-category totals are close to one another. Training the classification model on the existing and new sample corpora together therefore yields a model whose classification accuracy differs little between corpus categories.
In another embodiment of the present invention, a model training method is provided, the method including:
obtaining sample corpora, the sample corpora including: existing sample corpora and new sample corpora generated from the existing sample corpora according to the corpus generation method above;
and training a classification model for classifying corpora using the sample corpora.
As can be seen, in the scheme provided by the embodiments of the invention, the sample corpora used to train the classification model include not only existing sample corpora but also new sample corpora generated from them. Because the sum of the new and existing sample corpora of each target corpus category reaches the baseline sample number, the numbers of sample corpora across categories are balanced, and a classification model trained on them shows only a small difference in classification accuracy between corpus categories.
The corpus generation and model training methods provided by embodiments of the invention are explained below through specific embodiments.
Referring to Fig. 1, an embodiment of the present invention provides a flowchart of a first corpus generation method, the method including steps S101 and S102.
S101: the existing number of existing sample corpora belonging to each target corpus category is obtained.
Here, the existing sample corpora are: existing corpora used to train a classification model that is adapted to a target application scenario and classifies corpora. Since sample corpora are used to train the classification model, and the trained classification model can be applied in different target application scenarios, the target application scenario can be understood as: the scenario in which the classification model is applied.
Specifically, a classification model trained on corpora suited to the target application scenario is better suited to classifying corpora in that scenario.
For example, the target application scenario may be identifying the song a user desires to hear, in which case the existing sample corpora are corpora representing songs users desire to hear. A classification model trained on such corpora is better suited to identifying the song a user desires to hear.
The target corpus categories are: corpus categories set for the target application scenario and matched with the classification results of the classification model adapted to that scenario.
Classification models applied in different application scenarios differ, and so do their classification results. For a classification model to output accurate classification results, it must be trained on sample corpora matched to those results, so that it can learn the features of the sample corpora corresponding to each result. The target corpus categories set for different application scenarios therefore often differ, and each target corpus category corresponds to one classification result of the classification model.
For example, if the target application scenario is identifying the user's emotion, the classification results of the model may be: happiness, sadness, anger, excitement, calm, and so on. The sample corpora then need to include corpora representing each of these emotions so that the model can learn their features; the sample corpora can be divided into target corpus categories according to the emotion they represent, and the target corpus categories may include happiness, sadness, anger, excitement, calm, and so on.
Similarly, if the target application scenario is identifying the user's intent, the classification results of the model may be: the user desires to listen to music, the user desires to watch a movie, the user desires to shop, and so on. The sample corpora can be divided into target corpus categories according to the user intent they represent, and the target corpus categories may include the user desires to listen to music, the user desires to watch a movie, the user desires to shop, and so on.
The same existing sample corpus may represent different meanings in different target application scenarios, so the target corpus category it belongs to changes with the scenario; the target corpus categories correspond to the target application scenario. For example, for the sample corpus "I want to listen to Zhou Jielun's Qilixiang": in an application scenario that identifies user intent, its target corpus category is "the user desires to listen to a song", while in a scenario that identifies user emotion, its target corpus category is "calm".
S102: and aiming at each target corpus category, taking the number of the baseline samples as a reference, adjusting the existing sample corpus according to word slots contained in the existing sample corpus belonging to the target corpus category to generate a new sample corpus, so that the sum of the number of the generated new sample corpus and the number of the existing sample corpus corresponding to the target corpus category reaches the number of the baseline samples.
Specifically, since the number of characters included in one sentence in the daily term is not too large, when the number of characters included in the sample corpus is greater than the preset maximum number of characters, only the preset maximum number of characters in the sample corpus may be retained, so as to reduce the number of characters included in the sample corpus and reduce the amount of calculation in the process of generating a new sample corpus.
For example, the preset maximum number of characters may be 70, 75, etc.
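The truncation step can be sketched in a few lines of Python (the constant and function names are illustrative, not from the patent):

```python
MAX_CHARS = 70  # preset maximum character count (70 or 75 in the examples)

def truncate_corpus(corpus: str, max_chars: int = MAX_CHARS) -> str:
    """Keep only the first max_chars characters of an over-long sample corpus."""
    return corpus if len(corpus) <= max_chars else corpus[:max_chars]
```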
Here, the word slots contained in an existing sample corpus are: character strings in the existing sample corpus that can represent information-search conditions.
A word slot may be a slot already marked in the sample corpus.
For example, a word slot may be a singer-name slot, such as "Zhou Jielun" or "Liu Dehua", or a city-name slot, such as "Beijing" or "Shanghai".
The word slots contained in an existing sample corpus may also be: word slots identified in the corpus as matching preset word slots.
For example, a preset song-name slot may include "Qilixiang", "Forget-Love Water", and so on; the word slots in an existing sample corpus can be identified against the preset slots, so that if the corpus contains "Qilixiang", "Qilixiang" is identified as a word slot of that corpus, as sketched below.
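A minimal sketch of identifying word slots by substring matching against a preset slot lexicon (the data structure and names are illustrative assumptions):

```python
# Preset word-slot lexicon: slot category -> known slot values (illustrative values).
PRESET_SLOTS = {
    "song_name": ["Qilixiang", "Forget-Love Water"],
    "singer_name": ["Zhou Jielun", "Liu Dehua"],
}

def identify_slots(corpus: str) -> list[tuple[str, str]]:
    """Return (slot_category, slot_value) pairs found in the corpus."""
    found = []
    for category, values in PRESET_SLOTS.items():
        for value in values:
            if value in corpus:
                found.append((category, value))
    return found

# identify_slots("Play Qilixiang") -> [("song_name", "Qilixiang")]
```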
In addition, adjusting an existing sample corpus to generate a new sample corpus means the existing sample corpus is retained: the existing sample corpus is adjusted to produce a new one, which increases the total number of sample corpora.
In an embodiment of the present invention, new sample corpora may be generated according to at least one of the following first to third generation modes.
First generation mode: replace a word slot contained in an existing sample corpus belonging to the target corpus category with first information to generate a new sample corpus.
Here, the first information is: a preset word slot of the same word-slot category as the word slot contained in the existing sample corpus, and not itself contained in the existing sample corpus.
Word slots of the same word-slot category represent semantics of the same kind: for example, "Kung Pao chicken" and "three-cup chicken" both belong to the dish-name slot category and both denote dish names. Replacing a word slot in an existing sample corpus with a preset word slot of the same category therefore has little effect on the semantics of the generated new sample corpus.
For example, in the existing sample corpus "I want to eat Kung Pao chicken", "Kung Pao chicken" is a word slot whose category is the dish-name slot. Preset dish-name slots other than "Kung Pao chicken", such as "three-cup chicken" or "Di San Xian", can then replace "Kung Pao chicken" in the existing sample corpus, generating the new sample corpora "I want to eat three-cup chicken", "I want to eat Di San Xian", and so on.
Second generation mode: delete second information from an existing sample corpus belonging to the target corpus category to generate a new sample corpus.
Here, the second information is: non-key characters other than the word slots and existing vocabulary contained in the existing sample corpus, where the existing vocabulary is: the vocabulary in the existing sample corpus other than the word slots it contains.
Specifically, the existing sample corpus may be segmented into words to determine the vocabulary it contains.
Because word slots and vocabulary carry the specific semantics of the existing sample corpus, deleting them would greatly affect, or even change, the semantics of the corpus. Deleting only the non-key characters outside the word slots and vocabulary therefore generates a new sample corpus whose semantics differ little from those of the existing one.
In addition, the more characters are deleted, the greater the effect on the semantics of the generated new sample corpus, so the number of characters in the second information may be kept below a preset character count to avoid a large semantic change; for example, the preset character count may be 2.
The characters in the second information may or may not be adjacent in the existing sample corpus.
For example, take the existing sample corpus "play a Liu Dehua song, Forget-Love Water", in which "Liu Dehua" belongs to the singer-name slot, "Forget-Love Water" belongs to the song-name slot, and "song" is a vocabulary word. The second information may then include one or more of the remaining non-key characters, such as "play", "a", or "of".
Thus if the second information is the character "a", deleting it generates the new sample corpus "play Liu Dehua song, Forget-Love Water"; if the second information is the character "play", deleting it generates "a Liu Dehua song, Forget-Love Water"; and the two may also be deleted together. As these examples show, after the second information is deleted, the semantics of the generated new sample corpus remain similar to those of the existing sample corpus.
In addition, when the characters of a sample corpus include punctuation marks, the punctuation marks may be deleted first as second information, since they have little effect on the semantics of the sample corpus.
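A minimal sketch of the second generation mode, assuming the protected spans (word slots and vocabulary) are given and recognizing only ASCII punctuation for the punctuation-first rule (all names are illustrative):

```python
import string

MAX_DELETED = 2  # preset character count bounding the second information

def generate_by_deletion(corpus: str, protected_spans: list[str]) -> list[str]:
    """Second generation mode: delete non-key characters that lie outside the
    word slots and vocabulary words (protected_spans), punctuation first."""
    protected = set()
    for span in protected_spans:
        start = corpus.find(span)
        if start >= 0:
            protected.update(range(start, start + len(span)))
    deletable = [i for i in range(len(corpus)) if i not in protected]
    deletable.sort(key=lambda i: corpus[i] not in string.punctuation)  # punctuation first
    new_corpora = []
    for i in deletable[:MAX_DELETED]:  # each variant deletes one non-key character
        new_corpora.append(corpus[:i] + corpus[i + 1:])
    return new_corpora
```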
Third generation mode: add third information at positions outside the word slots and existing vocabulary contained in an existing sample corpus belonging to the target corpus category, to generate a new sample corpus.
Here, the third information is: preset characters that do not affect the semantics expressed by the corpus.
Specifically, the existing sample corpus may be segmented into words to determine the vocabulary it contains.
Because word slots and vocabulary carry the specific semantics of the existing sample corpus, inserting third information inside them would affect the semantics they represent and hence greatly affect the semantics of the corpus. Adding the third information only at positions outside the word slots and vocabulary therefore keeps the semantics of the generated new sample corpus close to those of the existing sample corpus.
A position outside the word slots and vocabulary may be the head or the tail of the existing sample corpus.
Specifically, the third information that does not affect the semantics expressed by the corpus may be a stop word, such as "okay" or "could you", a modal particle, such as "ba" or "la", or some other form.
For example, the existing sample corpus "play Liu Dehua's Forget-Love Water" contains the singer-name slot "Liu Dehua" and the song-name slot "Forget-Love Water", so the third information is added at a position other than "Liu Dehua" and "Forget-Love Water". The third information may be the stop word "okay", appended to the tail of the existing sample corpus to generate the new sample corpus "play Liu Dehua's Forget-Love Water, okay"; or it may be the modal particle "ba", appended to the tail to generate the new sample corpus "play Liu Dehua's Forget-Love Water ba".
When new sample corpora are generated in multiple modes, the modes have different priorities. In one implementation of the invention, the first generation mode may be given the highest priority.
That is, for each target corpus category, new sample corpora are first generated by the first generation mode. It is then judged whether the sum of the generated new sample corpora and the existing sample corpora corresponding to the category reaches the baseline sample number; if not, new sample corpora are generated using at least one of the second and third generation modes.
Specifically, the positions of the word slots in the existing sample corpora are known, the preset word slots are also known, and replacing a word slot at a known position with a known preset word slot has little effect on the semantics of the existing sample corpus, so new sample corpora can be generated preferentially by the first generation mode. If, after the first generation mode, the number of sample corpora in a target corpus category still does not reach the baseline sample number, new sample corpora are generated by at least one of the second and third generation modes.
In addition, a generated new sample corpus may itself be treated as an existing sample corpus: when new sample corpora continue to be generated by other modes, generation continues on the basis of this enlarged set.
In another implementation of the invention, the first generation mode may be given the highest priority, the second generation mode the next highest, and the third generation mode the lowest.
That is, after new sample corpora are generated by the first generation mode, if the sum of the generated new sample corpora and the existing sample corpora corresponding to the target corpus category has not reached the baseline sample number, new sample corpora are generated by the second generation mode; it is then judged again whether the sum reaches the baseline sample number, and if not, new sample corpora are generated by the third generation mode.
When deleting second information in the second generation mode, only the second information to be deleted needs to be determined, after which it is deleted from the existing sample corpus. When adding third information in the third generation mode, in addition to determining the third information to be added, the position at which to add it must also be determined.
Adjusting a sample corpus by the third generation mode therefore requires more operations, and each operation affects the semantics of the existing sample corpus. Adjusting by the second generation mode affects the semantics less, so existing sample corpora are adjusted preferentially by the second generation mode to generate new sample corpora; the third generation mode is used only when, after the second generation mode, the number of existing and new sample corpora still does not reach the baseline sample number. A sketch of this prioritized pipeline follows.
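A minimal sketch of the prioritized augmentation loop, reusing the illustrative helpers above; for brevity it protects only slot spans during deletion, though the full scheme would protect vocabulary words as well:

```python
def augment_category(existing: list[str], baseline: int) -> list[str]:
    """Grow one target corpus category toward the baseline sample number,
    trying the three generation modes in priority order."""
    samples = list(existing)
    generators = [
        generate_by_slot_replacement,                                        # mode 1
        lambda c: generate_by_deletion(c, [v for _, v in identify_slots(c)]),  # mode 2
        generate_by_insertion,                                               # mode 3
    ]
    for generate in generators:
        if len(samples) >= baseline:
            break
        for corpus in list(samples):  # generated corpora are treated as existing ones
            samples.extend(generate(corpus))
            if len(samples) >= baseline:
                break
    return samples[:baseline]
```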
As can be seen, after the new sample corpora are generated, the sum of the existing and new sample corpora in each target corpus category reaches the baseline sample number; the per-category totals are therefore close to one another. Training the classification model on the existing and new sample corpora together then yields a model whose classification accuracy differs little between corpus categories.
Moreover, besides being close to one another, the per-category totals each reach at least the baseline sample number; that is, every target corpus category contains many sample corpora, so a classification model trained on them classifies corpora with high accuracy.
Referring to Fig. 2, a flowchart of a second corpus generation method is provided. Compared with the embodiment of Fig. 1, step S102 can be implemented by steps S102A-S102C.
S102A: for each target corpus category, the existing sample corpora belonging to the category are adjusted according to the word slots they contain, to generate new sample corpora.
Specifically, step S102A can be implemented by the first to third generation modes described above, which are not repeated here.
S102B: it is judged whether the sum of the generated new sample corpora and the existing sample corpora corresponding to the target corpus category reaches the baseline sample number.
Specifically, generated new sample corpora are not real corpora, and a sample corpus with wrong semantics may be produced during generation. If the sum of the generated new sample corpora and the existing sample corpora of the category has already reached the baseline sample number, no further new sample corpora are generated, to avoid producing semantically wrong ones. Otherwise, step S102C is executed.
S102C: if the judgment in step S102B is negative, sample corpora belonging to the target corpus category are copied, so that the sum of the copied corpora, the generated new sample corpora, and the existing sample corpora of the category reaches the baseline sample number.
Here, the sample corpora belonging to the target corpus category include: the generated new sample corpora and the existing sample corpora of the category.
Sample corpora of the category may be selected at random and copied one or more times until the sum of the generated new sample corpora and the existing sample corpora of the category reaches the baseline sample number.
For example, if the baseline sample number is 500 and the generated new sample corpora plus the existing sample corpora of the category total 300, then 200 corpora may be selected at random from them and copied once.
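A minimal sketch of the replication step (the function name is illustrative and the input list is assumed non-empty):

```python
import random

def pad_by_copying(samples: list[str], baseline: int) -> list[str]:
    """Randomly copy corpora of a category until it reaches the baseline
    sample number; copying changes no characters, hence no semantics."""
    padded = list(samples)
    while len(padded) < baseline:
        shortfall = baseline - len(padded)
        padded.extend(random.sample(samples, min(shortfall, len(samples))))
    return padded

# pad_by_copying(<300 corpora>, 500) copies 200 randomly chosen corpora once.
```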
As can be seen, when adjusting the existing sample corpora according to their word slots does not bring the sum of the generated new sample corpora and the existing sample corpora of a target corpus category up to the baseline sample number, copying sample corpora of the category makes up the difference. Copying changes no characters of the sample corpora and therefore does not change their semantics.
Referring to Fig. 3, an embodiment of the present invention provides a flowchart of a method for obtaining the baseline sample number, which can be obtained through the following steps S301-S303.
S301: for each of several different preset numbers and for each target corpus category, an initial model of the classification model is trained using that preset number of sample corpora belonging to the category, and the accuracy with which the trained classification model classifies corpora belonging to the category is obtained.
For example, the preset numbers may be 100, 200, 300, 400, and 500.
Taking the preset number 100 as an example: the sample corpora contain 100 corpora for each target corpus category; the classification model is trained on them, the trained model is tested on test corpora, and the accuracy with which it identifies corpora of each target corpus category is obtained.
S302: the preset number for which the accuracy of every target corpus category reaches a preset accuracy is obtained and taken as a reference sample corpus number.
That the accuracy of every target corpus category reaches the preset accuracy means that a classification model trained with this preset number of sample corpora per category classifies corpora of every category well.
For example, the preset accuracy may be 95%, 97%, and so on.
S303: the minimum of the reference sample corpus numbers, or any number larger than that minimum, is taken as the baseline sample number.
There may be several preset numbers for which the accuracy of every target corpus category reaches the preset accuracy; for example, every preset number from 400 to 500 may satisfy the condition.
Since a model trained with a reference sample corpus number of samples per category already classifies corpora of every category well, the baseline sample number only needs to be at least the minimum reference sample corpus number.
However, a large baseline sample number requires generating more new sample corpora, and generated corpora are not real ones, so some may have semantic errors; the more corpora are generated, the more semantically wrong ones there may be. Choosing the minimum reference sample corpus number as the baseline sample number therefore both gives the trained classification model a good classification effect on every target corpus category and reduces the number of semantically wrong generated corpora.
As can be seen, with the baseline sample number obtained this way, the number of sample corpora in every target corpus category reaches a count at which every category's accuracy reaches a high preset accuracy, so the trained classification model classifies corpora of every category well. A sketch of this search follows.
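A minimal sketch of the baseline search in steps S301-S303; train_and_eval stands in for whatever training and testing pipeline is used and is an assumption, not part of the patent:

```python
PRESET_NUMBERS = [100, 200, 300, 400, 500]
PRESET_ACCURACY = 0.95

def find_baseline(categories: list[str], train_and_eval) -> int:
    """Return the smallest preset number at which every target corpus
    category reaches the preset accuracy (S303 takes the minimum)."""
    for n in PRESET_NUMBERS:  # ascending, so the first hit is the minimum
        per_category_acc = train_and_eval(samples_per_category=n)  # S301
        if all(per_category_acc[c] >= PRESET_ACCURACY for c in categories):  # S302
            return n
    raise ValueError("no preset number reached the preset accuracy")
```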
Referring to Fig. 4, a flowchart of a model training method is provided, the method including steps S401-S402.
S401: sample corpora are obtained.
Here, the sample corpora include: existing sample corpora and new sample corpora generated from them.
The new sample corpora are: new sample corpora generated according to any of the corpus generation methods above.
The target application scenario and target corpus categories involved in generating the new sample corpora are described below.
In this embodiment, the target application scenario can be understood as the scenario in which the classification model trained by this scheme is applied. The target corpus categories are the possible corpus categories obtained when classifying corpora with that classification model.
In addition, the existing sample corpora may include corpora belonging to different target corpus categories.
Specifically, classification models applied in different target application scenarios produce different classification results: for example, a model applied in a user-intent-recognition scenario outputs results such as "the user desires to listen to music" or "the user desires to watch a movie", while a model applied in a user-emotion-recognition scenario outputs results such as happy or sad.
The sample corpora are used to train the classification model, and the trained model is applied in a particular target application scenario, so sample corpora suited to that scenario must be selected for training the model.
Moreover, the classification model classifies corpora into different target corpus categories; for it to classify corpora of every category well, the sample corpora used in training must include corpora of every target corpus category, that is, the existing sample corpora include corpora belonging to different target corpus categories.
The method of generating the new sample corpora is the same as in the embodiments of Figs. 1-2 and is not repeated here.
S402: and training a classification model for classifying the corpora by using the sample corpora.
In an embodiment of the present invention, the initial network parameters of the classification model may be: and training the neural network model for classification based on the corpus in the general corpus to obtain network parameters.
Specifically, the general corpus includes corpora suitable for different application scenarios.
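A minimal sketch of such initialization, using PyTorch as an illustrative framework (the checkpoint file name is an assumption):

```python
import torch

def init_from_general_corpus(model: torch.nn.Module,
                             checkpoint_path: str = "general_corpus_pretrain.pt") -> torch.nn.Module:
    """Load network parameters pre-trained on a general corpus as the
    classification model's initial parameters, then fine-tune on sample corpora."""
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state_dict, strict=False)  # tolerate task-specific heads
    return model
```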
In an embodiment of the invention, the classification model may be a corpus-intended classification model. The corpus intent classification model comprises a convolution layer, a pooling layer, a fusion layer, a full-link layer and an output layer.
Specifically, the convolutional layer is used for extracting features in the corpus to be processed. The method may include a plurality of convolutional layers, and perform feature extraction on the character strings with different lengths in the corpus to be processed, for example, the method may include 3 convolutional layers, and perform feature extraction on the character strings with lengths of 3, 4, and 5 in the corpus to be processed, respectively.
And inputting the features in the linguistic data to be processed extracted by the convolutional layer into a pooling layer, wherein the pooling layer is used for removing redundant features in the received features and down-sampling the received features. The down-sampling method may be maximum value sampling, average value sampling, random sampling, or the like. The number of the pooling layers may be the same as the number of the convolutional layers, and the pooling layers correspond to the convolutional layers one to one, for example, 3 convolutional layers may be included, the features of the character strings with the lengths of 3, 4, and 5 in the corpus to be processed are extracted, and the features of the extracted character strings with the lengths of 3, 4, and 5 are pooled.
Inputting the pooled features of the pooling layers into a fusion layer, and fusing the pooled features of the pooling layers into a fusion feature, wherein the fusion feature can be expressed in an array form, and the array can be a one-dimensional array.
Inputting the fusion characteristics obtained from the fusion layer into the full-link layer, and obtaining the probability that the corpus to be processed belongs to each preset intention, for example, the probability that the corpus to be processed belongs to the intention of listening to songs is 70%, the probability of seeing a movie is 10%, the probability of eating an intention is 20%, and the like.
And determining the intention represented by the to-be-processed corpus according to the probability input and output layer of each preset intention of the to-be-processed corpus obtained by the full connection layer, wherein the preset intention with the highest probability corresponding to the to-be-processed corpus can be determined as the intention represented by the to-be-processed corpus.
In addition, before the corpus to be processed is input into the convolutional layer, it may first be input into an input layer, which converts each character into a numeric index number, producing an index-number array that represents the corpus. The correspondence between characters and numeric index numbers is preset; for example, the character "I" may correspond to index number 1, "you" to index number 2, and so on.
The index-number array obtained by the input layer is input into a word vector layer, which converts each numeric index number into a word vector, producing the corpus vector of the corpus to be processed; this corpus vector is then input into the convolutional layer.
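To make the layer layout above concrete, here is a minimal PyTorch sketch of such a corpus intent classification model: an input mapping from characters to preset index numbers, a word vector (embedding) layer, three convolutional layers for string lengths 3, 4, and 5, maximum-value pooling, fusion by concatenation into a one-dimensional array, a fully connected layer, and a softmax output. The vocabulary size, embedding width, filter count, and character mapping are placeholder assumptions.

```python
import torch
import torch.nn as nn

class IntentTextCNN(nn.Module):
    """Sketch of the corpus intent classification model described above."""
    def __init__(self, vocab_size=5000, embed_dim=128, num_intents=3,
                 kernel_sizes=(3, 4, 5), num_filters=64):
        super().__init__()
        # Word vector layer: maps each numeric index number to a vector.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One convolutional layer per character-string length (3, 4 and 5).
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        # Fully connected layer: fused features -> score per preset intent.
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_intents)

    def forward(self, index_array):            # (batch, seq_len) of indices
        x = self.embedding(index_array)        # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                  # Conv1d wants (batch, dim, len)
        # Pooling layers: maximum-value down-sampling, one per convolution.
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        # Fusion layer: concatenate into one one-dimensional feature array.
        fused = torch.cat(pooled, dim=1)
        # Output: probability of each preset intent.
        return torch.softmax(self.fc(fused), dim=1)

# Input layer: preset character-to-index correspondence (illustrative).
char_to_index = {"i": 1, "w": 2, "a": 3, "n": 4, "t": 5}
indices = torch.tensor([[char_to_index[c] for c in "iwant"]])
probs = IntentTextCNN()(indices)
print(probs)  # the preset intent with the highest probability is chosen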
In another embodiment of the present invention, the classification model may instead be a named entity recognition model, which includes a bidirectional LSTM (Long Short-Term Memory) network, a fully connected layer, a CRF (Conditional Random Field) layer, and an output layer.
Specifically, the bidirectional LSTM network extracts the logical relationships between the characters in the corpus to be processed. The same characters convey different meanings in different orders: "i love you" and "you love me" contain the same characters yet express different meanings, so character order must be considered when determining the named entities in a corpus. A bidirectional LSTM is therefore used: one LSTM pass processes the characters from front to back and another from back to front, so the processing result reflects the character order in both directions.
The fully connected layer uses the front-to-back logical relationships extracted by the bidirectional LSTM network to determine, for each character in the corpus to be processed, the preset named entity categories to which that character may belong. For example, with preset categories such as singer name, song name, and dish name, and the corpus "I want to listen to Liu Dehua's Water", the character "water" may belong to both the song name category and the dish name category, so the number of preset named entity categories to which it may belong is 2.
The per-character category candidates obtained by the fully connected layer are input into the CRF layer, which performs Viterbi decoding to find the highest-scoring combination of named entity categories across the characters, thereby determining the category to which each character belongs; adjacent characters belonging to the same category are then merged into one named entity, yielding the named entities in the corpus to be processed.
In addition, before the corpus to be processed is input into the bidirectional LSTM network, it may first be input into an input layer, which converts each character into a numeric index number, producing the index-number array corresponding to the corpus.
The index-number array obtained by the input layer is input into a word vector layer, which converts each numeric index number into a word vector, producing the corpus vector of the corpus to be processed; the resulting corpus vector is input into the bidirectional LSTM network.
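The following sketch, under the same placeholder assumptions, mirrors the named entity recognition model described above: a bidirectional LSTM scores each character against the preset categories through a fully connected layer, and a minimal hand-written Viterbi decode stands in for the CRF layer. The tag set and the zero transition matrix are illustrative stand-ins, not learned values.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Bidirectional LSTM that scores each character against the preset
    named entity categories; sizes and tag names are illustrative."""
    def __init__(self, vocab_size=5000, embed_dim=64, hidden=64, num_tags=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Forward and backward passes capture character order both ways.
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                            bidirectional=True)
        # Fully connected layer: per-character score for each category.
        self.fc = nn.Linear(2 * hidden, num_tags)

    def forward(self, index_array):             # (batch, seq_len)
        out, _ = self.lstm(self.embedding(index_array))
        return self.fc(out)                     # (batch, seq_len, num_tags)

def viterbi_decode(emissions, transitions):
    """Minimal Viterbi decode standing in for the CRF layer: picks the
    highest-scoring tag sequence from per-step emission scores and a
    (num_tags x num_tags) transition score matrix."""
    seq_len, num_tags = emissions.shape
    score = emissions[0]
    back = []
    for t in range(1, seq_len):
        # total[i, j] = best score ending in tag i, then moving to tag j.
        total = score.unsqueeze(1) + transitions + emissions[t]
        score, idx = total.max(dim=0)
        back.append(idx)
    best = [int(score.argmax())]
    for idx in reversed(back):                  # backtrack the best path
        best.append(int(idx[best[-1]]))
    return best[::-1]

tags = ["O", "B-singer", "I-singer", "B-song"]       # illustrative tag set
model = BiLSTMTagger(num_tags=len(tags))
emissions = model(torch.randint(1, 100, (1, 6)))[0]  # scores for 6 characters
transitions = torch.zeros(len(tags), len(tags))      # assumed learned by CRF
print([tags[i] for i in viterbi_decode(emissions, transitions)])
```

Adjacent characters decoded into the same category (here, a B- tag followed by matching I- tags) would then be merged into one named entity, as described above.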
As can be seen from the above, in the scheme provided by the embodiment of the present invention, the sample corpora used to train the classification model include not only the existing sample corpora but also the new sample corpora generated from them. Because the sum of the number of new sample corpora and the number of existing sample corpora for each target corpus category reaches the reference sample number, the numbers of sample corpora belonging to the various target corpus categories are well balanced; training the classification model with these sample corpora therefore yields a model whose classification accuracy differs only slightly across corpus categories.
In the case where the classification model is a corpus intent classification model, the initial network parameters of the classification model may be parameters of convolutional layers. In the case where the classification model is a named entity recognition model, the initial network parameters of the classification model may be parameters of a bidirectional LSTM network.
The initial network parameters may be network parameters received locally from another electronic device: the other device trains the neural network model for classification using the corpora in the general corpus and then sends the trained network parameters of that model to the electronic device on which the classification model is deployed, to serve as the classification model's initial network parameters. Alternatively, the network parameters may be obtained locally by directly training the neural network model for classification on the corpora in the general corpus.
Here, the general corpus is a corpus that does not contain the sample corpora.
In this way, the neural network model for classification is first trained on the corpora contained in the general corpus to obtain the initial network parameters of the classification model, and the classification model is then trained further with the sample corpora on that basis. When the number of sample corpora is small, training the classification model directly makes convergence difficult; pre-training the neural network model for classification on the corpora in the general corpus lets it learn the characteristics of those corpora in advance, so that continued training converges more easily even when sample corpora are few.
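A hedged sketch of this warm-start procedure follows, reusing the IntentTextCNN class from the earlier sketch; the checkpoint file name, batch contents, and hyperparameters are all assumptions made for illustration.

```python
import torch

# Warm start: load initial network parameters pre-trained on the general
# corpus. The IntentTextCNN class from the earlier sketch and the
# checkpoint file name are illustrative assumptions.
model = IntentTextCNN()
model.load_state_dict(torch.load("general_corpus_pretrained.pt"))

# A tiny stand-in batch of sample corpora: character index arrays plus
# intent labels (placeholder values, not real training data).
batch_indices = torch.randint(1, 100, (8, 5))   # 8 corpora, 5 characters each
batch_labels = torch.randint(0, 3, (8,))        # 3 preset intents

# Continue training from the warm start; with few sample corpora this
# converges more easily than training from random initial parameters.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.NLLLoss()
for _ in range(10):
    optimizer.zero_grad()
    # The sketched model outputs probabilities, so take their logarithm.
    log_probs = torch.log(model(batch_indices) + 1e-9)
    loss = loss_fn(log_probs, batch_labels)
    loss.backward()
    optimizer.step()
```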
In one embodiment of the present invention, step S402 can be implemented by steps A and B below.
Step A: vectorize the sample corpora based on pre-obtained character vectors to obtain the corpus vectors of the sample corpora.
Here, a character vector is the vector of a character obtained by vectorizing the corpora in the general corpus.
Specifically, different characters correspond to different pre-obtained character vectors. During vectorization of a sample corpus, the pre-obtained character vector of each character in it is determined, and these vectors together form the corpus vector of the sample corpus.
The pre-obtained character vectors may be produced by vectorizing each character appearing in the corpora contained in the general corpus, yielding one vector per character.
For example, if the character "h" corresponds to the character vector H, then the vector of the character "h" appearing in a sample corpus is determined to be H.
Step B: train the classification model for classifying corpora using the obtained corpus vectors.
As can be seen from the above, the pre-obtained character vectors are produced by vectorizing the corpora in the general corpus, and because the general corpus contains a large number of corpora, these character vectors are highly accurate; vectorizing the sample corpora with such accurate character vectors therefore yields highly accurate corpus vectors for the sample corpora.
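As a small illustration of the vectorization half of steps A and B, the sketch below looks up an assumed pre-obtained character-vector table (taken to have been trained on the general corpus) and stacks the per-character vectors into a corpus vector; the vector values and dimensionality are placeholders.

```python
import numpy as np

# Character vectors assumed to have been pre-trained on the general corpus;
# one vector per character (the values here are illustrative only).
char_vectors = {"h": np.array([0.1, 0.3]), "i": np.array([0.7, 0.2])}
UNKNOWN = np.zeros(2)  # fallback for characters unseen in the general corpus

def vectorize(sample_corpus):
    """Step A: look up the pre-obtained vector of every character and
    stack them into the corpus vector of the sample corpus."""
    return np.stack([char_vectors.get(c, UNKNOWN) for c in sample_corpus])

corpus_vector = vectorize("hi")
print(corpus_vector.shape)  # (2, 2): one pre-obtained vector per character
```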
An embodiment of the present invention further provides an electronic device, as shown in fig. 5, comprising a processor 501, a communication interface 502, a memory 503, and a communication bus 504, where the processor 501, the communication interface 502, and the memory 503 communicate with one another through the communication bus 504:
a memory 503 for storing a computer program;
the processor 501 is configured to implement any of the method steps in the corpus generating method embodiment when executing the program stored in the memory 503.
When corpora are generated by the electronic device provided by the embodiment of the invention, after the new sample corpora are generated, the sum of the number of existing sample corpora and the number of new sample corpora belonging to each target corpus category reaches the baseline sample number; in other words, these per-category totals are close to one another. On this basis, training the classification model with the existing and new sample corpora together yields a model whose classification accuracy differs little across corpus categories.
Another electronic device is provided in the embodiments of the present invention, as shown in fig. 6, comprising a processor 601, a communication interface 602, a memory 603, and a communication bus 604, where the processor 601, the communication interface 602, and the memory 603 communicate with one another through the communication bus 604:
a memory 603 for storing a computer program;
the processor 601 is configured to implement the method steps of any of the model training method embodiments when executing the program stored in the memory 603.
When the electronic device provided by the embodiment of the invention is applied to model training, the sample corpora used for training the classification model include not only the existing sample corpora but also the new sample corpora generated from them. Because the sum of the number of new sample corpora and the number of existing sample corpora for each target corpus category reaches the reference sample number, the numbers of sample corpora belonging to the various target corpus categories are well balanced; training the classification model with these sample corpora therefore yields a model whose classification accuracy differs only slightly across corpus categories.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one magnetic disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In another embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when being executed by a processor, implements the method steps in any of the above corpus generating method embodiments.
When corpora are generated by running the computer program stored in the computer-readable storage medium provided in this embodiment, after the new sample corpora are generated, the sum of the number of existing sample corpora and the number of new sample corpora belonging to each target corpus category reaches the baseline sample number; in other words, these per-category totals are close to one another. On this basis, training the classification model with the existing and new sample corpora together yields a model whose classification accuracy differs little across corpus categories.
In a further embodiment of the present invention, a computer-readable storage medium is also provided, in which a computer program is stored, which, when being executed by a processor, carries out the method steps of any one of the above-mentioned model training method embodiments.
When the computer program stored in the computer-readable storage medium provided in this embodiment is used to perform model training, the sample corpora used for training the classification model include not only the existing sample corpora but also the new sample corpora generated from them. Because the sum of the number of new sample corpora and the number of existing sample corpora for each target corpus category reaches the reference sample number, the numbers of sample corpora belonging to the various target corpus categories are well balanced; training the classification model with these sample corpora therefore yields a model whose classification accuracy differs only slightly across corpus categories.
In another embodiment of the present invention, there is also provided a computer program product including instructions, which when run on a computer, causes the computer to perform the method steps of any of the above corpus generating method embodiments.
When the computer program product provided by this embodiment is executed to generate corpora, after the new sample corpora are generated, the sum of the number of existing sample corpora and the number of new sample corpora belonging to each target corpus category reaches the baseline sample number; in other words, these per-category totals are close to one another. On this basis, training the classification model with the existing and new sample corpora together yields a model whose classification accuracy differs little across corpus categories.
In a further embodiment provided by the present invention, there is also provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method steps of any of the above described model training method embodiments.
When the computer program product provided in this embodiment is executed to perform model training, the sample corpora used for training the classification model include not only the existing sample corpora but also the new sample corpora generated from them. Because the sum of the number of new sample corpora and the number of existing sample corpora for each target corpus category reaches the reference sample number, the numbers of sample corpora belonging to the various target corpus categories are well balanced; training the classification model with these sample corpora therefore yields a model whose classification accuracy differs only slightly across corpus categories.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server, or data center to another via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It should be noted that, in this document, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in this specification are described in a related manner; for the same or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, the descriptions of the electronic device, the computer-readable storage medium, and the computer program product are relatively brief because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding parts of the method embodiment descriptions.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (7)

1. A corpus generation method, comprising:
obtaining the existing quantity of existing sample corpora belonging to each target corpus category, wherein the existing sample corpora are existing corpora used for training a classification model that is applicable to a target application scenario and classifies corpora, and the target corpus categories are corpus categories that are set for the target application scenario and match the classification results of the classification model;
for each target corpus category, taking the baseline sample number as a reference, adjusting the existing sample corpora according to word slots contained in the existing sample corpora belonging to the target corpus category to generate new sample corpora, so that the sum of the number of the generated new sample corpora and the number of the existing sample corpora corresponding to the target corpus category reaches the baseline sample number;
wherein the baseline sample number is obtained by: for each of a plurality of different preset quantities and for each target corpus category, training an initial model of the classification model using the preset quantity of sample corpora belonging to the target corpus category, and obtaining the accuracy with which the trained classification model classifies corpora belonging to the target corpus category; taking, for each target corpus category, the preset quantity at which the corresponding accuracy reaches a preset accuracy as the reference sample corpus quantity; and taking the minimum of the reference sample corpus quantities, or any number larger than the minimum, as the baseline sample number;
wherein adjusting the existing sample corpora according to the word slots contained in the existing sample corpora belonging to the target corpus category to generate new sample corpora comprises: generating new sample corpora according to a first generation mode; judging whether the sum of the number of the generated new sample corpora and the number of the existing sample corpora corresponding to the target corpus category reaches the baseline sample number; and, if not, generating new sample corpora according to at least one of a second generation mode and a third generation mode;
wherein the first generation mode is: replacing a word slot contained in an existing sample corpus belonging to the target corpus category with first information to generate a new sample corpus, the first information being a preset word slot that has the same word slot type as the word slot contained in the existing sample corpus and is not contained in the existing sample corpora;
the second generation mode is: deleting second information from an existing sample corpus belonging to the target corpus category to generate a new sample corpus, the second information being non-key characters other than the word slots and the existing vocabulary contained in the existing sample corpus, the existing vocabulary being the vocabulary other than the word slots contained in the existing sample corpus; and
the third generation mode is: adding third information at a position outside the word slots and the existing vocabulary contained in an existing sample corpus belonging to the target corpus category to generate a new sample corpus, the third information being preset characters that do not affect the semantics expressed by the corpus.
2. The method according to claim 1, wherein generating new sample corpora according to at least one of the second generation mode and the third generation mode comprises:
generating new sample corpora according to the second generation mode;
judging whether the sum of the number of the generated new sample corpora and the number of the existing sample corpora corresponding to the target corpus category reaches the baseline sample number;
and, if not, generating new sample corpora according to the third generation mode.
3. The method according to claim 1, wherein adjusting the existing sample corpora according to the word slots contained in the existing sample corpora belonging to the target corpus category to generate new sample corpora, such that the sum of the number of the generated new sample corpora and the number of the existing sample corpora corresponding to the target corpus category reaches the baseline sample number, comprises:
adjusting the existing sample corpora belonging to the target corpus category according to the word slots they contain, to generate new sample corpora;
judging whether the sum of the number of the generated new sample corpora and the number of the existing sample corpora corresponding to the target corpus category reaches the baseline sample number;
and, if not, copying sample corpora belonging to the target corpus category, so that the sum of the number of the copied corpora, the generated new sample corpora, and the existing sample corpora belonging to the target corpus category reaches the baseline sample number.
4. A method of model training, the method comprising:
obtaining sample corpora, wherein the sample corpora comprise existing sample corpora and new sample corpora, the new sample corpora being generated according to the method of any one of claims 1-3;
and training a classification model for classifying the corpora by using the sample corpora.
5. The method of claim 4, wherein the initial network parameters of the classification model are network parameters obtained by training a neural network model for classification on corpora in a general corpus, the general corpus being a corpus that does not contain the sample corpora.
6. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1-3 or 4-5 when executing a program stored in the memory.
7. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method steps of any one of claims 1 to 3 or 4 to 5.
CN202010664773.5A 2020-07-10 2020-07-10 Corpus generation and model training method Active CN111831823B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010664773.5A CN111831823B (en) 2020-07-10 2020-07-10 Corpus generation and model training method

Publications (2)

Publication Number Publication Date
CN111831823A CN111831823A (en) 2020-10-27
CN111831823B true CN111831823B (en) 2022-05-13

Family

ID=72899812

Country Status (1)

Country Link
CN (1) CN111831823B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613572B (en) * 2020-12-30 2024-01-23 北京奇艺世纪科技有限公司 Sample data obtaining method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104346406A (en) * 2013-08-08 2015-02-11 北大方正集团有限公司 Training corpus expanding device and training corpus expanding method
CN110276075A (en) * 2019-06-21 2019-09-24 腾讯科技(深圳)有限公司 Model training method, name entity recognition method, device, equipment and medium
CN111159999A (en) * 2019-12-05 2020-05-15 中移(杭州)信息技术有限公司 Method and device for filling word slot, electronic equipment and storage medium
CN111191450A (en) * 2019-12-27 2020-05-22 深圳市优必选科技股份有限公司 Corpus cleaning method, corpus entry device and computer-readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3446241A4 (en) * 2017-06-20 2019-11-06 Accenture Global Solutions Limited Automatic extraction of a training corpus for a data classifier based on machine learning algorithms
CN111143479B (en) * 2019-12-10 2023-09-01 易点生活数字科技有限公司 Knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm

Also Published As

Publication number Publication date
CN111831823A (en) 2020-10-27

Similar Documents

Publication Publication Date Title
US11651163B2 (en) Multi-turn dialogue response generation with persona modeling
CN106897428B (en) Text classification feature extraction method and text classification method and device
WO2021068352A1 (en) Automatic construction method and apparatus for faq question-answer pair, and computer device and storage medium
WO2020244073A1 (en) Speech-based user classification method and device, computer apparatus, and storage medium
US20130060769A1 (en) System and method for identifying social media interactions
CN108027814B (en) Stop word recognition method and device
CN108304375A (en) A kind of information identifying method and its equipment, storage medium, terminal
WO2022042125A1 (en) Named entity recognition method
CN110377739B (en) Text emotion classification method, readable storage medium and electronic device
CN112100354A (en) Man-machine conversation method, device, equipment and storage medium
US11416539B2 (en) Media selection based on content topic and sentiment
CN111090771A (en) Song searching method and device and computer storage medium
CN111581388B (en) User intention recognition method and device and electronic equipment
CN112131885A (en) Semantic recognition method and device, electronic equipment and storage medium
CN114118100A (en) Method, apparatus, device, medium and program product for generating dialogue statements
CN117591663B (en) Knowledge graph-based large model promt generation method
CN108763202A (en) Method, apparatus, equipment and the readable storage medium storing program for executing of the sensitive text of identification
CN111831823B (en) Corpus generation and model training method
CN114970538A (en) Text error correction method and device
CN110874408B (en) Model training method, text recognition device and computing equipment
CN113656575B (en) Training data generation method and device, electronic equipment and readable medium
CN112541357B (en) Entity identification method and device and intelligent equipment
CN114676257A (en) Conversation theme determining method and device
CN114254622A (en) Intention identification method and device
CN113689860A (en) Training method, device and equipment of voice recognition model and voice recognition method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220407

Address after: 430051 No. b1336, chuanggu startup area, taizihu cultural Digital Creative Industry Park, No. 18, Shenlong Avenue, Wuhan Economic and Technological Development Zone, Hubei Province

Applicant after: Yikatong (Hubei) Technology Co.,Ltd.

Address before: No.c101, chuanggu start up zone, taizihu cultural Digital Industrial Park, No.18 Shenlong Avenue, Wuhan Economic and Technological Development Zone, Hubei Province

Applicant before: HUBEI ECARX TECHNOLOGY Co.,Ltd.

GR01 Patent grant