CN114372514A - Training method, device, equipment and storage medium of cheating text recognition model - Google Patents

Training method, device, equipment and storage medium of cheating text recognition model

Info

Publication number
CN114372514A
CN114372514A
Authority
CN
China
Prior art keywords
model
cheating
corpus
training
version
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111564589.4A
Other languages
Chinese (zh)
Inventor
李迪
马晶义
宋丹丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111564589.4A priority Critical patent/CN114372514A/en
Publication of CN114372514A publication Critical patent/CN114372514A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Character Discrimination (AREA)

Abstract

The application discloses a training method, device, equipment and storage medium of a cheating text recognition model, and relates to the field of computer technology, in particular to artificial intelligence fields such as natural language processing and deep learning. The specific implementation scheme is as follows: acquiring a newly added corpus; acquiring a first training corpus from the newly added corpus; generating a second training corpus according to the historical corpus and the first training corpus; determining a target version model from the historical versions of the cheating text recognition model; and performing incremental training on the target version model based on incremental learning according to the second training corpus, and taking the model obtained after the incremental training as the latest version of the cheating text recognition model. Because the model does not need to be retrained from scratch, model development time is reduced, the model can respond in time to rapidly changing online cheating content, and the accuracy of the cheating text recognition model is improved.

Description

Training method, device, equipment and storage medium of cheating text recognition model
Technical Field
The application relates to the field of computer technology, in particular to artificial intelligence fields such as natural language processing and deep learning, and specifically to a training method, device, equipment and storage medium of a cheating text recognition model.
Background
With the development of the internet, communication platforms such as internet forums, personal home pages and chat software keep increasing, and users publish texts on them to express their opinions or communicate with other users. This massive text content inevitably contains a large amount of cheating text, such as promotional spam, sensitive content and content with violent tendencies. Cheating text not only harms the user experience and pollutes the network environment but can also cause social harm, so cheating text must be identified within the massive text content and blocked. Machine learning models, typified by deep learning, have become the primary way to identify cheating text.
Disclosure of Invention
The application provides a training method, a device, equipment and a storage medium for a cheating text recognition model.
According to a first aspect of the present application, there is provided a training method of a cheating text recognition model, including:
acquiring a newly added corpus;
acquiring a first training corpus from the newly added corpus;
generating a second training corpus according to the historical corpus and the first training corpus;
determining a target version model from the historical versions of the cheating text recognition model;
and performing incremental training on the target version model based on incremental learning according to the second training corpus, and taking the model obtained after the incremental training as the cheating text recognition model of the latest version.
According to a second aspect of the present application, there is provided a training apparatus for a cheating-text-recognition model, comprising:
and the first acquisition module is used for acquiring the newly added corpora.
And the second acquisition module is used for acquiring the first training corpus from the newly added corpus.
And the generating module is used for generating a second training corpus according to the historical corpus and the first training corpus.
And the determining module is used for determining a target version model from the historical versions of the cheating text recognition models.
And the training module is used for carrying out incremental training on the target version model based on incremental learning according to the second training corpus, and taking the model obtained after the incremental training as the cheating text recognition model of the latest version.
According to a third aspect of the present application, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of training a cheating-text recognition model according to the first aspect.
According to a fourth aspect of the present application, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the training method of the cheating-text recognition model of the first aspect.
According to a fifth aspect of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the training method of the cheating-text recognition model according to the first aspect described above.
According to the technical scheme of the application, a target version model is determined from the historical versions of the cheating text recognition model, and fine-tuning is performed on the basis of the target version model in an incremental learning manner, so that only the changes brought by the newly added cheating data are learned. The model does not need to be retrained from scratch, which shortens model development time, reduces model training cost, enables the model to respond in time to rapidly changing online cheating content, and improves model timeliness, so that residual cheating content online can be effectively reduced. In addition, because the target version model is incrementally trained on both the first training corpus drawn from the newly added corpus and the historical corpus, the model learns to recognize the newly added cheating corpus without losing its ability to recognize the historical cheating corpus. This effectively alleviates the problem that model performance degrades sharply when new knowledge interferes with old knowledge as the model continually acquires knowledge from a non-stationary data distribution, further improving the accuracy of the cheating text recognition model.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
Fig. 1 is a schematic flowchart of a training method of a cheating text recognition model according to the first embodiment of the present application;
Fig. 2 is a schematic flowchart of a training method of a cheating text recognition model according to the second embodiment of the present application;
Fig. 3 is a schematic flowchart of a training method of a cheating text recognition model according to the third embodiment of the present application;
Fig. 4 is a schematic flowchart of a training method of a cheating text recognition model according to the fourth embodiment of the present application;
Fig. 5 is a block diagram of a training apparatus of a cheating text recognition model according to the fifth embodiment of the present application;
Fig. 6 is a block diagram of a training apparatus of a cheating text recognition model according to the sixth embodiment of the present application;
Fig. 7 is a block diagram of a training apparatus of a cheating text recognition model according to the seventh embodiment of the present application;
Fig. 8 is a block diagram of an electronic device for implementing a training method of a cheating text recognition model according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding; these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
It should be noted that the terms "first", "second", "third" and "fourth" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first", "second", "third" or "fourth" may explicitly or implicitly include at least one such feature.
It should be noted that, in the technical solution of the present application, the acquisition, storage and application of the personal information of the users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
At present, cheating text on a network is mostly identified along dimensions such as user behavior, user attributes and text content. Among these, identification based on text content is the most basic and one of the most effective techniques. In the related art, the main way to identify cheating text is with machine learning models typified by deep learning. The conventional approach is to collect a large amount of cheating and non-cheating text online, train a model on the collected text, and then apply the model to the online cheating-text identification task. However, online cheating content is varied and changes rapidly; when it changes, the existing model can no longer meet business requirements, and the cheating and non-cheating texts must be collected again and the steps of model training and deployment repeated.
Therefore, the application provides a training method, a device, equipment and a storage medium of a cheating text recognition model. Specifically, a training method, an apparatus, a device, and a storage medium of a cheating-text recognition model according to embodiments of the present application are described below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a training method for a cheating text recognition model according to an embodiment of the present application. The method for training a cheating-text recognition model according to the embodiment of the present application may be applied to a device for training a cheating-text recognition model according to the embodiment of the present application, and the device for training a cheating-text recognition model may be configured on an electronic device.
As shown in fig. 1, the training method of the cheating text recognition model may include the following steps:
Step 101, acquiring the newly added corpus.
It should be noted that, in the embodiment of the present application, the newly added corpus may include an audit corpus and other corpora. The audit corpus may be a corpus randomly extracted from part of the newly added online corpora and labeled manually. The other corpora may be newly added cheating corpora identified along other dimensions such as user behavior and user attributes.
Step 102, acquiring the first training corpus from the newly added corpus.
As an example, a portion of the newly added corpus may be randomly extracted as the first training corpus. For example, a portion may be randomly extracted from the newly added corpus at a certain ratio and used as the first training corpus.
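As an illustration only, a minimal sketch of such ratio-based random extraction is shown below; the function name, ratio and seed are assumptions, not values from the application.

```python
import random

def sample_first_corpus(new_corpus, ratio=0.3, seed=42):
    """Randomly extract a portion of the newly added corpus, at the
    given ratio, to serve as the first training corpus."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    k = int(len(new_corpus) * ratio)   # size of the extracted portion
    return rng.sample(new_corpus, k)   # sampling without replacement
```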
Step 103, generating the second training corpus according to the historical corpus and the first training corpus.
It should be noted that, in the embodiment of the present application, the historical corpus may be understood as the newly added corpora that were used when the historical versions of the cheating text recognition model were incrementally trained.
It should be further noted that, in the embodiment of the present application, a corpus resampling method may be adopted: part of the newly added corpus used in the incremental training of each historical version of the cheating text recognition model is collected as the historical corpus, and the collected historical corpus and the first training corpus together serve as the second training corpus. As an example, let l be the time of a historical incremental-training round of the cheating text recognition model and t the time of the present incremental-training round, where t > l. In the round at time t, part of the historical corpus can be collected into the second training corpus by resampling from the newly added corpora used in the historical rounds. The resampling probability is given by formula (1):
p(t, l) = e^(-λ(t-l))    (1)
wherein p (t, l) is the probability that each corpus sample at the l incremental training time is added into the training set at the t incremental training time; the hyper-parameter lambda is determined manually by trial and error. That is, for the incremental training, a part of the history corpus may be resampled from the history corpus by using the above formula (1) and used as the second corpus together with the first corpus.
Step 104, determining a target version model from the historical versions of the cheating text recognition model.
It should be noted that, in the embodiment of the present application, the historical versions may include the initial version model and the version models obtained from each incremental-training round before the current one. The initial version model can be understood as the model as built, before it has been trained on any corpus.
As an example, if the current training of the cheating text recognition model is the initial training, no non-initial version exists among the historical versions, so the parameters of the initial version model can be randomly initialized and the randomly initialized initial version model is determined as the target version model. If the current training is not the initial training, for example the third round of training, two non-initial version models exist among the historical versions. Optionally, the version model whose incremental-training time is adjacent to the current incremental-training time may be selected from the historical versions as the target version model.
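A sketch of this selection rule, under the assumption that retained versions are kept in a list ordered by training time, might look as follows; the names are hypothetical.

```python
def select_target_version(retained_versions, build_initial_model):
    """Choose the model to fine-tune in the current incremental round.

    retained_versions: non-initial versions that passed verification,
    ordered by incremental-training time (oldest first).
    build_initial_model: factory returning a freshly built model with
    randomly initialized parameters."""
    if retained_versions:
        # the version whose training time is adjacent to (i.e.
        # immediately precedes) the current round
        return retained_versions[-1]
    return build_initial_model()   # initial training: random init
```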
Step 105, performing incremental training on the target version model based on incremental learning according to the second training corpus, and taking the model obtained after the incremental training as the latest version of the cheating text recognition model.
In some embodiments of the present application, the cheating text recognition model may be ALBERT (A Lite BERT, a lightweight variant of the BERT pre-trained language representation model), or another classifier model such as BERT or TextCNN (a convolutional neural network for text classification); the present application is not limited in this respect.
It should be noted that, in the embodiment of the present application, the second training corpus includes the first training corpus extracted from the newly added corpus as well as the historical corpus. The second training corpus is input into the target version model for incremental training so as to fine-tune the parameters of the target version model, and the model obtained after the incremental training is taken as the latest version of the cheating text recognition model. The latest version model is then applied to the online text recognition task: online text data is input to the latest version of the cheating text recognition model, and the model outputs the probability that the text data is cheating text. The output probability can subsequently be pushed to the service side, which handles it according to its own requirements, for example blocking text data whose cheating probability is above a first preset value.
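For illustration, a minimal fine-tuning and inference sketch using the Hugging Face transformers library is shown below. The checkpoint name, hyper-parameters and sample data are assumptions; in the described method the retained target version checkpoint would be loaded instead of the public ALBERT weights.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative second training corpus: (text, label) pairs, label 1 = cheating.
second_corpus = [
    ("Limited-time offer, click the link to claim your prize!", 1),
    ("The weather was lovely at the park today.", 0),
]

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
model = AutoModelForSequenceClassification.from_pretrained(
    "albert-base-v2", num_labels=2)   # stand-in for the target version

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for text, label in second_corpus:     # single-sample batches for brevity
    batch = tokenizer(text, truncation=True, return_tensors="pt")
    loss = model(**batch, labels=torch.tensor([label])).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Inference: probability that an online text is cheating text.
model.eval()
with torch.no_grad():
    enc = tokenizer("Click here to win big money now!", return_tensors="pt")
    p_cheat = torch.softmax(model(**enc).logits, dim=-1)[0, 1].item()
print(f"cheating probability: {p_cheat:.3f}")
```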
According to the training method of the cheating text recognition model of the embodiment of the application, a target version model is determined from the historical versions of the cheating text recognition model, and fine-tuning is performed on the basis of the target version model in an incremental learning manner, that is, only the changes brought by the newly added cheating data are learned. The model does not need to be retrained from scratch, which shortens model development time, reduces model training cost, enables the model to respond in time to rapidly changing online cheating content, and improves model timeliness, so that residual cheating content online can be effectively reduced. In addition, because the target version model is incrementally trained on both the first training corpus drawn from the newly added corpus and the historical corpus, the model learns to recognize the newly added cheating corpus without losing its ability to recognize the historical cheating corpus, effectively alleviating the problem that model performance degrades sharply when new knowledge interferes with old knowledge as the model continually acquires knowledge from a non-stationary data distribution, and further improving the accuracy of the cheating text recognition model.
It should be noted that, in order to further improve the accuracy of the cheating text recognition model, after the target version model has been incrementally trained on the second training corpus to obtain the latest version of the cheating text recognition model, the latest version obtained after each incremental-training round can be tested and screened, so that only versions with good performance are retained. As an example, fig. 2 is a schematic flowchart of a training method of a cheating text recognition model according to the second embodiment of the present application. As shown in fig. 2, the training method of the cheating text recognition model may include the following steps:
step 201, acquiring a new corpus.
Step 202, obtaining a first training corpus and a test corpus from the newly added corpus.
It should be noted that the newly added corpus includes corpora obtained in different ways, such as the manually labeled audit corpus and newly added cheating corpora identified along other dimensions such as user behavior and user attributes. In some embodiments of the present application, the newly added corpus may be randomly divided into the first training corpus and the test corpus at a ratio of 1:1. The first training corpus is used to train the cheating text recognition model, and the test corpus is used to verify the latest version of the cheating text recognition model obtained after incremental training.
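A minimal sketch of this 1:1 split is shown below; the seed and function name are illustrative assumptions.

```python
import random

def split_new_corpus(new_corpus, seed=0):
    """Randomly divide the newly added corpus 1:1 into the first
    training corpus and the test corpus."""
    rng = random.Random(seed)
    shuffled = list(new_corpus)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]   # (training, test)
```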
Step 203, generating the second training corpus according to the historical corpus and the first training corpus.
Step 204, determining a target version model from the historical versions of the cheating text recognition model.
Step 205, performing incremental training on the target version model based on incremental learning according to the second training corpus, and taking the model obtained after the incremental training as the latest version of the cheating text recognition model.
Step 206, verifying the latest version of the cheating text recognition model according to the test corpus.
In some embodiments of the present application, the test corpus is input into the latest version of the cheating text recognition model, and a model output result is obtained, where the model output result may be a probability that the test corpus is the cheating text. And verifying the performance of the cheating text recognition model of the latest version by using the model output result.
Optionally, a second preset threshold may be set: test corpora whose output probability is above the second preset threshold are judged to be cheating text, and those below it to be non-cheating text. If the latest version of the cheating text recognition model identifies a sufficiently large proportion of the cheating samples in the test corpus as cheating text, the latest version passes verification.
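A sketch of such threshold-based verification is given below; the threshold and pass-rate values are illustrative assumptions rather than values from the application.

```python
def verify_latest_model(predict_proba, test_corpus,
                        second_threshold=0.5, pass_rate=0.9):
    """Verify the latest version of the model on the test corpus.

    predict_proba(text) returns the model's cheating probability; a
    sample is judged cheating when that probability exceeds the second
    preset threshold, and the model passes verification when it
    recovers at least pass_rate of the labelled cheating samples."""
    cheating = [text for text, label in test_corpus if label == 1]
    hits = sum(predict_proba(t) > second_threshold for t in cheating)
    return bool(cheating) and hits >= pass_rate * len(cheating)
```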
Step 207, in response to the latest version of the cheating text recognition model passing verification, retaining the latest version of the cheating text recognition model.
In some embodiments of the present application, if the latest version of the cheating text recognition model fails verification, it may be discarded rather than saved, and it is not used as a historical version in the next round of model training.
In the embodiment of the present application, step 201, step 203 to step 205 may be implemented by any one of the manners in the embodiments of the present application, and the present application is not specifically limited and is not described again.
According to the training method of the cheating text recognition model of the embodiment of the application, the training corpus and the test corpus are obtained from the newly added corpus, and the training corpus together with part of the historical corpus serves as the training set for the latest round of sequential incremental training. This training set is used to fine-tune a certain version among the historical versions, so that only the changes brought by the newly added cheating data are learned; a new model does not need to be retrained from scratch, which shortens model development time, reduces training cost, enables the model to respond in time to rapidly changing online cheating content, and improves model timeliness. In addition, because the training set contains both the newly added corpora and the historical corpora, the fine-tuned model learns to recognize the newly added cheating corpus without losing its ability to recognize the historical cheating corpus, which effectively alleviates the sharp performance drop caused by new knowledge interfering with old knowledge when the model continually acquires knowledge from a non-stationary data distribution. Finally, testing and screening the latest version of the cheating text recognition model with the test corpus, and retaining only versions with good performance, can further improve the accuracy of the cheating text recognition model.
It should be noted that, in order to further reduce the training cost of the model and improve its timeliness, when the model is incrementally trained, a target version model is determined from the historical versions of the cheating text recognition model and then trained. As an example, fig. 3 is a schematic flowchart of a training method of a cheating text recognition model according to the third embodiment of the present application. As shown in fig. 3, the training method of the cheating text recognition model may include the following steps:
Step 301, acquiring the newly added corpus.
Step 302, obtaining a first training corpus and a test corpus from the newly added corpus.
Step 303, generating the second training corpus according to the historical corpus and the first training corpus.
Step 304, judging whether a non-initial version that passed test verification and was retained exists among the historical versions of the cheating text recognition model. If such a non-initial version exists, go to step 305; otherwise, go to step 307.
Step 305, determining the current time of the incremental training.
Step 306, finding out, from the non-initial versions, the first non-initial version whose incremental-training time is adjacent to the current time, and determining the first non-initial version as the target version model.
Step 307, initializing parameters of the initial version model, and determining the initial version model after parameter initialization as a target version model.
Optionally, when there is no non-initial version that passes the test and is retained in the historical version, the parameters of the initial version model may be randomly initialized, and the initial version model subjected to parameter random initialization is determined as the target version model.
Step 308, performing incremental training on the target version model based on incremental learning according to the second training corpus, and taking the model obtained after the incremental training as the latest version of the cheating text recognition model.
Step 309, verifying the latest version of the cheating text recognition model according to the test corpus.
Step 310, in response to the latest version of the cheating text recognition model passing verification, retaining the latest version of the cheating text recognition model.
In the embodiment of the present application, steps 301 to 303, and steps 308 to 310 may be implemented by any one of the manners in the embodiments of the present application, and the present application is not specifically limited and is not described again.
According to the training method of the cheating text recognition model of the embodiment of the application, a target version model is determined from the historical versions of the cheating text recognition model: if a non-initial version that passed the test and was retained exists among the historical versions, the first non-initial version whose incremental-training time is adjacent to the current time is determined as the target version model; if no such non-initial version exists, the parameters of the initial version model are initialized, and the initialized initial version model is determined as the target version model. A new model does not need to be retrained from the newly added corpus and the historical corpus, which shortens model development time, reduces training cost, enables the model to respond in time to rapidly changing online cheating content, improves model timeliness, and effectively reduces residual cheating content online. In addition, because the target version model is incrementally trained on both the first training corpus drawn from the newly added corpus and the historical corpus, the model learns to recognize the newly added cheating corpus without losing its ability to recognize the historical cheating corpus, effectively alleviating the problem that model performance degrades sharply when new knowledge interferes with old knowledge as the model continually acquires knowledge from a non-stationary data distribution. Finally, after the target version model has been incrementally trained on the second training corpus to obtain the latest version, the latest version obtained after each incremental-training round is tested and screened with the test corpus, and only versions with good performance are retained, further improving the accuracy of the cheating text recognition model.
It should be noted that, in order to update the online cheating corpora in time, in addition to performing incremental training on the model, a part of the online newly added corpora may be extracted and added to the newly added corpora of the next round of model training after manual labeling. As an example, fig. 4 is a flowchart illustrating a training method of a cheating text recognition model according to a fourth embodiment of the present application. As shown in fig. 4, on the basis of the above embodiment, the training method of the cheating text recognition model may further include the following steps:
step 401, acquiring the on-line new corpus, and labeling the on-line new corpus.
In some embodiments of the present application, a part of the newly added corpora on the line may be randomly extracted, and the randomly extracted newly added corpora on the line may be manually labeled.
Step 402, adding the labeled online newly added corpus into the newly added corpus of the next round of model training.
In some embodiments of the present application, the labeled online newly added corpus may be added, as an audit corpus, to the newly added corpus of the next round of model training.
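Putting the pieces together, the sketch below composes the functions from the earlier examples into one hypothetical incremental round; incremental_train and build_initial_model stand in for the fine-tuning and model construction shown above, and all names are assumptions.

```python
def incremental_round(t, retained_versions, history_rounds, new_corpus):
    """One round of the described workflow, composing the sketches above."""
    first_corpus, test_corpus = split_new_corpus(new_corpus)
    second_corpus = resample_history(history_rounds, t) + first_corpus
    target = select_target_version(retained_versions, build_initial_model)
    latest = incremental_train(target, second_corpus)
    if verify_latest_model(latest.predict_proba, test_corpus):
        retained_versions.append(latest)   # keep only verified versions
    history_rounds[t] = first_corpus       # becomes history for later rounds
    # Newly labeled online corpora then feed new_corpus for round t + 1.
```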
According to the training method of the cheating text recognition model of the embodiment of the application, in addition to incrementally training the model, newly added online corpora are obtained and, after manual labeling, added to the newly added corpus of the next round of model training, so that the cheating text recognition model is updated iteratively and its accuracy is further improved.
Fig. 5 is a block diagram of a structure of a training apparatus for a cheating text recognition model according to a fifth embodiment of the present application. As shown in fig. 5, the training device of the cheating text recognition model may include a first obtaining module 501, a second obtaining module 502, a generating module 503, a determining module 504 and a training module 505.
Specifically, the first obtaining module 501 is configured to obtain a new corpus.
The second obtaining module 502 is configured to obtain the first corpus from the new corpus.
The generating module 503 is configured to generate a second corpus according to the history corpus and the first corpus.
A determining module 504 for determining a target version model from the historical versions of the cheating text-recognition model.
And the training module 505 is configured to perform incremental training on the target version model based on incremental learning according to the second training corpus, and use the model obtained after the incremental training as the cheating text recognition model of the latest version.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
According to the training apparatus of the cheating text recognition model of the embodiment of the application, a target version model is determined from the historical versions of the cheating text recognition model, and fine-tuning is performed on the basis of the target version model in an incremental learning manner, that is, only the changes brought by the newly added cheating data are learned. The model does not need to be retrained from scratch, which shortens model development time, reduces model training cost, enables the model to respond in time to rapidly changing online cheating content, and improves model timeliness, so that residual cheating content online can be effectively reduced. In addition, because the target version model is incrementally trained on both the first training corpus drawn from the newly added corpus and the historical corpus, the model learns to recognize the newly added cheating corpus without losing its ability to recognize the historical cheating corpus, effectively alleviating the problem that model performance degrades sharply when new knowledge interferes with old knowledge as the model continually acquires knowledge from a non-stationary data distribution, and further improving the accuracy of the cheating text recognition model.
Fig. 6 is a block diagram of a training apparatus for a cheating-text-recognition model according to a sixth embodiment of the present application. As shown in fig. 6, on the basis of the above embodiment, the training device of the cheating text recognition model may further include: a third acquisition module 606, a verification module 607, and a reservation module 608.
Specifically, the third obtaining module 606 is configured to obtain the test corpus from the new corpus.
And the verification module 607 is configured to verify the cheating text recognition model of the latest version according to the test corpus.
The retention module 608 is configured to retain the latest version of the cheating text recognition model in response to the latest version passing verification.
In some embodiments of the present application, the determining module 604 is further configured to determine whether there is a non-primary version that passes the test verification and is retained in the historical version; determining the current time of the incremental training in response to the fact that the historical version has a non-initial version which passes test verification and is reserved; and finding out a first non-initial version of the incremental training time adjacent to the current time from the non-initial versions, and determining the first non-initial version as a target version model.
In some embodiments of the present application, the determining module 604 is further configured to initialize parameters of the initial version model in response to that no non-initial version which passes the test verification and is retained exists in the historical version, and determine the initial version model initialized by the parameters as the target version model.
Wherein 601-605 in fig. 6 and 501-505 in fig. 5 have the same functions and structures.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
According to the training apparatus of the cheating text recognition model of the embodiment of the application, the training corpus and the test corpus are obtained from the newly added corpus, and the training corpus together with part of the historical corpus serves as the training set for the latest round of sequential incremental training. This training set is used to fine-tune a certain version among the historical versions, so that only the changes brought by the newly added cheating data are learned; a new model does not need to be retrained from scratch, which shortens model development time, reduces training cost, enables the model to respond in time to rapidly changing online cheating content, and improves model timeliness. In addition, because the training set contains both the newly added corpora and the historical corpora, the fine-tuned model learns to recognize the newly added cheating corpus without losing its ability to recognize the historical cheating corpus, which effectively alleviates the sharp performance drop caused by new knowledge interfering with old knowledge when the model continually acquires knowledge from a non-stationary data distribution. Finally, testing and screening the latest version of the cheating text recognition model with the test corpus, and retaining only versions with good performance, can further improve the accuracy of the cheating text recognition model.
Fig. 7 is a block diagram illustrating a configuration of a training apparatus for a cheating-text-recognition model according to a seventh embodiment of the present application. As shown in fig. 7, on the basis of the above embodiment, the training device of the cheating text recognition model may further include: a fourth obtaining module 709 and a newly adding module 710.
Specifically, the fourth obtaining module 709 is configured to obtain the on-line new corpus and label the on-line new corpus.
The adding module 710 is configured to add the labeled online newly added corpus into the newly added corpus of the next round of model training.
Wherein 701-708 in fig. 7 and 601-608 in fig. 6 have the same functions and structures.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
According to the training apparatus of the cheating text recognition model of the embodiment of the application, in addition to incremental training of the model, newly added online corpora are obtained through the fourth obtaining module and, after manual labeling, added by the adding module to the newly added corpus of the next round of model training, so that the cheating text recognition model is updated iteratively and its accuracy is further improved.
There is also provided, in accordance with an embodiment of the present application, an electronic device, a readable storage medium, and a computer program product.
As shown in fig. 8, fig. 8 is a block diagram of an electronic device for implementing a training method of a cheating-text recognition model according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 8, the electronic apparatus includes: one or more processors 801, a memory 802, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 8 illustrates an example with one processor 801.
The memory 802 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the method of training a cheating-text-recognition model provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the training method of the cheating-text-recognition model provided by the present application.
The memory 802 serves as a non-transitory computer readable storage medium, and may be used for storing non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the training method of the cheating-text-recognition model in the embodiments of the application (for example, the first obtaining module 701, the second obtaining module 702, the generating module 703, the determining module 704 and the training module 705, the third obtaining module 706, the verifying module 707 and the keeping module 708, the fourth obtaining module 709 and the adding module 710 shown in fig. 7). The processor 801 executes various functional applications of the server and data processing by running non-transitory software programs, instructions, and modules stored in the memory 802, that is, implements the training method of the cheating text recognition model in the above-described method embodiments.
The memory 802 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by use of the electronic device of the training method of the cheating-text recognition model, and the like. Further, the memory 802 may include high speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 802 optionally includes memory located remotely from the processor 801, which may be connected via a network to an electronic device of a training method of a cheating text recognition model. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device to implement the training method of the cheating text recognition model may further include: an input device 803 and an output device 804. The processor 801, the memory 802, the input device 803, and the output device 804 may be connected by a bus or other means, and are exemplified by a bus in fig. 8.
The input device 803 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device of the training method of the cheating-text-recognition model, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or other input devices. The output devices 804 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs which, when executed by a processor, implement the training method of the cheating text recognition model described in the embodiments above. The one or more computer programs are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor and can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, a host product in a cloud computing service system that remedies the defects of high management difficulty and weak service expansibility in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that the various forms of flow shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order, and the present application is not limited in this respect as long as the desired results of the technical solutions disclosed herein can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (13)

1. A training method of a cheating text recognition model comprises the following steps:
acquiring a newly added corpus;
acquiring a first training corpus from the newly added corpus;
generating a second training corpus according to the historical corpus and the first training corpus;
determining a target version model from the historical versions of the cheating text recognition model;
and performing incremental training on the target version model based on incremental learning according to the second training corpus, and taking the model obtained after the incremental training as the cheating text recognition model of the latest version.
2. The method of claim 1, further comprising:
obtaining a test corpus from the newly added corpus;
verifying the cheating text recognition model of the latest version according to the test corpus;
in response to verification of the latest version of the cheating text recognition model, retaining the latest version of the cheating text recognition model.
3. The method of claim 1, wherein the determining a target version model from the historical versions of the cheating text-recognition model comprises:
judging whether a non-initial version which passes test verification and is reserved exists in the historical version;
determining the current time of the incremental training in response to the fact that a non-initial version which passes test verification and is reserved exists in the historical versions;
and finding out a first non-initial version of which the incremental training time is adjacent to the current time from the non-initial versions, and determining the first non-initial version as a target version model.
4. The method of claim 3, wherein the determining a target version model from the historical versions of the cheating text-recognition model further comprises:
and initializing parameters of an initial version model in response to the fact that no non-initial version which passes test verification and is reserved exists in the historical versions, and determining the initial version model subjected to parameter initialization as the target version model.
5. The method of claim 1, further comprising:
acquiring newly added corpora on the line, and labeling the newly added corpora on the line;
and adding the marked newly added linguistic data on the line into the newly added linguistic data of the next round of model training.
6. A training apparatus of a cheating-text-recognition model, comprising:
the first acquisition module is used for acquiring the newly added corpora;
the second obtaining module is used for obtaining a first training corpus from the newly added corpus;
the generating module is used for generating a second training corpus according to the historical corpus and the first training corpus;
the determining module is used for determining a target version model from the historical versions of the cheating text recognition model;
and the training module is used for carrying out incremental training on the target version model based on incremental learning according to the second training corpus, and taking the model obtained after the incremental training as the cheating text recognition model of the latest version.
7. The apparatus of claim 6, further comprising:
a third obtaining module, configured to obtain the test corpus from the new corpus;
the verification module is used for verifying the cheating text recognition model of the latest version according to the test corpus;
and the retention module is used for responding to the verification of the cheating text recognition model of the latest version and retaining the cheating text recognition model of the latest version.
8. The apparatus according to claim 6, wherein the determining module is specifically configured to:
judging whether a non-initial version which passes test verification and is reserved exists in the historical version;
determining the current time of the incremental training in response to the fact that a non-initial version which passes test verification and is reserved exists in the historical versions;
and finding out a first non-initial version of which the incremental training time is adjacent to the current time from the non-initial versions, and determining the first non-initial version as a target version model.
9. The apparatus of claim 8, wherein the determining module is specifically configured to:
and initializing parameters of an initial version model in response to the fact that no non-initial version which passes test verification and is reserved exists in the historical versions, and determining the initial version model subjected to parameter initialization as the target version model.
10. The apparatus of claim 6, further comprising:
the fourth acquisition module is used for acquiring the on-line newly added corpora and labeling the on-line newly added corpora;
and the adding module is used for adding the marked on-line added corpus into the added corpus of the next round of model training.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 5.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1 to 5.
13. A computer program product comprising a computer program which, when executed by a processor, carries out the steps of the method of any one of claims 1 to 5.
CN202111564589.4A 2021-12-20 2021-12-20 Training method, device, equipment and storage medium of cheating text recognition model Pending CN114372514A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111564589.4A CN114372514A (en) 2021-12-20 2021-12-20 Training method, device, equipment and storage medium of cheating text recognition model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111564589.4A CN114372514A (en) 2021-12-20 2021-12-20 Training method, device, equipment and storage medium of cheating text recognition model

Publications (1)

Publication Number Publication Date
CN114372514A true CN114372514A (en) 2022-04-19

Family

ID=81140989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111564589.4A Pending CN114372514A (en) 2021-12-20 2021-12-20 Training method, device, equipment and storage medium of cheating text recognition model

Country Status (1)

Country Link
CN (1) CN114372514A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117931278A (en) * 2024-01-23 2024-04-26 镁佳(武汉)科技有限公司 Model error correction training method and device, computer equipment and storage medium


Similar Documents

Publication Publication Date Title
CN111507104B (en) Method and device for establishing label labeling model, electronic equipment and readable storage medium
CN111539223A (en) Language model training method and device, electronic equipment and readable storage medium
CN111639710A (en) Image recognition model training method, device, equipment and storage medium
CN112560912A (en) Method and device for training classification model, electronic equipment and storage medium
CN112036509A (en) Method and apparatus for training image recognition models
CN111967262A (en) Method and device for determining entity tag
CN111737994A (en) Method, device and equipment for obtaining word vector based on language model and storage medium
CN111737995A (en) Method, device, equipment and medium for training language model based on multiple word vectors
CN111950291A (en) Semantic representation model generation method and device, electronic equipment and storage medium
CN111079945B (en) End-to-end model training method and device
CN111667056A (en) Method and apparatus for searching model structure
CN111737996A (en) Method, device and equipment for obtaining word vector based on language model and storage medium
CN111582477B (en) Training method and device for neural network model
CN111339759A (en) Method and device for training field element recognition model and electronic equipment
CN112561056A (en) Neural network model training method and device, electronic equipment and storage medium
CN112329453B (en) Method, device, equipment and storage medium for generating sample chapter
CN111831814A (en) Pre-training method and device of abstract generation model, electronic equipment and storage medium
CN111241810A (en) Punctuation prediction method and device
CN111090991A (en) Scene error correction method and device, electronic equipment and storage medium
CN111783427B (en) Method, device, equipment and storage medium for training model and outputting information
CN112380847A (en) Interest point processing method and device, electronic equipment and storage medium
CN111291192B (en) Method and device for calculating triplet confidence in knowledge graph
CN111127191A (en) Risk assessment method and device
CN111967591A (en) Neural network automatic pruning method and device and electronic equipment
CN111966782A (en) Retrieval method and device for multi-turn conversations, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination