CN113158652A - Data enhancement method, device, equipment and medium based on deep learning model - Google Patents

Info

Publication number: CN113158652A (application CN202110420110.3A)
Authority: CN (China)
Prior art keywords: data, original, parameter list, replacement, recognition
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN113158652B
Inventors: 李鹏宇, 李剑锋, 陈又新, 肖京
Current and original assignee: Ping An Technology Shenzhen Co Ltd

Application events:
    • Application filed by Ping An Technology Shenzhen Co Ltd
    • Priority to CN202110420110.3A (granted as CN113158652B)
    • Priority to PCT/CN2021/096475 (published as WO2022222224A1)
    • Publication of CN113158652A
    • Application granted; publication of CN113158652B

Classifications

All classifications fall under G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING:

    • G06F40/295 Named entity recognition (under G06F40/20 Natural language analysis; G06F40/279 Recognition of textual entities; G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking)
    • G06F16/3344 Query execution using natural language analysis (under G06F16/33 Querying; G06F16/3331 Query processing; G06F16/334 Query execution)
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting (under G06F18/20 Analysing; G06F18/21 Design or setup of recognition systems or techniques)
    • G06F40/247 Thesauruses; Synonyms (under G06F40/20 Natural language analysis; G06F40/237 Lexical tools)


Abstract

The invention discloses a data enhancement method, device, equipment and medium based on a deep learning model, used in the field of artificial intelligence and related to the field of blockchain. The method comprises the following steps: randomly initializing an original parameter list according to an artificial fish swarm algorithm to obtain a plurality of optimized parameter lists; converting original training data with each optimized parameter list to obtain corresponding artificially constructed data; mixing the original training data with the corresponding artificially constructed data to obtain a plurality of training sets; training a plurality of recognition models on those training sets; determining whether a model satisfying a convergence condition exists among the plurality of recognition models; and, if so, outputting a target data enhancement parameter list with which to perform data enhancement on the original training data and obtain a training set for a named entity recognition model. In the invention, the artificial fish swarm algorithm serves as the framework, and the recognition effect of the model is incorporated as the optimization target when formulating the data enhancement strategy, improving the data enhancement effect on the data.

Description

Data enhancement method, device, equipment and medium based on deep learning model
Technical Field
The invention relates to the field of artificial intelligence, in particular to a data enhancement method, a data enhancement device, data enhancement equipment and data enhancement media based on a deep learning model.
Background
With the development of intelligent technology, the demand for Named Entity Recognition (NER) tasks keeps growing in application fields of natural language processing such as question-answering systems and machine translation systems. Executing named entity recognition with a model trained on entity data has therefore become an increasingly common recognition approach. To improve the recognition rate of the named entity recognition model for entities in the text to be recognized, its accuracy is usually improved from two directions: enhancing the training data or enhancing the model algorithm.
In the prior art, the data enhancement model for a named entity recognition model mainly replaces entity words in the training data through different data enhancement methods and the parameters corresponding to those methods; for example, it performs synonym replacement, random insertion, random position exchange, random deletion and the like on the entity words in the training data with certain probabilities, so as to increase the scale and diversity of the training data. The enhancement effect of the data enhancement model on the training data is inseparable from its model parameters, but the model parameters of existing data enhancement models are determined by experience or by the parameter optimization method of grid search, with little interaction with the named entity recognition model, so the enhancement effect of the data enhancement model on the training data is poor.
Disclosure of Invention
The invention provides a data enhancement method, device, equipment and medium based on a deep learning model, solving the prior-art problem that the model parameters of the data enhancement model are determined by experience or by the parameter optimization method of grid search, which makes the data enhancement effect of the data enhancement model poor.
A data enhancement method based on a deep learning model comprises the following steps:
acquiring original training data and original test data which are marked manually, and acquiring an original parameter list, wherein the original parameter list is composed of a data enhancement method and enhancement parameters corresponding to the data enhancement method;
randomly initializing the enhanced parameters in the original parameter list according to an artificial fish swarm algorithm to obtain a plurality of optimized parameter lists;
converting the original training data by utilizing each optimization parameter list to obtain corresponding artificially constructed data, and mixing the original training data and the corresponding artificially constructed data to obtain a plurality of training sets;
respectively training by using the training sets to obtain a plurality of recognition models, and testing the recognition models by using the original test data as a test set to determine whether a model meeting a convergence condition exists in the recognition models;
if the model meeting the convergence condition exists in the plurality of identification models, outputting an optimization parameter list corresponding to the model meeting the convergence condition as a target data enhancement parameter list;
and performing data enhancement on the original training data by using the target data enhancement parameter list to obtain a training set of the named entity recognition model.
Further, after determining whether there is a model satisfying a convergence condition among the plurality of recognition models, the method further includes:
if no model satisfying the convergence condition exists among the plurality of recognition models, randomly initializing the enhancement parameters in the original parameter list according to the artificial fish swarm algorithm again to obtain a plurality of randomly re-initialized optimized parameter lists, and counting the initializations;
determining whether the number of times the enhancement parameters in the original parameter list have been randomly initialized is less than a preset number of times;
if the number of times is not less than the preset number of times, stopping the random initialization of the enhancement parameters in the original parameter list;
and if the number of times is less than the preset number of times, training a plurality of new recognition models according to the randomly initialized optimized parameter lists, testing the new recognition models to obtain the target data enhancement parameter list, and obtaining a training set of the named entity recognition model by using the target data enhancement parameter list.
Further, the data enhancement method includes a synonym replacement method, and the converting the original training data by using each of the optimized parameter lists includes:
determining an enhanced parameter corresponding to the synonym replacement method in the optimized parameter list, wherein the enhanced parameter corresponding to the synonym replacement method comprises an entity word category replacement probability and an entity word replacement category;
acquiring a preset synonym dictionary which is pre-constructed by a user according to requirements, wherein entity words which are not forbidden in synonym relation in the same entity category are used as synonyms of each other in the preset synonym dictionary;
and carrying out synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category.
Further, the performing synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category includes:
determining whether the category of each entity word in the original training data belongs to the entity word replacement category;
if the category of the entity word in the original training data belongs to the entity word replacement category, searching the synonym of the entity word in the preset synonym dictionary;
determining whether the synonym relationship between the entity word and each synonym of the entity word is prohibited;
and if the synonym relationship between the entity word and a synonym is not prohibited, selecting one synonym from the preset synonym dictionary as a replacement word according to the entity word category replacement probability, so as to replace the entity word with the replacement word.
Further, the data enhancement method further includes a random replacement method, a random deletion method, a random exchange method and a long sentence construction method, and after the synonym replacement is performed on the entity words in the original training data, the method further includes:
in the optimization parameter list, determining the random replacement probability of the random replacement method and determining the random deletion probability of the random deletion method;
determining the random exchange probability of the random exchange method and determining the sentence length set by the long sentence constructing method;
performing entity word replacement on each sentence in the original training data according to the random replacement probability, and performing same-sentence entity word exchange on each sentence in the original training data according to the random exchange probability;
deleting entity words from each sentence in the original training data according to the random deletion probability to obtain processed data;
and splicing the sentences in the processed data so that, once processing is complete, each sentence has the set sentence length.
Further, the determining whether a convergence model exists in the plurality of recognition models comprises:
determining a highest recognition score for recognizing each word in the test set in the plurality of recognition models;
determining whether the highest recognition score satisfies the convergence condition;
if the highest recognition score meets the convergence condition, determining that the convergence model meeting the convergence condition exists in the plurality of recognition models, wherein the recognition model corresponding to the highest recognition score is the convergence model;
and if the highest identification score does not meet the convergence condition, determining that the convergence model meeting the convergence condition does not exist in the plurality of identification models.
Further, the determining whether the highest recognition score satisfies the convergence condition comprises:
determining a convergence parameter configured by a user;
determining a first highest recognition score for recognizing the t-th word in the test set among the plurality of recognition models;
determining a second highest recognition score for recognizing the (t-1)-th word in the test set among the plurality of recognition models;
subtracting the second highest identification score from the first highest identification score to obtain a highest identification score difference;
determining whether a ratio of the highest recognition score difference to the second highest recognition score is less than the convergence parameter;
determining that the highest recognition score satisfies the convergence condition if the ratio of the highest recognition score difference to the second highest recognition score is less than the convergence parameter;
determining that the highest recognition score does not satisfy the convergence condition if the ratio of the highest recognition score difference to the second highest recognition score is not less than the convergence parameter.
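For illustration, the relative-improvement test laid out in the steps above can be sketched as follows; the list layout of the scores is an assumption made here for illustration (the patent indexes the highest scores by the t-th word of the test set).

```python
def satisfies_convergence(highest_scores, convergence_parameter):
    """Relative-improvement convergence test over successive highest scores."""
    if len(highest_scores) < 2:
        return False                 # both a t-th and a (t-1)-th score are needed
    first = highest_scores[-1]       # first highest recognition score (index t)
    second = highest_scores[-2]      # second highest recognition score (index t-1)
    difference = first - second      # highest recognition score difference
    # converged when the relative gain drops below the configured parameter
    return difference / second < convergence_parameter
```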
A deep learning model-based data enhancement apparatus, comprising:
the acquisition module is used for acquiring manually labeled original training data and original test data and for acquiring an original parameter list, wherein the original parameter list is composed of a data enhancement method and enhancement parameters corresponding to the data enhancement method;
the initialization module is used for randomly initializing the enhanced parameters in the original parameter list according to an artificial fish swarm algorithm so as to obtain a plurality of optimized parameter lists;
the conversion module is used for converting the original training data by utilizing each optimization parameter list to obtain corresponding artificial construction data, and mixing the original training data and the corresponding artificial construction data to obtain a plurality of training sets;
the testing module is used for respectively training the training sets to obtain a plurality of recognition models, and testing the recognition models by taking the original test data as a test set so as to determine whether a model meeting a convergence condition exists in the recognition models;
the output module is used for outputting an optimization parameter list corresponding to the model meeting the convergence condition as a target data enhancement parameter list if the model meeting the convergence condition exists in the plurality of identification models;
and the enhancement module is used for performing data enhancement on the original training data by utilizing the target data enhancement parameter list so as to obtain a training set of the named entity recognition model.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above-described deep learning model-based data enhancement method when executing the computer program.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned deep learning model-based data enhancement method.
In one scheme provided by the data enhancement method, device, equipment and medium based on the deep learning model, manually labeled original training data and original test data are obtained, along with an original parameter list composed of a data enhancement method and the enhancement parameters corresponding to the data enhancement method. The enhancement parameters in the original parameter list are randomly initialized according to an artificial fish swarm algorithm to obtain a plurality of optimized parameter lists; the original training data are converted with each optimized parameter list to obtain corresponding artificially constructed data; the original training data and the corresponding artificially constructed data are mixed to obtain a plurality of training sets; a plurality of recognition models are trained on those training sets and tested with the original test data as the test set to determine whether a model satisfying the convergence condition exists among them; and if such a model exists, the optimized parameter list corresponding to it is output as the target data enhancement parameter list, with which data enhancement is performed on the original training data to obtain the training set of the named entity recognition model. In the invention, an artificial fish swarm algorithm suited to the coexistence of discrete and continuous values randomly initializes the enhancement parameters in the original parameter list, and the recognition effect of the recognition model is incorporated as the optimization target when formulating the data enhancement strategy, so a data enhancement list with a better effect is obtained at lower cost and the data enhancement effect of that list on the data is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a schematic diagram of an application environment of a deep learning model-based data enhancement method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a data enhancement method based on deep learning model according to an embodiment of the present invention;
FIG. 3 is another schematic flow chart of a deep learning model-based data enhancement method according to an embodiment of the present invention;
FIG. 4 is a flowchart of one implementation of step S30 in FIG. 2;
FIG. 5 is a flowchart of one implementation of step S33 in FIG. 4;
FIG. 6 is a flowchart of another implementation of step S30 in FIG. 2;
FIG. 7 is a flowchart of one implementation of step S50 in FIG. 2;
FIG. 8 is a flowchart of one implementation of step S52 in FIG. 7;
FIG. 9 is a schematic structural diagram of a deep learning model-based data enhancement apparatus according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The data enhancement method based on the deep learning model provided by the embodiment of the invention can be applied to the application environment shown in figure 1, in which the terminal device communicates with the server through a network. The server obtains the manually labeled original training data and original test data sent by a user through the terminal device, and obtains an original parameter list sent by the user through the terminal device, the original parameter list being composed of a data enhancement method and enhancement parameters corresponding to the data enhancement method. The server randomly initializes the enhancement parameters in the original parameter list according to an artificial fish swarm algorithm to obtain a plurality of optimized parameter lists, converts the original training data with each optimized parameter list to obtain corresponding artificially constructed data, and mixes the original training data with the corresponding artificially constructed data to obtain a plurality of training sets. It then trains a plurality of recognition models on those training sets and tests them with the original test data as the test set to determine whether a model satisfying the convergence condition exists among them. If such a model exists, the optimized parameter list corresponding to it is output as the target data enhancement parameter list, which is used to perform data enhancement on the original training data and obtain the training set of the named entity recognition model. By randomly initializing the enhancement parameters with an artificial fish swarm algorithm suited to the coexistence of discrete and continuous values, and by incorporating the recognition effect of the recognition model as the optimization target when formulating the data enhancement strategy, a data enhancement list with a better effect is obtained at lower cost. This ensures the data diversity of the named entity recognition model's training set, enlarges the scale of the training set, improves the recognition accuracy of the named entity recognition model, and thereby realizes training data enhancement and artificial-intelligence-based named entity recognition.
The database in this embodiment is stored in a blockchain network and used to store the data used and generated in realizing the deep-learning-model-based data enhancement method, such as the original training data, original test data, original parameter list, artificially constructed data, optimized parameter lists, the plurality of recognition models, and other related data. The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, each containing the information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like. Deploying the database on the blockchain improves the security of data storage.
The terminal device may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers.
In an embodiment, as shown in fig. 2, a data enhancement method based on a deep learning model is provided, which is described by taking the server in fig. 1 as an example, and includes the following steps:
s10: the method comprises the steps of obtaining original training data and original test data which are marked manually, and obtaining an original parameter list, wherein the original parameter list is composed of a data enhancement method and enhancement parameters corresponding to the data enhancement method.
It can be understood that the original parameter list in this embodiment is a data enhancement model composed of data enhancement methods and the enhancement parameters corresponding to those methods. The data enhancement performance of the model depends on its methods and their enhancement parameters, so before the data enhancement model is used, parameter optimization must be performed on the existing model to improve its enhancement of the training data, thereby ensuring the recognition accuracy of the named entity recognition model subsequently trained on that data.
To perform parameter optimization on the existing data enhancement model, the model must first be obtained, that is, its original parameter list, together with the manually labeled original training data and original test data.
S20: and randomly initializing the enhanced parameters in the original parameter list according to an artificial fish swarm algorithm to obtain a plurality of optimized parameter lists.
After the manually labeled original training data and original test data are obtained and the original parameter list is acquired, the enhancement parameters in the original parameter list are randomly initialized, using as a framework the artificial fish swarm algorithm, which converges quickly and suits situations where discrete and continuous values coexist, so as to obtain a plurality of optimized parameter lists.
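By way of illustration, the mixed initialization of step S20 could look like the following minimal Python sketch. The parameter names, value ranges and per-category layout are assumptions made for illustration (the patent leaves the concrete methods and ranges to Table 1); only the idea of drawing continuous and discrete values side by side follows the text.

```python
import random

# Illustrative search space for one parameter list; names and ranges are
# assumptions, chosen only to show discrete and continuous values coexisting.
CONTINUOUS_PARAMS = {
    "random_replace_prob": (0.0, 0.5),  # probability of random replacement
    "random_swap_prob":    (0.0, 0.5),  # probability of random exchange
    "random_delete_prob":  (0.0, 0.5),  # probability of random deletion
}
DISCRETE_PARAMS = {
    "sentence_length": [50, 100, 150],  # long-sentence construction target
}

def random_parameter_list(num_entity_categories=3):
    """One 'artificial fish': a randomly initialized parameter list."""
    params = {name: random.uniform(lo, hi)
              for name, (lo, hi) in CONTINUOUS_PARAMS.items()}
    params.update({name: random.choice(choices)
                   for name, choices in DISCRETE_PARAMS.items()})
    # per-entity-category synonym replacement probabilities p_syn
    params["p_syn"] = [random.random() for _ in range(num_entity_categories)]
    return params

def initialize_swarm(num_fish):
    """Step S20: several independently initialized optimized parameter lists."""
    return [random_parameter_list() for _ in range(num_fish)]
```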
S30: and converting the original training data by utilizing each optimized parameter list to obtain corresponding artificially constructed data, and mixing the original training data and the corresponding artificially constructed data to obtain a plurality of training sets.
After the plurality of optimized parameter lists are obtained, the original training data are converted with each optimized parameter list to obtain the corresponding artificially constructed data, and the original training data and the corresponding artificially constructed data are randomly shuffled together, obtaining a plurality of training sets by mixing.
For example, after L optimization parameter lists are obtained, the original training data is converted by using each optimization parameter list, then L sets of corresponding artificial construction data are obtained, each set of artificial construction data corresponds to one optimization parameter list, and after L sets of corresponding artificial construction data are obtained, the original training data is respectively mixed with each set of artificial construction data to obtain L training sets.
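A minimal sketch of step S30 follows; the `augment(data, params)` helper is a placeholder for the chain of enhancement methods (synonym replacement, random replacement, exchange and deletion, long-sentence construction), and its signature is an assumption.

```python
import random

def build_training_sets(original_data, parameter_lists, augment):
    """Step S30 sketch: build one training set per optimized parameter list."""
    training_sets = []
    for params in parameter_lists:            # L lists -> L training sets
        constructed = augment(original_data, params)  # artificially constructed data
        mixed = original_data + constructed   # mix original and constructed data
        random.shuffle(mixed)                 # random scrambling, per step S30
        training_sets.append(mixed)
    return training_sets
```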
S40: and respectively training by using a plurality of training sets to obtain a plurality of recognition models, and testing the plurality of recognition models by using the original test data as a test set.
After obtaining a plurality of training sets, respectively training by using the plurality of training sets to obtain a plurality of recognition models, taking original test data as a test set, testing the plurality of recognition models by using the test set, and obtaining the recognition effect (recognition score) of each recognition model on each entity word in the test set as a test result.
S50: and determining whether a model meeting a convergence condition exists in the plurality of recognition models according to the test result.
After the original test data is used as a test set to test a plurality of recognition models, whether a model meeting a convergence condition exists in the plurality of recognition models is determined according to the recognition effect of each recognition model on each entity word in the test set, namely a test result. Wherein the plurality of recognition models may be conventional entity recognition models.
S60: and if the models meeting the convergence condition exist in the plurality of identification models, outputting an optimization parameter list corresponding to the models meeting the convergence condition as a target data enhancement parameter list.
After determining whether a model satisfying the convergence condition exists among the plurality of recognition models: if such a model exists, the recognition effect of at least one of the recognition models meets the user requirement, and correspondingly the training set used by that model meets the requirement. The optimized parameter list corresponding to that training set is therefore determined to be a data enhancement list meeting the data enhancement requirement, and the corresponding optimized parameter list is output as the target data enhancement parameter list.
S70: and performing data enhancement on the original training data by using the target data enhancement parameter list to obtain a training set of the named entity recognition model.
After the corresponding optimized parameter list is output as the target data enhancement parameter list, data enhancement is performed on the original training data with the target data enhancement parameter list, and the enhanced data and the original training data are then randomly shuffled together to obtain, by mixing, the training set of the named entity recognition model. A more accurate named entity recognition model can thus be obtained, guaranteeing the recognition precision of the named entity recognition model.
It should be understood that the artificial fish swarm algorithm is a swarm optimization algorithm whose particles are regarded as fish trying to reach the position with the highest food concentration in a water area, thereby improving their living state. In this embodiment, the particles, namely the artificial fish, are the randomly initialized enhancement parameters in the original parameter list; the food concentration is the cost function or loss function of the recognition model; and the swimming of the artificial fish during the algorithm's operation is the process by which the enhancement parameters in the original parameter list gradually approach the optimal position while the value of the cost or loss function approaches its minimum.
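Under that mapping, the fitness ("food concentration") of one artificial fish can be sketched as below; `augment`, `train_model` and `evaluate` are placeholders for the enhancement, training and testing steps, and only the mapping itself (fish = parameter list, food concentration = recognition effect) follows the text.

```python
def food_concentration(params, original_train, test_set,
                       augment, train_model, evaluate):
    """Fitness of one artificial fish (one candidate parameter list)."""
    # enhance the original training data with this fish's parameter list
    enhanced = original_train + augment(original_train, params)
    model = train_model(enhanced)
    # a higher recognition score marks a better position in the 'water area'
    return evaluate(model, test_set)
```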
The original parameter list formed by the data enhancement method and the enhancement parameters corresponding to the data enhancement method can be as shown in table 1:
TABLE 1
[Table 1 is rendered as an image in the original document; it lists the data enhancement methods and the enhancement parameters corresponding to each method.]
In this embodiment, as shown in Table 1, the artificial fish swarm algorithm is used as the framework: the values β1 to β5 and the value p_syn are combined into an original parameter list in which continuous values and discrete values coexist, the list comprising the data enhancement methods and their corresponding enhancement parameters. The artificial fish swarm algorithm iteratively optimizes the enhancement parameters of the original parameter list to obtain an optimized parameter list; the original training data are processed with the optimized parameter list to obtain artificially constructed data, which are mixed with the original training data. A higher-quality training set is thereby obtained at lower cost, ensuring the recognition accuracy of the named entity recognition model.
In this embodiment, manually labeled original training data and original test data are acquired, along with an original parameter list composed of data enhancement methods and their corresponding enhancement parameters. The enhancement parameters in the original parameter list are randomly initialized according to the artificial fish swarm algorithm to obtain a plurality of optimized parameter lists; the original training data are converted with each optimized parameter list to obtain corresponding artificially constructed data; the original training data and the corresponding artificially constructed data are mixed to obtain a plurality of training sets; a plurality of recognition models are trained on those sets and tested with the original test data as the test set to determine whether a model satisfying the convergence condition exists among them; and if so, the optimized parameter list corresponding to that model is output as the target data enhancement parameter list, which is used to perform data enhancement on the original training data and obtain the training set of the named entity recognition model. By randomly initializing the enhancement parameters with an artificial fish swarm algorithm suited to the coexistence of discrete and continuous values, and by incorporating the recognition effect of the recognition model as the optimization target when formulating the data enhancement strategy, a data enhancement list with a better effect is obtained at lower cost, ensuring the data diversity of the named entity recognition model's training set, enlarging the scale of the training set, and improving the recognition accuracy of the named entity recognition model.
In addition, because the enhancement parameters of each data enhancement method in the target data enhancement parameter list are obtained by automatic optimization, the embodiment can support the extension of the data enhancement method, and obtain different data enhancement lists according to the requirements of users, thereby constructing more model training data and further ensuring the accuracy of the model.
In an embodiment, as shown in fig. 3, after step S50, that is, after determining whether there is a model satisfying the convergence condition among the plurality of recognition models according to the test result, the method further includes the following steps:
s80: and if the plurality of identification models do not have a model meeting the convergence condition, randomly initializing the enhanced parameters in the original parameter list according to the artificial fish swarm algorithm again to obtain a plurality of optimized parameter lists after random initialization, and counting.
After determining whether a model satisfying the convergence condition exists among the plurality of recognition models: if no such model exists, the recognition effect of the recognition models does not meet the user requirement, and the randomly optimized parameter lists do not enhance the original training data sufficiently. At this point the enhancement parameters in the original parameter list are randomly initialized again according to the artificial fish swarm algorithm; a plurality of recognition models are then trained according to the randomly re-initialized optimized parameter lists and tested, until the target data enhancement parameter list is obtained. Meanwhile, each time the enhancement parameters in the original parameter list are randomly initialized according to the artificial fish swarm algorithm, the number of repeated random initializations must be recorded and counted.
S90: and determining whether the times of random initialization of the enhanced parameters of the original parameter list is less than the preset times.
S100: and if the times of carrying out random initialization on the enhanced parameters in the original parameter list are not less than the preset times, stopping carrying out random initialization on the enhanced parameters in the original parameter list.
After determining whether the times of randomly initializing the enhanced parameters of the original parameter list is less than the preset times, if the times of randomly initializing the enhanced parameters of the original parameter list is not less than the preset times, the number of iterations is excessive, in order to reduce the calculation burden, the random initialization of the enhanced parameters in the original parameter list needs to be stopped, the times can be used for outputting an optimized parameter list corresponding to a model close to a convergence condition to be used as a target data enhanced parameter list, and then the target data enhanced parameter list is used for performing data enhancement on the original training data to obtain a training set of the named entity recognition model.
S110: if the number of times of randomly initializing the enhanced parameters of the original parameter list is less than the preset number of times, the steps S30-S70 are repeatedly executed.
After determining whether the number of times of randomly initializing the enhanced parameters of the original parameter list is less than the preset number of times, if the number of times of randomly initializing the enhanced parameters of the original parameter list is less than the preset number of times, and at this time, the target data enhanced parameter list is not determined, the above steps S30-S70 need to be repeatedly performed, that is, a plurality of new recognition models need to be obtained by retraining according to the randomly initialized optimized parameter list, and the plurality of new recognition models are tested to obtain the target data enhanced parameter list, and a training set of the named entity recognition model is obtained by using the target data enhanced parameter list.
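The outer loop of steps S20-S110 could be sketched as follows, reusing `initialize_swarm` and `food_concentration` from the earlier sketches; the helper names, the `converged` callback and the bookkeeping are illustrative assumptions, not the patent's prescribed implementation.

```python
def search_target_parameter_list(original_train, test_set, preset_times,
                                 num_fish, augment, train_model, evaluate,
                                 converged):
    """Repeat random initialization until convergence or the preset count."""
    best_params, best_score, history = None, float("-inf"), []
    for _ in range(preset_times):            # counted random initializations
        swarm = initialize_swarm(num_fish)   # random (re-)initialization
        for params in swarm:                 # train and test one model each
            score = food_concentration(params, original_train, test_set,
                                       augment, train_model, evaluate)
            if score > best_score:
                best_params, best_score = params, score
        history.append(best_score)
        if converged(history):               # a model meets the condition
            return best_params               # target data enhancement list
    # preset count reached: fall back to the list whose model came closest
    return best_params
```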
In this embodiment, after determining whether a model satisfying the convergence condition exists among the plurality of recognition models: if none exists, the enhancement parameters in the original parameter list are randomly initialized again according to the artificial fish swarm algorithm to obtain a plurality of randomly re-initialized optimized parameter lists, and the initializations are counted; it is then determined whether the number of random initializations is less than the preset number of times, and if so, steps S30-S70 are repeated. This defines the operation to be performed when the recognition models do not converge: the artificial fish swarm algorithm optimizes the parameters of the original parameter list over multiple rounds, with the recognition effect of the recognition models as the target, until an optimized parameter list that satisfies the user is obtained. This guarantees the parameter performance of the optimized parameter list and, in turn, the data enhancement effect.
In an embodiment, the data enhancement method includes a synonym replacement method, as shown in fig. 4, in step S30, converting the original training data by using each optimized parameter list, which includes the following steps:
s31: and determining enhancement parameters corresponding to the synonym replacement method in the optimized parameter list, wherein the enhancement parameters corresponding to the synonym replacement method comprise entity word category replacement probability and entity word replacement category.
In this embodiment, the data enhancement method in the optimized parameter list includes a synonym replacement method, and the enhancement parameter corresponding to the synonym replacement method is determined in the optimized parameter list, where the enhancement parameter corresponding to the synonym replacement method includes an entity word class replacement probability and an entity word replacement class.
S32: and acquiring a preset synonym dictionary which is pre-constructed by a user according to requirements, wherein in the preset synonym dictionary, the entity words which are not forbidden in synonym relation in the same entity category are used as synonyms of each other.
Before converting the original training data with the data enhancement methods and corresponding enhancement parameters in each optimized parameter list, a preset synonym dictionary must be obtained as the source for converting entity words in the original training data. The preset synonym dictionary is pre-constructed by the user according to requirements and contains entity words of different entity categories. In the dictionary, entity words of the same entity category serve as synonyms of one another; however, the synonym relationship between specific entity words may be prohibited, and entity words whose synonym relationship is prohibited cannot be used as synonyms of each other.
In this embodiment, the number of entity words in the preset synonym dictionary is increased by relaxing the criterion for synonyms, treating entity words of the same entity category as synonyms: if the new sentence obtained by replacing word A in a sentence with word B is still reasonable in semantics and syntax, then word B belongs to the same entity category as word A and is a synonym of word A. Entity words of the same category are collected to form the preset synonym dictionary. For example, in the sentence "the Monkey King was pinned under Wuxing Mountain", "the Monkey King" can be replaced with the character's other names, so those names are all synonyms of one another.
In this embodiment, the quality of the preset synonym dictionary is improved by prohibiting the synonym relationship between specific words. In everyday usage, some entity words belong to the same entity category, yet substituting one for the other breaks the grammar of the sentence; in that case the synonym relationship between the two entity words must be prohibited, i.e., the two entity words are not synonyms. For example, replacing "Wuxing Mountain" in the sentence above with "Yellow River" yields "the Monkey King was pinned under the Yellow River", which is unreasonable; therefore the synonym relationship between "Wuxing Mountain" and "Yellow River" is prohibited in the preset synonym dictionary, and the two cannot replace each other during synonym replacement.
In this embodiment, the sentence about the Monkey King being pinned under Wuxing Mountain and the entity words above are used only as examples to explain synonyms; in other embodiments, other sentences and entity words may serve as examples.
The synonyms in the preset synonym dictionary can be stored in the form of Table 2, which comprises four columns: the first column is a sequence number, the second and third columns are two different words A and B, and the fourth column is the relationship between word A and word B. If word A can be replaced by word B, words A and B are synonyms of each other; if word A cannot be replaced by word B, they are not synonyms of each other. The contents of the preset synonym dictionary are shown in Table 2 below:
TABLE 2
[Table 2 is rendered as an image in the original document; per the description above, its columns are: sequence number, word A, word B, and whether words A and B are synonyms of each other.]
S33: and carrying out synonym replacement on the entity words in the original training data according to a preset synonym dictionary, the entity word category replacement probability and the entity word replacement category.
After the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category are obtained, synonym replacement is performed on the entity words in the original training data according to them, yielding the data after synonym replacement; that data is then processed according to the other data enhancement methods and corresponding enhancement parameters in the optimized parameter list to obtain the artificially constructed data. The entity word category replacement probability is the replacement probability for each entity word replacement category: in the optimized parameter list, the probability distribution of replacement over the entity word categories is p_syn = [p_syn,1, p_syn,2, ..., p_syn,K], and, based on the preset synonym dictionary, entity words of category k in the original training data are replaced with synonyms from the preset synonym dictionary with probability p_syn,k.
In this embodiment, the enhancement parameters corresponding to the synonym replacement method, comprising the entity word category replacement probability and the entity word replacement category, are determined in the optimized parameter list; a preset synonym dictionary pre-constructed by the user according to requirements is obtained, in which entity words of the same entity category whose synonym relationship is not prohibited serve as synonyms of one another; and synonym replacement is performed on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category. This details the step of converting the original training data with each optimized parameter list. Relaxing the criterion for entity-word synonyms enlarges the preset synonym dictionary and improves the diversity of the artificially constructed data, while the prohibition mechanism for synonym relationships continuously improves the quality of the dictionary and thus ensures the quality of the artificially constructed data.
In an embodiment, as shown in fig. 5, in step S33, that is, performing synonym replacement on an entity word in original training data according to a preset synonym dictionary, an entity word category replacement probability, and an entity word replacement category, the method specifically includes the following steps:
s331: and determining whether the category of each entity word in the original training data belongs to the entity word replacement category.
After the synonym dictionary, the entity word class replacement probability and the entity word replacement class are preset, the class of each entity word in the original training data needs to be determined so as to determine whether each entity word in the original training data belongs to the entity word replacement class.
S332: and if the category of the entity word in the original training data belongs to the entity word replacement category, searching the synonym of the entity word in a preset synonym dictionary.
After determining whether each entity word in the original training data belongs to the entity word replacement category: if the category of an entity word belongs to the entity word replacement category, synonym replacement must be performed on that entity word, so all of its synonyms are looked up in the preset synonym dictionary for the subsequent replacement.
S333: it is determined whether synonyms are prohibited between the entity and the synonyms for the entity.
After the synonym of the entity word is in the preset synonym dictionary, whether the synonym relation between the entity word and each synonym is forbidden is determined.
S334: and if the synonyms of the entity words and the entity words are not forbidden from being in the synonym relationship, selecting one synonym from a preset synonym dictionary as a replacement word according to the entity word category replacement probability so as to replace the entity word with the replacement word.
After determining whether the synonyms of the entity words and the entity words are forbidden to be used for synonym relationship, if the synonyms of the entity words and the entity words are not forbidden to be used for synonym relationship, replacing the entity words with the corresponding synonyms according to the entity word category replacement probability.
S335: if the synonym relationship between the entity word and the synonym of the entity word is forbidden, the synonym is not used as the replacement word of the entity word.
After determining whether the synonym relationship between the entity word and the corresponding synonym is forbidden, if the synonym relationship between the entity word and the synonym of the entity word is forbidden, skipping the synonym, namely not taking the synonym as the replacement of the entity word.
For example, suppose the entity word replacement category comprises three categories, person name, place name and organization name, and the entity word category replacement probability is p_syn = [0.30, 0.60, 0.10]; that is, under the synonym replacement method, a person name in the original training data is replaced with probability 0.30, a place name with probability 0.60, and an organization name with probability 0.10. If no synonym of a person name in the preset synonym dictionary is prohibited, each person name in a sentence of the original training data has a 30% probability of being replaced by one of its synonyms from the preset synonym dictionary; if the synonym relationship with a certain synonym of the person name is prohibited, that synonym is skipped and not used, and the person name is replaced with one of its other synonyms.
In this embodiment, the entity word replacement category comprising the three categories of person name, place name and organization name, and the entity word category replacement probability p_syn = [0.30, 0.60, 0.10], are merely exemplary illustrations.
S336: and if the category of each entity word in the original training data does not belong to the entity word replacement category, not performing synonym replacement.
After determining whether each entity word in the original training data belongs to the entity word replacement category: if the category of an entity word does not belong to the entity word replacement category, synonym replacement is not required for that entity word, and the other data enhancement methods in the optimized parameter list can be executed.
In this embodiment, the step of performing synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category is defined as follows: determine whether the category of each entity word in the original training data belongs to the entity word replacement category; if it does, look up the entity word's synonyms in the preset synonym dictionary; determine whether the synonym relationship between the entity word and each synonym is prohibited; and if it is not prohibited, replace the entity word with a synonym according to the entity word category replacement probability.
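The replacement procedure of steps S331-S336 could be sketched as follows; all of the data structures here (the synonym dictionary as a mapping, the prohibited pairs as a set, the per-category probabilities) are assumptions made for illustration.

```python
import random

def synonym_replace(entity_words, categories, syn_dict, forbidden,
                    replace_categories, p_syn):
    """Sketch of steps S331-S336 for the entity words of one sentence."""
    replaced = list(entity_words)
    for i, (word, cat) in enumerate(zip(entity_words, categories)):
        if cat not in replace_categories:   # S336: category is not replaced
            continue
        # S332: look up synonyms; S333/S335: skip prohibited relations
        candidates = [s for s in syn_dict.get(word, [])
                      if (word, s) not in forbidden]
        # S334: replace with this category's replacement probability
        if candidates and random.random() < p_syn[cat]:
            replaced[i] = random.choice(candidates)
    return replaced
```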
In an embodiment, the data enhancement method further includes a random replacement method, a random deletion method, a random exchange method, and a long sentence construction method, as shown in fig. 6, after step S33, that is, after performing synonym replacement on entity words in the original training data, the method further includes the following steps:
s34: and in the optimization parameter list, determining the random replacement probability of the random replacement method and determining the random deletion probability of the random deletion method.
In this embodiment, the data enhancement method further includes a random replacement method and a random deletion method, and the random replacement probability of the random replacement method and the random deletion probability of the random deletion method need to be determined in the optimized parameter list, so as to perform conversion processing on the original training data according to the random replacement probability and the random deletion probability.
S35: and determining the random exchange probability of the random exchange method and determining the sentence length set by the long sentence constructing method.
In this embodiment, the data enhancement method further includes a random exchange method and a long sentence construction method, and in the optimized parameter list, the random exchange probability of the random exchange method needs to be determined, and the sentence length set by the long sentence construction method needs to be determined, so as to perform conversion processing on the original training data according to the random exchange probability and the sentence length set by the long sentence construction method.
S36: and carrying out entity word replacement on each sentence in the original training data according to the random replacement probability, and carrying out entity word exchange of the same sentence on each sentence in the original training data according to the random exchange probability.
After the random replacement probability of the random replacement method is determined and the random exchange probability of the random exchange method is determined, entity word replacement is carried out on each sentence in the original training data according to the random replacement probability, and the entity word exchange of the same sentence is carried out on each sentence in the original training data according to the random exchange probability.
For example, in the optimized parameter list, the random replacement probability of the random replacement method is β2 and the random exchange probability of the random exchange method is β3. Each token (entity word) of each sentence in the original training data is then replaced, with probability β2, by another token from a dictionary (which may be the preset synonym dictionary); the replacement token is selected from the dictionary uniformly at random, excluding other tokens already used as random replacements in the original training data. Meanwhile, in each sentence of the original training data, the i-th token and the j-th token exchange positions with probability β3.
S37: and deleting entity words of each sentence in the original training data according to the random deletion probability to obtain processing data.
After entity word replacement is performed on each sentence in the original training data according to the random replacement probability, and same-sentence entity word exchange is performed on each sentence according to the random exchange probability, entity word deletion is performed on each sentence in the original training data according to the random deletion probability to obtain the processed data.
For example, in the original training data, each token of each sentence is replaced with another token from the dictionary with probability β2; then, in each sentence, the i-th and the j-th tokens exchange positions with probability β3; finally, each token of each sentence is deleted with probability β4, and the processed data is obtained.
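The three token-level perturbations can be sketched in Python roughly as follows; the function name and the handling of edge cases (empty candidate lists, single-token sentences) are assumptions made for illustration, not details fixed by the patent.

```python
import random

def random_perturb(tokens, vocab, p_replace, p_swap, p_delete):
    """Apply random replacement (β2), random exchange (β3), and random
    deletion (β4) to the tokens of one sentence, in that order."""
    tokens = list(tokens)

    # β2: each token is replaced, with probability p_replace, by a token
    # drawn uniformly at random from the dictionary, excluding tokens
    # already used as replacements in this sentence.
    used = set()
    for i, tok in enumerate(tokens):
        if random.random() < p_replace:
            candidates = [w for w in vocab if w != tok and w not in used]
            if candidates:
                tokens[i] = random.choice(candidates)
                used.add(tokens[i])

    # β3: the i-th and j-th tokens of the same sentence exchange positions
    # with probability p_swap.
    for i in range(len(tokens)):
        if len(tokens) > 1 and random.random() < p_swap:
            j = random.randrange(len(tokens))
            tokens[i], tokens[j] = tokens[j], tokens[i]

    # β4: each token is deleted with probability p_delete.
    return [t for t in tokens if random.random() >= p_delete]
```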
S38: and splicing the sentences in the processed data so that each sentence, after the processing is finished, has the set sentence length.
After the processed data is obtained, the sentences in the processed data are spliced so that each sentence, after the processing is finished, has the set sentence length.
For example, suppose the sentence length set by the long sentence construction method is 100. The sentence lengths of the sentences in the processed data are counted to obtain the 90th percentile of sentence length; sentences whose lengths are less than or equal to the 90th percentile are paired two by two and spliced into longer sentences (the order of the two sentences in a pair is random), and the part of any spliced sentence exceeding length 100 is deleted, so that each sentence in the processed data has a length of 100.
In this embodiment, the sentence length of 100 set by the long sentence construction method and the pairwise splicing of sentences whose lengths are less than or equal to the 90th percentile are merely exemplary; in other embodiments, the set sentence length may take other values, and sentences at other percentiles may be paired and spliced, which is not repeated herein.
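A possible Python sketch of the long sentence construction step follows; the percentile computation and the pairing strategy are simplified assumptions consistent with the example above, and the function name is illustrative.

```python
import random

def build_long_sentences(sentences, target_len=100, percentile=90):
    """Pair short sentences two by two, splice each pair in random order,
    and truncate each spliced sentence to the set sentence length."""
    if not sentences:
        return []
    lengths = sorted(len(s) for s in sentences)
    # Sentence length at the chosen percentile (the 90th in the example).
    idx = min(len(lengths) - 1, int(len(lengths) * percentile / 100))
    cutoff = lengths[idx]

    short = [s for s in sentences if len(s) <= cutoff]
    random.shuffle(short)  # the order of the two sentences in a pair is random
    spliced = []
    for a, b in zip(short[::2], short[1::2]):
        spliced.append((a + b)[:target_len])  # delete the part beyond target_len
    return spliced
```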
In this embodiment, after synonym replacement is performed on the entity words in the original training data, the random replacement probability of the random replacement method, the random deletion probability of the random deletion method, the random exchange probability of the random exchange method, and the sentence length set by the long sentence construction method are determined in the optimized parameter list. Entity word replacement is then performed on each sentence in the original training data according to the random replacement probability, same-sentence entity word exchange is performed according to the random exchange probability, and entity word deletion is performed according to the random deletion probability to obtain the processed data, after which the sentences in the processed data are spliced so that each finished sentence has the set sentence length. This further refines the step of converting the original training data with each optimized parameter list: multiple data enhancement methods are applied to the conversion, which further increases the diversity of the artificially constructed data and ensures the accuracy of the recognition model training set.
In an embodiment, as shown in fig. 7, step S50, that is, determining whether a convergence model exists in the multiple recognition models according to the test result, specifically includes the following steps:
S51: a highest recognition score for recognizing each word in the test set is determined among the plurality of recognition models.
After the plurality of recognition models are tested using the original test data as a test set, the highest recognition score for recognizing each word in the test set among the plurality of recognition models is determined.
The score with which a recognition model recognizes each word in the test set is determined by the following formula:

$$\mathrm{score}_t = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

where $\mathrm{score}_t$ is the score of the recognition model for the t-th word in the test set, recall is the recall rate of the entity words, and precision is the precision with which the recognition model recalls entity words.
For example, suppose the number of recognition models is 3. After the original test data is used as the test set to test the three recognition models A, B, and C, their recognition scores for the t-th word in the test set are 0.6, 0.8, and 0.9 respectively, so the highest recognition score for recognizing the t-th word among the three models is 0.9.
In this embodiment, the number of the recognition models is 3, and the recognition scores of the tth word in the test set are respectively 0.6, 0.8 and 0.9, which are only exemplary illustrations, in other embodiments, the number of the recognition models may also be other values, and the recognition score of the tth word in the test set may also be other values, and details are not repeated herein.
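Selecting the highest recognition score can be expressed in a few lines of Python. The scoring helper below follows the formula as reconstructed above, and the names and example values are illustrative assumptions taken from the text.

```python
def word_score(recall, precision):
    """score_t for one word: combines recall and precision as in the formula
    above; returns 0.0 when both are zero to avoid division by zero."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Highest recognition score for the t-th word across models A, B and C,
# using the example values from the text.
model_scores = [0.6, 0.8, 0.9]
max_score_t = max(model_scores)  # 0.9
```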
S52: it is determined whether the highest recognition score satisfies a convergence condition.
After the highest recognition score for recognizing each word in the test set among the plurality of recognition models is determined, it is determined whether that highest recognition score satisfies a convergence condition.
S53: and if the highest recognition score meets the convergence condition, determining that the convergence model meeting the convergence condition exists in the plurality of recognition models, wherein the recognition model corresponding to the highest recognition score is the convergence model.
After it is determined whether the highest recognition score meets the convergence condition, if the highest recognition score meets the convergence condition, indicating that the recognition effect of an existing recognition model meets the requirement, it is determined that a convergence model meeting the convergence condition exists among the plurality of recognition models; the recognition model corresponding to the highest recognition score is the convergence model, and its corresponding optimization parameter list can be used as the target data enhancement parameter list.
S54: and if the highest recognition score does not meet the convergence condition, determining that no convergence model meeting the convergence condition exists in the plurality of recognition models.
After it is determined whether the highest recognition score meets the convergence condition, if the highest recognition score does not meet the convergence condition, indicating that the recognition effect of no recognition model meets the requirement, it is determined that no convergence model meeting the convergence condition exists among the recognition models; the optimization parameter lists of this round are unavailable, and the artificial fish swarm algorithm must be used for another round of iterative optimization.
In this embodiment, the highest recognition score for recognizing each word in the test set among the plurality of recognition models is determined, and whether that score meets the convergence condition is judged. If it does, a convergence model meeting the convergence condition is determined to exist among the plurality of recognition models, the recognition model corresponding to the highest recognition score being the convergence model; if it does not, no convergence model exists. This defines the process of determining whether a convergence model exists among the plurality of recognition models: the recognition effect of a recognition model on the test set serves as the food concentration (the fitness measure) of the artificial fish swarm algorithm, making the recognition effect on the test set the target for optimizing the data enhancement parameters, so that an effective data enhancement strategy is obtained at low cost.
In an embodiment, as shown in fig. 8, step S52, that is, determining whether the highest recognition score satisfies the convergence condition, specifically includes the following steps:
S521: determining a user-configured convergence parameter;
S522: determining a first highest recognition score for recognizing the t-th word in the test set in the plurality of recognition models;
S523: determining a second highest recognition score for recognizing the (t-1)-th word in the test set in the plurality of recognition models;
S524: subtracting the second highest recognition score from the first highest recognition score to obtain a highest recognition score difference;
S525: determining whether the ratio of the highest recognition score difference to the second highest recognition score is less than the convergence parameter;
S526: if the ratio of the highest recognition score difference to the second highest recognition score is less than the convergence parameter, determining that the highest recognition score meets the convergence condition;
S527: and if the ratio of the highest recognition score difference to the second highest recognition score is not less than the convergence parameter, determining that the highest recognition score does not satisfy the convergence condition.
After the highest recognition score for recognizing each word in the test set among the plurality of recognition models is determined, whether the highest recognition score satisfies the convergence condition is determined by the following formula:

$$\frac{\mathrm{maxscore}_t - \mathrm{maxscore}_{t-1}}{\mathrm{maxscore}_{t-1}} < \alpha$$

where $\mathrm{maxscore}_t$ is the highest recognition score among the plurality of recognition models for the t-th word in the test set, i.e., the first highest recognition score; $\mathrm{maxscore}_{t-1}$ is the highest recognition score for the (t-1)-th word in the test set, i.e., the second highest recognition score; and $\alpha$ is a user-configured convergence parameter (which may be 0.01).

In the formula, the highest recognition score difference $\mathrm{maxscore}_t - \mathrm{maxscore}_{t-1}$ is divided by the second highest recognition score to obtain a convergence value. If the convergence value is less than the convergence parameter $\alpha$, the highest recognition score is determined to satisfy the convergence condition; if it is not less than $\alpha$, the highest recognition score is determined not to satisfy the convergence condition.
In this embodiment, a user-configured convergence parameter is determined; the first highest recognition score for recognizing the t-th word in the test set and the second highest recognition score for recognizing the (t-1)-th word are determined among the plurality of recognition models; the second is subtracted from the first to obtain the highest recognition score difference; and whether the ratio of this difference to the second highest recognition score is less than the convergence parameter is determined. If the ratio is less than the convergence parameter, the highest recognition score satisfies the convergence condition; otherwise it does not. This specifies how to determine whether the highest recognition score satisfies the convergence condition and provides the judgment basis for deciding whether a model has converged.
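The convergence test reduces to a one-line relative-improvement check; a minimal Python sketch follows, in which the zero-division guard is an added assumption not stated in the text.

```python
def satisfies_convergence(max_score_t, max_score_prev, alpha=0.01):
    """True when (maxscore_t - maxscore_{t-1}) / maxscore_{t-1} < alpha,
    i.e. the highest recognition score has stopped improving noticeably.
    alpha is the user-configured convergence parameter (0.01 in the example)."""
    if max_score_prev == 0:
        return False  # assumption: no meaningful baseline yet, keep iterating
    return (max_score_t - max_score_prev) / max_score_prev < alpha
```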
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, a data enhancement device based on a deep learning model is provided, and the data enhancement device based on the deep learning model corresponds to the data enhancement method based on the deep learning model in the above embodiment one to one. As shown in fig. 9, the data enhancement apparatus based on the deep learning model includes an obtaining module 901, an initializing module 902, a converting module 903, a testing module 904, an outputting module 905, and an enhancing module 906. The functional modules are explained in detail as follows:
an obtaining module 901, configured to obtain original training data and original test data that are manually labeled, and obtain an original parameter list, where the original parameter list is formed by a data enhancement method and enhancement parameters corresponding to the data enhancement method;
an initialization module 902, configured to perform random initialization on the enhanced parameters in the original parameter list according to an artificial fish swarm algorithm to obtain multiple optimized parameter lists;
a conversion module 903, configured to convert the original training data by using each optimized parameter list to obtain corresponding artificially constructed data, and mix the original training data and the corresponding artificially constructed data to obtain a plurality of training sets;
a testing module 904, configured to respectively train to obtain multiple recognition models by using the multiple training sets, and test the multiple recognition models by using the original test data as a test set, so as to determine whether a model meeting a convergence condition exists in the multiple recognition models;
an output module 905, configured to output, if the model meeting the convergence condition exists in the multiple identification models, an optimization parameter list corresponding to the model meeting the convergence condition as a target data enhancement parameter list;
an enhancing module 906, configured to perform data enhancement on the original training data by using the target data enhancement parameter list, so as to obtain a training set of a named entity recognition model.
Further, the deep learning model-based data enhancement apparatus further includes a loop module 907, and after determining whether a model satisfying a convergence condition exists in the plurality of recognition models, the loop module 907 is specifically configured to:
if the model meeting the convergence condition does not exist in the plurality of identification models, randomly initializing the enhanced parameters in the original parameter list according to an artificial fish swarm algorithm again to obtain a plurality of optimized parameter lists after random initialization, and counting;
determining whether the times of random initialization of the enhanced parameters in the original parameter list is less than a preset time;
if the times of carrying out random initialization on the enhanced parameters in the original parameter list are not less than the preset times, stopping carrying out the random initialization on the enhanced parameters in the original parameter list;
and if the times of randomly initializing the enhanced parameters in the original parameter list are less than the preset times, training according to the randomly initialized optimized parameter list to obtain a plurality of new recognition models, testing the new recognition models to obtain the target data enhanced parameter list, and obtaining a training set of the named entity recognition model by using the target data enhanced parameter list.
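To show how the loop module 907 fits around the other modules, here is a hedged Python sketch of the outer optimization loop. The injected callables (`afsa_initialize`, `train_and_score`, `is_converged`) are placeholders standing in for the artificial fish swarm initialization, model training plus test-set scoring, and the convergence test; none of these names come from the patent. The preset number of initializations becomes the `max_inits` bound.

```python
from typing import Callable, List, Optional

def search_enhancement_params(
    original_params: dict,
    afsa_initialize: Callable[[dict], List[dict]],
    train_and_score: Callable[[dict], float],
    is_converged: Callable[[float], bool],
    max_inits: int = 10,
) -> Optional[dict]:
    """Outer loop: initialize optimized parameter lists with the artificial
    fish swarm algorithm, train and score one recognition model per list,
    and stop on convergence or after a preset number of initializations."""
    for _ in range(max_inits):  # counted re-initializations (preset limit)
        param_lists = afsa_initialize(original_params)  # random initialization
        scores = [train_and_score(p) for p in param_lists]
        best = max(range(len(scores)), key=scores.__getitem__)
        if is_converged(scores[best]):
            return param_lists[best]  # target data enhancement parameter list
    return None  # preset limit reached without a converged model
```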
Further, the data enhancement method includes a synonym replacement method, and the conversion module 903 is specifically configured to:
determining an enhanced parameter corresponding to the synonym replacement method in the optimized parameter list, wherein the enhanced parameter corresponding to the synonym replacement method comprises an entity word category replacement probability and an entity word replacement category;
acquiring a preset synonym dictionary which is pre-constructed by a user according to requirements, wherein entity words which are not forbidden in synonym relation in the same entity category are used as synonyms of each other in the preset synonym dictionary;
and carrying out synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category.
Further, the conversion module 903 is specifically further configured to:
determining whether the category of each entity word in the original training data belongs to the entity word replacement category;
if the category of the entity word in the original training data belongs to the entity word replacement category, searching the synonym of the entity word in the preset synonym dictionary;
determining whether a synonym relationship is prohibited between the entity word and a synonym of the entity word;
and if the synonyms of the entity words and the entity words are not forbidden in synonym relationship, selecting one synonym from the preset synonym dictionary as a replacement word according to the entity word category replacement probability so as to replace the entity word with the replacement word.
Further, the data enhancement method further includes a random replacement method, a random deletion method, a random exchange method, and a long sentence constructing method, and the conversion module 903 is further specifically configured to:
in the optimization parameter list, determining the random replacement probability of the random replacement method and determining the random deletion probability of the random deletion method;
determining the random exchange probability of the random exchange method and determining the sentence length set by the long sentence constructing method;
carrying out entity word replacement on each sentence in the original training data according to the random replacement probability, and carrying out entity word exchange of the same sentence on each sentence in the original training data according to the random exchange probability;
deleting entity words of each sentence in the original training data according to the random deletion probability to obtain processing data;
and splicing each sentence in the processing data to ensure that the length of the sentence after the processing is finished is the sentence length.
Further, the test module 904 is specifically configured to:
determining a highest recognition score for recognizing each word in the test set in the plurality of recognition models;
determining whether the highest recognition score satisfies the convergence condition;
if the highest recognition score meets the convergence condition, determining that the convergence model meeting the convergence condition exists in the plurality of recognition models, wherein the recognition model corresponding to the highest recognition score is the convergence model;
and if the highest recognition score does not meet the convergence condition, determining that the convergence model meeting the convergence condition does not exist in the plurality of recognition models.
Further, the testing module 904 is specifically further configured to:
determining a convergence parameter configured by a user;
determining a first highest recognition score for recognizing a tth word in the test set in the plurality of recognition models;
determining a second highest recognition score for recognizing the t-1 th word in the test set in the plurality of recognition models;
subtracting the second highest recognition score from the first highest recognition score to obtain a highest recognition score difference;
determining whether a ratio of the highest recognition score difference to the second highest recognition score is less than the convergence parameter;
determining that the highest recognition score satisfies the convergence condition if the ratio of the highest recognition score difference to the second highest recognition score is less than the convergence parameter;
determining that the highest recognition score does not satisfy the convergence condition if the ratio of the highest recognition score difference to the second highest recognition score is not less than the convergence parameter.
For specific definition of the data enhancement device based on the deep learning model, refer to the definition of the data enhancement method based on the deep learning model above, and are not described herein again. The various modules in the deep learning model-based data enhancement device described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer equipment is used for storing relevant data used by or produced by data enhancement methods such as original training data, original test data, an original parameter list, artificial construction data, an optimized parameter list and a plurality of recognition models. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of data enhancement based on a deep learning model.
In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the steps of the deep learning model-based data enhancement method are implemented.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the above-mentioned deep learning model-based data enhancement method.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A data enhancement method based on a deep learning model is characterized by comprising the following steps:
acquiring original training data and original test data which are marked manually, and acquiring an original parameter list, wherein the original parameter list is composed of a data enhancement method and enhancement parameters corresponding to the data enhancement method;
randomly initializing the enhanced parameters in the original parameter list according to an artificial fish swarm algorithm to obtain a plurality of optimized parameter lists;
converting the original training data by utilizing each optimization parameter list to obtain corresponding artificially constructed data, and mixing the original training data and the corresponding artificially constructed data to obtain a plurality of training sets;
respectively training by using the training sets to obtain a plurality of recognition models, and testing the recognition models by using the original test data as a test set to determine whether a model meeting a convergence condition exists in the recognition models;
if the model meeting the convergence condition exists in the plurality of identification models, outputting an optimization parameter list corresponding to the model meeting the convergence condition as a target data enhancement parameter list;
and performing data enhancement on the original training data by using the target data enhancement parameter list to obtain a training set of the named entity recognition model.
2. The deep learning model-based data enhancement method of claim 1, wherein after determining whether there is a model of the plurality of recognition models that satisfies a convergence condition, the method further comprises:
if the model meeting the convergence condition does not exist in the plurality of identification models, randomly initializing the enhanced parameters in the original parameter list according to an artificial fish swarm algorithm again to obtain a plurality of optimized parameter lists after random initialization, and counting;
determining whether the times of random initialization of the enhanced parameters in the original parameter list is less than a preset time;
if the times of carrying out random initialization on the enhanced parameters in the original parameter list are not less than the preset times, stopping carrying out the random initialization on the enhanced parameters in the original parameter list;
and if the times of randomly initializing the enhanced parameters in the original parameter list are less than the preset times, training according to the randomly initialized optimized parameter list to obtain a plurality of new recognition models, testing the new recognition models to obtain the target data enhanced parameter list, and obtaining a training set of the named entity recognition model by using the target data enhanced parameter list.
3. The deep learning model-based data enhancement method of claim 1, wherein the data enhancement method comprises a synonym replacement method, and the converting the original training data with each of the optimized parameter lists comprises:
determining an enhanced parameter corresponding to the synonym replacement method in the optimized parameter list, wherein the enhanced parameter corresponding to the synonym replacement method comprises an entity word category replacement probability and an entity word replacement category;
acquiring a preset synonym dictionary which is pre-constructed by a user according to requirements, wherein entity words which are not forbidden in synonym relation in the same entity category are used as synonyms of each other in the preset synonym dictionary;
and carrying out synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category.
4. The deep learning model-based data enhancement method of claim 3, wherein the performing synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word class replacement probability and the entity word replacement class comprises:
determining whether the category of each entity word in the original training data belongs to the entity word replacement category;
if the category of the entity word in the original training data belongs to the entity word replacement category, searching the synonym of the entity word in the preset synonym dictionary;
determining whether a synonym relationship is prohibited between the entity word and a synonym of the entity word;
and if the synonyms of the entity words and the entity words are not forbidden in synonym relationship, selecting one synonym from the preset synonym dictionary as a replacement word according to the entity word category replacement probability so as to replace the entity word with the replacement word.
5. The deep learning model-based data enhancement method of claim 4, wherein the data enhancement method further comprises a random replacement method, a random deletion method, a random exchange method and a long sentence construction method, and after synonym replacement is performed on the entity words in the original training data, the method further comprises:
in the optimization parameter list, determining the random replacement probability of the random replacement method and determining the random deletion probability of the random deletion method;
determining the random exchange probability of the random exchange method and determining the sentence length set by the long sentence constructing method;
carrying out entity word replacement on each sentence in the original training data according to the random replacement probability, and carrying out entity word exchange of the same sentence on each sentence in the original training data according to the random exchange probability;
deleting entity words of each sentence in the original training data according to the random deletion probability to obtain processing data;
and splicing each sentence in the processing data to ensure that the length of the sentence after the processing is finished is the sentence length.
6. The deep learning model-based data enhancement method of any one of claims 1-5, wherein the determining whether a converged model exists in the plurality of recognition models comprises:
determining a highest recognition score for recognizing each word in the test set in the plurality of recognition models;
determining whether the highest recognition score satisfies the convergence condition;
if the highest recognition score meets the convergence condition, determining that the convergence model meeting the convergence condition exists in the plurality of recognition models, wherein the recognition model corresponding to the highest recognition score is the convergence model;
and if the highest identification score does not meet the convergence condition, determining that the convergence model meeting the convergence condition does not exist in the plurality of identification models.
7. The deep learning model-based data enhancement method of claim 6, wherein the determining whether the highest recognition score satisfies the convergence condition comprises:
determining a convergence parameter configured by a user;
determining a first highest recognition score for recognizing a tth word in the test set in the plurality of recognition models;
determining a second highest recognition score for recognizing the t-1 th word in the test set in the plurality of recognition models;
subtracting the second highest recognition score from the first highest recognition score to obtain a highest recognition score difference;
determining whether a ratio of the highest recognition score difference to the second highest recognition score is less than the convergence parameter;
determining that the highest recognition score satisfies the convergence condition if the ratio of the highest recognition score difference to the second highest recognition score is less than the convergence parameter;
determining that the highest recognition score does not satisfy the convergence condition if the ratio of the highest recognition score difference to the second highest recognition score is not less than the convergence parameter.
8. A data enhancement device based on a deep learning model is characterized by comprising:
the system comprises an acquisition module, a data processing module and a data processing module, wherein the acquisition module is used for acquiring original training data and original test data which are marked manually and acquiring an original parameter list, and the original parameter list is composed of a data enhancement method and enhancement parameters corresponding to the data enhancement method;
the initialization module is used for randomly initializing the enhanced parameters in the original parameter list according to an artificial fish swarm algorithm so as to obtain a plurality of optimized parameter lists;
the conversion module is used for converting the original training data by utilizing each optimization parameter list to obtain corresponding artificial construction data, and mixing the original training data and the corresponding artificial construction data to obtain a plurality of training sets;
the testing module is used for respectively training the training sets to obtain a plurality of recognition models, and testing the recognition models by taking the original test data as a test set so as to determine whether a model meeting a convergence condition exists in the recognition models;
the output module is used for outputting an optimization parameter list corresponding to the model meeting the convergence condition as a target data enhancement parameter list if the model meeting the convergence condition exists in the plurality of identification models;
and the enhancement module is used for performing data enhancement on the original training data by utilizing the target data enhancement parameter list so as to obtain a training set of the named entity recognition model.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor when executing the computer program implements the steps of the deep learning model based data enhancement method according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the deep learning model-based data enhancement method according to any one of claims 1 to 7.
CN202110420110.3A 2021-04-19 2021-04-19 Data enhancement method, device, equipment and medium based on deep learning model Active CN113158652B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110420110.3A CN113158652B (en) 2021-04-19 2021-04-19 Data enhancement method, device, equipment and medium based on deep learning model
PCT/CN2021/096475 WO2022222224A1 (en) 2021-04-19 2021-05-27 Deep learning model-based data augmentation method and apparatus, device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110420110.3A CN113158652B (en) 2021-04-19 2021-04-19 Data enhancement method, device, equipment and medium based on deep learning model

Publications (2)

Publication Number Publication Date
CN113158652A true CN113158652A (en) 2021-07-23
CN113158652B CN113158652B (en) 2024-03-19

Family

ID=76868692

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110420110.3A Active CN113158652B (en) 2021-04-19 2021-04-19 Data enhancement method, device, equipment and medium based on deep learning model

Country Status (2)

Country Link
CN (1) CN113158652B (en)
WO (1) WO2022222224A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116911305A (en) * 2023-09-13 2023-10-20 中博信息技术研究院有限公司 Chinese address recognition method based on fusion model

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116244445B (en) * 2022-12-29 2023-12-12 中国航空综合技术研究所 Aviation text data labeling method and labeling system thereof
CN116451690A (en) * 2023-03-21 2023-07-18 麦博(上海)健康科技有限公司 Medical field named entity identification method
CN116501979A (en) * 2023-06-30 2023-07-28 北京水滴科技集团有限公司 Information recommendation method, information recommendation device, computer equipment and computer readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145965A (en) * 2018-08-02 2019-01-04 深圳辉煌耀强科技有限公司 Cell recognition method and device based on random forest disaggregated model
CN110516835A (en) * 2019-07-05 2019-11-29 电子科技大学 A kind of Multi-variable Grey Model optimization method based on artificial fish-swarm algorithm
CN111967604A (en) * 2019-05-20 2020-11-20 国际商业机器公司 Data enhancement for text-based AI applications

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11093707B2 (en) * 2019-01-15 2021-08-17 International Business Machines Corporation Adversarial training data augmentation data for text classifiers
CN110543906B (en) * 2019-08-29 2023-06-16 彭礼烨 Automatic skin recognition method based on Mask R-CNN model
CN111738004B (en) * 2020-06-16 2023-10-27 中国科学院计算技术研究所 Named entity recognition model training method and named entity recognition method
CN111832294B (en) * 2020-06-24 2022-08-16 平安科技(深圳)有限公司 Method and device for selecting marking data, computer equipment and storage medium
CN111738007B (en) * 2020-07-03 2021-04-13 北京邮电大学 Chinese named entity identification data enhancement algorithm based on sequence generation countermeasure network
CN112257441B (en) * 2020-09-15 2024-04-05 浙江大学 Named entity recognition enhancement method based on counterfactual generation

Also Published As

Publication number Publication date
CN113158652B (en) 2024-03-19
WO2022222224A1 (en) 2022-10-27

Similar Documents

Publication Publication Date Title
CN113158652A (en) Data enhancement method, device, equipment and medium based on deep learning model
CN106202380B (en) Method and system for constructing classified corpus and server with system
CN105205043A (en) Classification method and system of emotions of news readers
DE102021004562A1 (en) Modification of scene graphs based on natural language commands
CN116010581A (en) Knowledge graph question-answering method and system based on power grid hidden trouble shooting scene
CN112101042A (en) Text emotion recognition method and device, terminal device and storage medium
CN110502620B (en) Method, system and computer equipment for generating guide diagnosis similar problem pairs
CN113627159B (en) Training data determining method, device, medium and product of error correction model
CN116910185B (en) Model training method, device, electronic equipment and readable storage medium
CN109858035A (en) A kind of sensibility classification method, device, electronic equipment and readable storage medium storing program for executing
KR102269606B1 (en) Method, apparatus and computer program for analyzing new contents for solving cold start
CN117932058A (en) Emotion recognition method, device and equipment based on text analysis
CN112948582A (en) Data processing method, device, equipment and readable medium
WO2023245523A1 (en) Method and apparatus for generating training data
CN116822530A (en) Knowledge graph-based question-answer pair generation method
CN110162615A (en) A kind of intelligent answer method, apparatus, electronic equipment and storage medium
CN109657079A (en) A kind of Image Description Methods and terminal device
CN115098665A (en) Method, device and equipment for expanding session data
CN115512374A (en) Deep learning feature extraction and classification method and device for table text
CN113761874A (en) Event reality prediction method and device, electronic equipment and storage medium
Xia et al. Generating Questions Based on Semi-Automated and End-to-End Neural Network.
CN115238903B (en) Model compression method, system, electronic device and storage medium
CN117763128B (en) Man-machine interaction data processing method, server, storage medium and program product
CN115809329B (en) Method for generating abstract of long text
CN117216193B (en) Controllable text generation method and device based on large language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant