CN116127925A - Text data enhancement method and device based on destruction processing of text


Info

Publication number
CN116127925A
CN116127925A (application CN202310364625.5A)
Authority
CN
China
Prior art keywords
text
vector
module
model
restored
Prior art date
Legal status (an assumption, not a legal conclusion)
Granted
Application number
CN202310364625.5A
Other languages
Chinese (zh)
Other versions
CN116127925B (en)
Inventor
徐琳 (Xu Lin)
王芳 (Wang Fang)
暴宇健 (Bao Yujian)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Longzhi Digital Technology Service Co Ltd
Original Assignee
Beijing Longzhi Digital Technology Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Longzhi Digital Technology Service Co Ltd
Priority to CN202310364625.5A
Publication of CN116127925A
Application granted
Publication of CN116127925B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The disclosure relates to the technical field of text processing, and provides a text data enhancement method and device based on destruction processing of text. The method includes: constructing a text diffusion model comprising a forward module and a reverse module, where the forward module is built from a plurality of destruction processes, the reverse module is obtained by model training, and the two modules perform opposite operations; using the forward module to convert an original text in a text data set into a text vector and applying several successive destruction processes to the text vector to obtain a damage vector corresponding to the original text; using the reverse module to apply several successive recovery processes to the damage vector to obtain a restored vector corresponding to the original text, and converting the restored vector into text format to obtain a restored text corresponding to the original text; and generating a data-enhanced text data set from the original text and the restored text.

Description

Text data enhancement method and device based on destruction processing of text
Technical Field
The disclosure relates to the technical field of text processing, and in particular relates to a text data enhancement method and device based on destructive processing of text.
Background
When training a machine learning model for natural-language-understanding tasks, a shortage of labeled corpora is a persistent problem, and in the deep-learning era the demand for corpus volume is even more pressing. Often, however, enough corpus cannot be obtained in time for training, which places high demands on text data enhancement. Data enhancement constructs artificial data of the same type from existing data: the artificial data should be as similar as possible to the original data without being identical, so that training on it has a positive effect, improving model accuracy and reducing model overfitting. Existing text data enhancement methods based on destroying text can be broadly divided into two types. The first applies rule-based changes to the original text to create new samples; this often makes the enhanced sentence semantically unsmooth, or drifts far from the original sentence's meaning, which hurts the enhancement effect. The second trains a language model autoregressively and uses the trained language model for enhancement; but autoregressive training limits the model's understanding of the text, since the model only ever sees the first half of the text and cannot understand the content from a full-text perspective.
In implementing the disclosed concept, the inventors found at least the following technical problem in the related art: text obtained by conventional data enhancement methods deviates from the original text.
Disclosure of Invention
In view of the above, the embodiments of the present disclosure provide a text data enhancement method, apparatus, electronic device and computer readable storage medium based on the destruction processing of text, so as to solve the problem that in the prior art, the text obtained by the conventional data enhancement method deviates from the original text.
In a first aspect of the embodiments of the present disclosure, there is provided a text data enhancement method based on destruction processing of text, including: acquiring a text data set to be data-enhanced; constructing a text diffusion model comprising a forward module and a reverse module, where the forward module is built from a plurality of destruction processes, the reverse module is obtained by model training, and the two modules perform opposite operations; converting an original text in the text data set into a text vector with the forward module of the text diffusion model, and applying several successive destruction processes to the text vector to obtain a damage vector corresponding to the original text; applying several successive recovery processes to the damage vector with the reverse module of the text diffusion model to obtain a restored vector corresponding to the original text, and converting the restored vector into text format to obtain a restored text corresponding to the original text, where each recovery process corresponds to one destruction process and the two are mutually inverse processes; and generating a data-enhanced text data set from the original text and the restored text.
In a second aspect of the embodiments of the present disclosure, there is provided a text data enhancement apparatus based on destruction processing of text, including: an acquisition module configured to acquire a text data set to be data-enhanced; a construction module configured to construct a text diffusion model comprising a forward module and a reverse module, where the forward module is built from a plurality of destruction processes, the reverse module is obtained by model training, and the two modules perform opposite operations; a destruction module configured to convert an original text in the text data set into a text vector with the forward module of the text diffusion model, and to apply several successive destruction processes to the text vector to obtain a damage vector corresponding to the original text; a restoration module configured to apply several successive recovery processes to the damage vector with the reverse module of the text diffusion model to obtain a restored vector corresponding to the original text, and to convert the restored vector into text format to obtain a restored text corresponding to the original text, where each recovery process corresponds to one destruction process and the two are mutually inverse processes; and a generation module configured to generate a data-enhanced text data set from the original text and the restored text.
In a third aspect of the disclosed embodiments, an electronic device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the disclosed embodiments, a computer-readable storage medium is provided, which stores a computer program which, when executed by a processor, implements the steps of the above-described method.
Compared with the prior art, the embodiments of the present disclosure have the following beneficial effects. The embodiments acquire a text data set to be data-enhanced; construct a text diffusion model comprising a forward module and a reverse module, where the forward module is built from a plurality of destruction processes, the reverse module is obtained by model training, and the two modules perform opposite operations; convert an original text in the text data set into a text vector with the forward module and apply several successive destruction processes to the text vector to obtain a damage vector corresponding to the original text; apply several successive recovery processes to the damage vector with the reverse module to obtain a restored vector corresponding to the original text, and convert the restored vector into text format to obtain a restored text, where each recovery process corresponds to one destruction process and the two are mutually inverse processes; and generate a data-enhanced text data set from the original text and the restored text. With these technical means, the problem in the prior art that text obtained by conventional data enhancement methods deviates from the original text can be solved, and the text obtained by data enhancement conforms to the text data distribution of the original text.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required for the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 is a scene schematic diagram of an application scene of an embodiment of the present disclosure;
FIG. 2 is a flow diagram of a method for enhancing text data based on destruction of text provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a text data enhancement device based on destructive processing of text according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
A text data enhancement method and apparatus based on a destruction process of text according to embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 1 is a scene diagram of an application scene of an embodiment of the present disclosure. The application scenario may include terminal devices 101, 102, and 103, server 104, and network 105.
The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, and 103 are hardware, they may be various electronic devices having a display screen and supporting communication with the server 104, including but not limited to smartphones, tablets, laptop and desktop computers, etc.; when the terminal devices 101, 102, and 103 are software, they may be installed in the electronic device as above. Terminal devices 101, 102, and 103 may be implemented as multiple software or software modules, or as a single software or software module, as embodiments of the present disclosure are not limited in this regard. Further, various applications, such as a data processing application, an instant messaging tool, social platform software, a search class application, a shopping class application, and the like, may be installed on the terminal devices 101, 102, and 103.
The server 104 may be a server that provides various services, for example, a background server that receives a request transmitted from a terminal device with which communication connection is established, and the background server may perform processing such as receiving and analyzing the request transmitted from the terminal device and generate a processing result. The server 104 may be a server, a server cluster formed by a plurality of servers, or a cloud computing service center, which is not limited in the embodiments of the present disclosure.
The server 104 may be hardware or software. When the server 104 is hardware, it may be various electronic devices that provide various services to the terminal devices 101, 102, and 103. When the server 104 is software, it may be a plurality of software or software modules providing various services to the terminal devices 101, 102, and 103, or may be a single software or software module providing various services to the terminal devices 101, 102, and 103, which is not limited by the embodiments of the present disclosure.
The network 105 may be a wired network using coaxial cable, twisted pair wire, and optical fiber connection, or may be a wireless network that can implement interconnection of various communication devices without wiring, for example, bluetooth (Bluetooth), near field communication (Near Field Communication, NFC), infrared (Infrared), etc., which are not limited by the embodiments of the present disclosure.
The user can establish a communication connection with the server 104 via the network 105 through the terminal devices 101, 102, and 103 to receive or transmit information or the like. It should be noted that the specific types, numbers and combinations of the terminal devices 101, 102 and 103, the server 104 and the network 105 may be adjusted according to the actual requirements of the application scenario, which is not limited by the embodiment of the present disclosure.
Fig. 2 is a flow chart of a text data enhancement method based on destruction processing of text according to an embodiment of the present disclosure. The method of Fig. 2 may be performed by a terminal device or the server of Fig. 1, or by software running on them. As shown in Fig. 2, the text data enhancement method based on destruction processing of text includes:
s201, acquiring a text data set to be enhanced by data;
s202, a text diffusion model is built, wherein the text diffusion model comprises a forward module and a reverse module, the forward module is built according to various damage treatments, the reverse module is obtained by model training, and the forward module and the reverse module realize opposite operations;
s203, converting an original text in a text data set into a text vector by utilizing a forward module of a text diffusion model, and obtaining a damage vector corresponding to the original text by carrying out damage processing on the text vector continuously for a plurality of times;
s204, carrying out recovery processing on the damage vector for a plurality of times continuously by utilizing a reverse module of the text diffusion model to obtain a recovery vector corresponding to the original text, and converting the recovery vector into a text format to obtain a recovery text corresponding to the original text, wherein each recovery processing corresponds to one damage processing, and the recovery processing and the damage processing corresponding to each other are mutually reverse processes;
s205, generating a text data set with enhanced data by using the original text and the restored text.
A conventional text diffusion model has two processes: a diffusion process and a reverse diffusion process. In the embodiments of the present disclosure, the diffusion process is replaced by the constructed forward module and the reverse diffusion process by the reverse module; the resulting text diffusion model is more accurate than a conventional one, and the text obtained by data enhancement conforms to the original text. The forward module can be regarded as performing the various destruction processes, and the reverse module can be regarded as a neural network model trained to detect the forward module's operations and perform their inverse.
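The division of labor between the two modules can be sketched as follows. This is a minimal illustration rather than the patented implementation: the Gaussian noise scale `sigma` is an assumed default, `corruptions` is any list of destruction operations, and `restore_fn` stands in for the trained neural reverse model.

```python
import numpy as np

def forward_module(x, corruptions, rng, sigma=0.1):
    """Apply each destruction operation in turn, adding noise after each
    step; returns the damage vector and the sequence of operations used."""
    ops = []
    for op in corruptions:
        x = op(x) + rng.normal(0.0, sigma, size=x.shape)
        ops.append(op.__name__)
    return x, ops

def reverse_module(x, restore_fn, steps):
    """Stand-in for the trained reverse model (U-Net, ResNet or
    Transformer in the disclosure): apply `steps` successive recovery
    operations, one per destruction step."""
    for _ in range(steps):
        x = restore_fn(x)
    return x
```

In the disclosure the reverse model is obtained by training; `restore_fn` here is only a placeholder for that trained network.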
According to the technical solution provided by the embodiments of the present disclosure, a text data set to be data-enhanced is acquired; a text diffusion model comprising a forward module and a reverse module is constructed, where the forward module is built from a plurality of destruction processes, the reverse module is obtained by model training, and the two modules perform opposite operations; the forward module converts an original text in the text data set into a text vector and applies several successive destruction processes to it to obtain a damage vector corresponding to the original text; the reverse module applies several successive recovery processes to the damage vector to obtain a restored vector corresponding to the original text, which is converted into text format to obtain a restored text, where each recovery process corresponds to one destruction process and the two are mutually inverse processes; and a data-enhanced text data set is generated from the original text and the restored text. With these technical means, the problem in the prior art that text obtained by conventional data enhancement methods deviates from the original text can be solved, and the text obtained by data enhancement conforms to the text data distribution of the original text.
A destruction process comprising: pooling operations, blurring operations, and masking operations.
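The three destruction processes just listed can be sketched on a one-dimensional text vector with NumPy; the window sizes and masking probability below are illustrative assumptions, not values from the disclosure.

```python
import numpy as np

def pool_corrupt(v, k=2):
    """Pooling operation: average-pool the vector in windows of k, then
    repeat each mean back out so the length is kept but detail is lost."""
    n = len(v) - len(v) % k
    pooled = v[:n].reshape(-1, k).mean(axis=1)
    return np.concatenate([np.repeat(pooled, k), v[n:]])

def blur_corrupt(v, k=3):
    """Blurring operation: length-k moving average, edges padded by
    reflection so the output length matches the input."""
    padded = np.pad(v, k // 2, mode="reflect")
    return np.convolve(padded, np.ones(k) / k, mode="valid")

def mask_corrupt(v, p=0.15, rng=None):
    """Masking operation: zero out a random fraction p of the positions."""
    if rng is None:
        rng = np.random.default_rng()
    return v * (rng.random(v.shape) >= p)
```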
The forward module converts the original text into a text vector as follows: apply one-hot encoding to the original text to obtain a first encoding matrix, then map the first encoding matrix into embedding space with the word embedding matrix to obtain the text vector.
After converting the original text into the text vector, the forward module applies several successive destruction processes to the text vector.
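The encoding step can be sketched as follows, assuming the original text has already been tokenized into integer ids; `embedding` is a (vocab_size, dim) word embedding matrix.

```python
import numpy as np

def text_to_vector(token_ids, embedding):
    """One-hot encode the token ids into a (seq_len, vocab_size) first
    encoding matrix, then map it into a (seq_len, dim) text vector with
    the word embedding matrix."""
    vocab_size = embedding.shape[0]
    onehot = np.zeros((len(token_ids), vocab_size))
    onehot[np.arange(len(token_ids)), token_ids] = 1.0
    return onehot @ embedding
```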
The reverse module converts the restored vector into text format as follows: map the restored vector back through the word embedding matrix to obtain a second encoding matrix, then apply one-hot decoding, the inverse of one-hot encoding, to the second encoding matrix to obtain the restored text.
The reverse module thus applies several successive recovery processes to the damage vector and converts the resulting restored vector into the restored text.
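The decoding step can be sketched as the inverse mapping: each row of the restored vector is snapped to the nearest row of the word embedding matrix, which recovers a one-hot choice and hence a token id. The nearest-neighbor rule is an assumption about how the vector-to-text algorithm might work.

```python
import numpy as np

def vector_to_text(restored, embedding):
    """Map each (dim,) row of the restored vector to the nearest
    (vocab_size, dim) embedding row, then read off the token ids."""
    # pairwise distances, shape (seq_len, vocab_size)
    d = np.linalg.norm(restored[:, None, :] - embedding[None, :, :], axis=-1)
    return d.argmin(axis=1)
```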
Before using the reverse module of the text diffusion model to apply several successive recovery processes to the damage vector, obtain the restored vector corresponding to the original text, and convert it into text format to obtain the restored text, the method further includes: acquiring a training data set, and training a target model with it so that the target model can determine and execute the recovery processes corresponding to the destruction processes the forward module applies to training texts in the training data set, where the target model is a U-Net model, a ResNet model, or a Transformer model; the trained target model, followed by an algorithm that converts vectors into text, constitutes the reverse module.
The target model performs the successive recovery processes on the damage vector, and the vector-to-text algorithm converts the restored vector into the restored text; the reverse module is the target model with this algorithm attached after its output head.
After acquiring the text data set to be data-enhanced, the method further includes: using the forward module to apply several successive destruction processes to the text vector, adding noise at each step, to obtain the damage vector; using the target model in the reverse module to determine each destruction process applied and each noise added by the forward module, and applying the corresponding recovery processes and noise removal to the damage vector several times in succession to obtain the restored vector corresponding to the original text; here the target model is trained to determine and execute the inverse of the forward module's destruction processing and noise addition.
To improve the data enhancement effect, the forward module adds noise each time it applies a destruction process to the text vector. Note that because noise is added on top of the destruction processing in this embodiment, training the reverse module differs from the previous embodiment: the trained reverse module must determine and execute the inverse of both the destruction processes the forward module performed and the noise it added.
The forward module's destruction processing with added noise can be understood as follows:

x_t = F_t(x_{t-1}) + ε, ε ~ q(·)

where F_t is the t-th destruction process, the destruction processes including the pooling operation, the blurring operation, and the masking operation; x_t is the text vector after the t-th destruction process and noise addition, so that x_0 is the original text vector and x_N is the damage vector corresponding to it; N is a preset number; ε is noise; q(·) is the target distribution; ε ~ q(·) denotes that ε satisfies q(·); and σ is the variance of q(·).
The target distribution may be a Gaussian distribution, a uniform distribution, a t-distribution, etc.
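A forward step matching the formula above can be sketched as follows; the specific scalings used to give each target distribution q the variance σ², and the degrees of freedom for the t-distribution, are illustrative assumptions.

```python
import numpy as np

def sample_noise(shape, dist, sigma, rng):
    """Draw noise from the target distribution q: Gaussian, uniform, or
    Student-t, each scaled so its variance is sigma**2."""
    if dist == "gaussian":
        return rng.normal(0.0, sigma, size=shape)
    if dist == "uniform":
        w = sigma * np.sqrt(3.0)  # half-width giving variance sigma**2
        return rng.uniform(-w, w, size=shape)
    if dist == "t":
        df = 5  # assumed degrees of freedom
        return rng.standard_t(df, size=shape) * sigma * np.sqrt((df - 2) / df)
    raise ValueError(dist)

def forward_step(x_prev, corrupt_fn, dist, sigma, rng):
    """x_t = F_t(x_{t-1}) + eps, with eps ~ q."""
    return corrupt_fn(x_prev) + sample_noise(x_prev.shape, dist, sigma, rng)
```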
When training the target model, the following losses are calculated: the noise loss between the noises added by the forward module and the noises determined by the target model; the damage loss between the destruction processing performed by the forward module and the recovery processing performed by the target model; and the text loss between the original text and the restored text. The model parameters of the target model are updated according to the noise loss, the damage loss, and the text loss to complete the training of the target model.
In an alternative embodiment, the text loss between the original text and the restored text is calculated by the loss function:

L_text = -(1/N) Σ_{i=1}^{N} log P(p_i = q_i)

where N is the total number of words in the original text; p_i is the original word-list distribution at the position of the i-th word in the original text; q_i is the predicted word-list distribution at the position of the i-th word in the restored text; and P(p_i = q_i) is the probability that p_i and q_i are the same.
The noise loss between the plurality of noises added by the forward module and the plurality of noises determined by the target model is calculated using a mean squared error loss function; the damage loss between the destruction processing performed by the forward module and the recovery processing performed by the target model is calculated using a cross entropy loss function.
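The three losses can be sketched as follows, assuming the model's predictions are given as probability rows (one row per destruction step or word position); the 1e-12 floor inside the logarithms is a numerical-stability assumption.

```python
import numpy as np

def noise_loss(eps_added, eps_pred):
    """Mean squared error between the noise added by the forward module
    and the noise determined by the target model."""
    return float(np.mean((eps_added - eps_pred) ** 2))

def damage_loss(pred_op_probs, true_op_labels):
    """Cross entropy between the predicted destruction-operation
    probabilities (one row per step) and the true operation labels."""
    rows = np.arange(len(true_op_labels))
    return float(-np.mean(np.log(pred_op_probs[rows, true_op_labels] + 1e-12)))

def text_loss(pred_word_probs, original_ids):
    """Average negative log-probability that the predicted word-list
    distribution at each position matches the original word."""
    rows = np.arange(len(original_ids))
    return float(-np.mean(np.log(pred_word_probs[rows, original_ids] + 1e-12)))
```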
The destruction processes and recovery processes may be represented by labels: if the label of a destruction process matches the label of its corresponding recovery process, the target model's detection or prediction is correct; if they do not match, it is incorrect. In this case the cross entropy loss function can be used to calculate the damage loss between the label of the destruction process and that of its corresponding recovery process.
Optionally, the total loss in the training of the target model is calculated using the following formula:

L_total = Σ_{t=1}^{T} || G_t(F_t(x_{t-1}) + ε) - x_{t-1} ||_1

where G_t is the t-th recovery process and F_t is the t-th destruction process, G_t corresponding to F_t; x_{t-1} is the text vector after the (t-1)-th destruction process and noise addition, so that x_0 is the original text vector and x_N is the damage vector; N is a preset number; ε is noise satisfying the target distribution, whose variance is σ; ||·||_1 denotes the 1-norm; and T is the total number of destruction-and-noise steps, equal in value to N.
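The total loss formula can be sketched directly, pairing each recovery operation G_t with its destruction operation F_t; the callables below are placeholders for the actual forward operations and the trained target model.

```python
import numpy as np

def total_loss(x0, corrupts, restores, sigma, rng):
    """Sum over t of || G_t(F_t(x_{t-1}) + eps) - x_{t-1} ||_1, where each
    recovery G_t is paired with its corruption F_t and eps is noise with
    standard deviation sigma."""
    loss, x = 0.0, x0
    for F_t, G_t in zip(corrupts, restores):
        eps = rng.normal(0.0, sigma, size=x.shape)
        x_t = F_t(x) + eps
        loss += float(np.abs(G_t(x_t) - x).sum())
        x = x_t
    return loss
```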
The number of times the original text is run through the text diffusion model controls how many restored texts correspond to each original text, and thus the scale of the data-enhanced text data set.
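Controlling the dataset scale by the number of diffusion passes can be sketched as follows; `diffuse_once` is a hypothetical callable standing in for one full corrupt-then-restore cycle of the text diffusion model.

```python
def augment_dataset(texts, diffuse_once, passes=3):
    """Run the corrupt-then-restore cycle `passes` times per original
    text; each pass yields one restored text, so the augmented set grows
    to (1 + passes) times the original size."""
    augmented = list(texts)
    for text in texts:
        for _ in range(passes):
            augmented.append(diffuse_once(text))
    return augmented
```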
Any combination of the above optional solutions may be adopted to form an optional embodiment of the present application, which is not described herein in detail.
The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.
Fig. 3 is a schematic diagram of a text data enhancement device based on destructive processing of text according to an embodiment of the present disclosure. As shown in fig. 3, the text data enhancement device based on the destruction processing of text includes:
an acquisition module 301 configured to acquire a text data set to be data-enhanced;
a building module 302 configured to build a text diffusion model, where the text diffusion model includes a forward module and a reverse module, the forward module being built according to a plurality of destruction processes, the reverse module being obtained by performing model training, the forward module and the reverse module implementing opposite operations;
the destruction module 303 is configured to convert the original text in the text data set into a text vector by using a forward module of the text diffusion model, and obtain a destruction vector corresponding to the original text by performing destruction processing on the text vector continuously for a plurality of times;
the restoration module 304 is configured to apply several successive recovery processes to the damage vector with the reverse module of the text diffusion model to obtain a restored vector corresponding to the original text, and to convert the restored vector into text format to obtain a restored text corresponding to the original text, where each recovery process corresponds to one destruction process and the two are mutually inverse processes;
the generation module 305 is configured to generate a data-enhanced text data set using the original text and the restored text.
A conventional text diffusion model has two processes: a diffusion process and a reverse diffusion process. In the embodiments of the present disclosure, the diffusion process is replaced by the constructed forward module and the reverse diffusion process by the reverse module; the resulting text diffusion model is more accurate than a conventional one, and the text obtained by data enhancement conforms to the original text. The forward module can be regarded as performing the various destruction processes, and the reverse module can be regarded as a neural network model trained to detect the forward module's operations and perform their inverse.
According to the technical scheme provided by the embodiments of the present disclosure: a text data set to be data enhanced is acquired; a text diffusion model is constructed, the text diffusion model including a forward module and a reverse module, where the forward module is constructed from a plurality of destruction processes, the reverse module is obtained by model training, and the forward module and the reverse module implement opposite operations; the forward module of the text diffusion model converts an original text in the text data set into a text vector and performs destruction processing on the text vector multiple times in succession to obtain a corruption vector corresponding to the original text; the reverse module of the text diffusion model performs restoration processing on the corruption vector multiple times in succession to obtain a restored vector corresponding to the original text, and the restored vector is converted into text format to obtain a restored text corresponding to the original text, where each restoration process corresponds to one destruction process and the mutually corresponding restoration and destruction processes are inverses of each other; and the original text and the restored text are used to generate the data-enhanced text data set. This technical means solves the problem in the prior art that text obtained by conventional data enhancement methods deviates from the original text; the text obtained by this data enhancement conforms to the text data distribution of the original text.
The destruction processes include: pooling operations, blurring operations, and masking operations.
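As a rough sketch, the three destruction processes named above can be applied to a text vector of shape (sequence_length, embedding_dim). The kernel sizes and mask rate below are illustrative assumptions; the patent does not fix them.

```python
import numpy as np

rng = np.random.default_rng(0)

def pool_corrupt(x, k=2):
    # Average-pool every k consecutive token vectors, then repeat the
    # pooled value so the sequence length is preserved.
    n, d = x.shape
    pad = (-n) % k
    xp = np.concatenate([x, np.zeros((pad, d))]) if pad else x
    pooled = xp.reshape(-1, k, d).mean(axis=1)
    return np.repeat(pooled, k, axis=0)[:n]

def blur_corrupt(x, w=3):
    # Moving-average blur along the sequence axis, per embedding channel.
    kernel = np.ones(w) / w
    return np.stack([np.convolve(x[:, j], kernel, mode="same")
                     for j in range(x.shape[1])], axis=1)

def mask_corrupt(x, p=0.3):
    # Zero out a random fraction p of token vectors.
    keep = rng.random(x.shape[0]) >= p
    return x * keep[:, None]
```

All three corruptions keep the vector's shape, which is what lets them be chained over multiple steps.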
Optionally, the destruction module 303 is further configured to perform one-hot encoding processing on the original text to obtain a first encoding matrix, and to map the first encoding matrix using the word embedding matrix to obtain the text vector.
After the forward module converts the original text into the text vector, it performs destruction processing on the text vector multiple times in succession.
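A minimal sketch of the encoding step described above: one-hot encode the token ids of the original text into a first encoding matrix, then multiply by a word embedding matrix to obtain the text vector. The vocabulary size, embedding dimension, and random embedding values are illustrative assumptions.

```python
import numpy as np

def one_hot(token_ids, vocab_size):
    # First encoding matrix: one row per token, one column per vocab entry.
    m = np.zeros((len(token_ids), vocab_size))
    m[np.arange(len(token_ids)), token_ids] = 1.0
    return m

vocab_size, embed_dim = 10, 4
rng = np.random.default_rng(1)
E = rng.normal(size=(vocab_size, embed_dim))   # word embedding matrix

token_ids = [3, 7, 1]
first_encoding = one_hot(token_ids, vocab_size)
text_vector = first_encoding @ E               # one embedding row per token
```

Multiplying the one-hot matrix by the embedding matrix simply selects the embedding row of each token, which is why the two operations compose into a single mapping.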
Optionally, the restoration module 304 is further configured to map the restored vector according to the word embedding matrix to obtain a second encoding matrix, and to perform one-hot decoding processing on the second encoding matrix to obtain the restored text, where the one-hot decoding processing is the inverse of the one-hot encoding processing.
The reverse module restores the corruption vector multiple times in succession and converts the restored vector into the restored text.
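A sketch of the inverse conversion: map the restored vector back through the word embedding matrix to a second encoding matrix, then one-hot decode to token ids. The nearest-neighbour lookup used here is an assumption; the patent only states that the decoding is the inverse of the one-hot encoding.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab_size, embed_dim = 10, 4
E = rng.normal(size=(vocab_size, embed_dim))   # word embedding matrix
true_ids = [3, 7, 1]
# Simulate a near-perfect restoration of the three token embeddings.
restored = E[true_ids] + 0.01 * rng.normal(size=(3, embed_dim))

# Squared distance of each restored row to every embedding row; the
# row-wise minimum gives the "second encoding matrix" in one-hot form.
dists = ((restored[:, None, :] - E[None, :, :]) ** 2).sum(axis=-1)
second_encoding = (dists == dists.min(axis=1, keepdims=True)).astype(float)
decoded_ids = dists.argmin(axis=1)             # one-hot decode to token ids
```

With a restoration this close to the true embeddings, the decode recovers the original token ids.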
Optionally, the restoration module 304 is further configured to acquire a training data set and train the target model with the training data set, so that the target model can determine and execute the restoration processing corresponding to the destruction processing of the training texts in the training data set by the forward module, where the target model is a U-Net model, a ResNet model, or a Transformer model; the trained target model is then followed by an algorithm that converts vectors into text, yielding the reverse module.
The target model performs the restoration processing on the corruption vector multiple times, and the vector-to-text algorithm converts the restored vector into the restored text; the model output head of the target model, followed by this vector-to-text algorithm, constitutes the reverse module.
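A minimal sketch of this composition: a trained target model (a stand-in for the U-Net, ResNet, or Transformer) applied repeatedly, with a vector-to-text step appended after its output head. The identity "model" and the tiny vocabulary below are illustrative assumptions used only to make the structure concrete.

```python
import numpy as np

class ReverseModule:
    def __init__(self, target_model, embedding_matrix, id_to_word):
        self.model = target_model        # trained to undo one corruption step
        self.E = embedding_matrix
        self.id_to_word = id_to_word

    def restore(self, corrupted, steps):
        x = corrupted
        for _ in range(steps):           # successive restoration steps
            x = self.model(x)
        # Vector-to-text algorithm: nearest embedding row per position.
        d = ((x[:, None, :] - self.E[None, :, :]) ** 2).sum(axis=-1)
        return [self.id_to_word[i] for i in d.argmin(axis=1)]
```

For example, with an identity model and a one-hot embedding matrix, already-clean vectors decode directly back to words.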
Optionally, the destruction module 303 is further configured to perform destruction processing and noise addition on the text vector multiple times in succession using the forward module, to obtain the corruption vector.
Optionally, the restoration module 304 is further configured to determine, by using the target model in the reverse module, a destruction process and added noise performed on the text vector by the forward module each time, and perform, in succession, a restoration process and noise removal corresponding to the destruction process on the destroyed vector multiple times, to obtain a restoration vector corresponding to the original text; the target model is trained to determine and execute the destructive processing of the text vector and the added noise by the forward module.
To improve the data enhancement effect, the forward module adds noise each time it performs destruction processing on the text vector. It should be noted that, since the forward module adds noise on top of the destruction processing in the embodiments of the present disclosure, the training of the reverse module differs from the previous embodiment: the reverse module trained here must determine the destruction processing performed by the forward module on the text vector and the added noise, and execute the corresponding restoration and noise removal.
The forward module can be understood to perform the corruption process and add noise as follows:
x_t = F_t(x_{t-1}) + σ·ε,  ε ~ q()

wherein F_t is the t-th destruction process, the destruction processes including pooling operations, blurring operations, and masking operations; x_t is the text vector after the t-th destruction process and noise addition; when t equals 1, x_{t-1} is the original text vector; when t equals N, x_t is the corruption vector corresponding to the text vector; N is a preset number; ε is the noise; q() is a target distribution, including a Gaussian distribution, a uniform distribution, and a t distribution; ε ~ q() indicates that ε satisfies q(); and σ is the variance of q().
The target distribution may be a Gaussian distribution, a uniform distribution, a t distribution, or the like.
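A sketch of one reading of the forward recurrence above: x_t = F_t(x_{t-1}) + σ·ε with ε drawn from the target distribution q() (Gaussian here). Using masking as every F_t and σ = 0.1 are illustrative assumptions; the patent allows any of the destruction processes and target distributions at each step.

```python
import numpy as np

rng = np.random.default_rng(3)

def mask_step(x, p=0.2):
    # One possible F_t: random masking of token vectors.
    return x * (rng.random(x.shape[0]) >= p)[:, None]

def forward_process(x0, steps=5, sigma=0.1):
    x, noises = x0, []
    for _ in range(steps):
        eps = rng.normal(size=x.shape)   # eps satisfies q() (Gaussian here)
        x = mask_step(x) + sigma * eps   # destruction, then noise addition
        noises.append(eps)
    return x, noises
```

The returned list of noises is what the reverse module is later trained to recover, step by step.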
Optionally, the restoration module 304 is further configured to calculate a noise loss between the plurality of noises added by the forward module and the plurality of noises determined by the target model, calculate a damage loss between the damage processing performed by the forward module and the restoration processing performed by the target model, and calculate a text loss between the original text and the restored text; and updating model parameters of the target model according to the noise loss, the damage loss and the text loss so as to complete training of the target model.
Optionally, the restoration module 304 is further configured to calculate the text loss between the original text and the restored text by the following loss function:

L_text = −Σ_{i=1}^{N} log P(w_i = ŵ_i)

where N is the total number of words in the original text; w_i is the original word-list distribution at the position of the i-th word in the original text; ŵ_i is the predicted word-list distribution at the position of the i-th word in the restored text; and P(w_i = ŵ_i) is the probability that w_i and ŵ_i are the same.
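One possible reading of this text loss is a per-position cross entropy: the negative log-probability that the predicted word-list distribution at position i assigns to the original word at position i. The averaging over positions in this sketch is an assumption.

```python
import numpy as np

def text_loss(original_ids, predicted_dists):
    # predicted_dists: (N, vocab) word-list distribution per position.
    # Picks out, at each position, the probability of the original word.
    n = len(original_ids)
    p_same = predicted_dists[np.arange(n), original_ids]
    return -np.log(p_same).mean()
```

A perfect restoration (probability 1 on every original word) gives zero loss; a uniform distribution over a vocabulary of two words gives log 2.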
Optionally, the restoration module 304 is further configured to calculate the noise loss between the plurality of noises added by the forward module and the plurality of noises determined by the target model using a mean square error loss function, and to calculate the damage loss between the destruction processing performed by the forward module and the restoration processing performed by the target model using a cross entropy loss function.
The destruction processes and restoration processes may be represented by labels: if the label of a destruction process is identical to the label of its corresponding restoration process, the target model's detection or prediction is correct; if the labels differ, it is incorrect. In this case, the cross entropy loss function may be used to calculate the damage loss between the label of each destruction process and the label of its corresponding restoration process.
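A sketch of the two losses just described: cross entropy between the label of each destruction process (here assuming 0 = pooling, 1 = blurring, 2 = masking) and the target model's predicted restoration label, plus mean squared error between added and predicted noise. The label encoding is an illustrative assumption.

```python
import numpy as np

def damage_loss(true_labels, predicted_logits):
    # Numerically stable cross entropy over per-step corruption labels.
    z = predicted_logits - predicted_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(true_labels)), true_labels].mean()

def noise_loss(added_noise, predicted_noise):
    # Mean squared error between added and predicted noise.
    a, b = np.asarray(added_noise), np.asarray(predicted_noise)
    return ((a - b) ** 2).mean()
```

When the target model predicts the noise exactly, the noise loss is zero; when its logits strongly favour the correct corruption label, the damage loss approaches zero.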
Optionally, the restoration module 304 is further configured to calculate the total loss in the training of the target model using the following formula:

L = Σ_{t=1}^{T} ‖ G_t(F_t(x_{t-1}) + σ·ε) − x_{t-1} ‖_1

wherein G_t is the t-th restoration process; F_t is the t-th destruction process; G_t corresponds to F_t; x_{t-1} is the text vector after the (t-1)-th destruction process and noise addition; when t equals 1, x_{t-1} is the original text vector; when t equals N, the input to G_t is the corruption vector; N is the preset number; ε is the noise and satisfies the target distribution; σ is the variance of the target distribution; ‖·‖_1 denotes the 1-norm; and T is the total number of times destruction processing and noise addition are performed, equal in value to N.
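A sketch of one reading of this total loss: the sum over steps t of the 1-norm between the t-th restoration G_t applied to the corrupted vector and the previous-step vector x_{t-1}. Representing the forward trajectory as a plain list of vectors and the restorations as callables is an illustrative simplification.

```python
import numpy as np

def total_loss(xs, restorations):
    # xs: [x_0, x_1, ..., x_T] produced by the forward process.
    # restorations: [G_1, ..., G_T], G_t paired with destruction F_t.
    loss = 0.0
    for t in range(1, len(xs)):
        loss += np.abs(restorations[t - 1](xs[t]) - xs[t - 1]).sum()
    return loss
```

A restoration that exactly inverts each step drives the loss to zero; doing nothing leaves the full 1-norm gap between successive vectors.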
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present disclosure.
Fig. 4 is a schematic diagram of an electronic device 4 provided by an embodiment of the present disclosure. As shown in fig. 4, the electronic apparatus 4 of this embodiment includes: a processor 401, a memory 402 and a computer program 403 stored in the memory 402 and executable on the processor 401. The steps of the various method embodiments described above are implemented by processor 401 when executing computer program 403. Alternatively, the processor 401, when executing the computer program 403, performs the functions of the modules/units in the above-described apparatus embodiments.
The electronic device 4 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The electronic device 4 may include, but is not limited to, a processor 401 and a memory 402. It will be appreciated by those skilled in the art that fig. 4 is merely an example of the electronic device 4 and is not limiting of the electronic device 4 and may include more or fewer components than shown, or different components.
The processor 401 may be a central processing unit (Central Processing Unit, CPU) or other general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like.
The memory 402 may be an internal storage unit of the electronic device 4, for example, a hard disk or a memory of the electronic device 4. The memory 402 may also be an external storage device of the electronic device 4, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like, which are provided on the electronic device 4. Memory 402 may also include both internal storage units and external storage devices of electronic device 4. The memory 402 is used to store computer programs and other programs and data required by the electronic device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present disclosure may implement all or part of the flow of the methods of the above-described embodiments by instructing related hardware through a computer program, and the computer program may be stored in a computer readable storage medium, where the computer program, when executed by a processor, may implement the steps of the method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content of the computer readable medium can be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, the computer readable medium does not include electrical carrier signals and telecommunication signals according to legislation and patent practice.
The above embodiments are merely for illustrating the technical solution of the present disclosure, and are not limiting thereof; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the disclosure, and are intended to be included in the scope of the present disclosure.

Claims (10)

1. A text data enhancement method based on destructive processing of text, comprising:
acquiring a text data set to be enhanced by data;
constructing a text diffusion model, wherein the text diffusion model comprises a forward module and a reverse module, the forward module is constructed according to various damage treatments, the reverse module is obtained by model training, and the forward module and the reverse module realize opposite operations;
converting an original text in the text data set into a text vector by utilizing a forward module of the text diffusion model, and obtaining a damage vector corresponding to the original text by carrying out damage processing on the text vector continuously for a plurality of times;
the reverse module of the text diffusion model is utilized to continuously restore the damaged vector for a plurality of times to obtain a restored vector corresponding to the original text, the restored vector is converted into a text format to obtain a restored text corresponding to the original text, each restoring process corresponds to one damage process, and the restoring process and the damage process which correspond to each other are mutually inverse processes;
and generating the text data set with the enhanced data by using the original text and the restored text.
2. The method of claim 1, wherein the destruction process comprises: pooling operations, blurring operations, and masking operations.
3. The method according to claim 1, characterized in that it comprises:
converting the original text into the text vector using the forward module: performing one-hot encoding processing on the original text to obtain a first encoding matrix, and mapping the first encoding matrix using a word embedding matrix to obtain the text vector;
converting the restored vector into a text format by using the reversing module to obtain the restored text: mapping the restored vector according to the word embedding matrix to obtain a second encoding matrix; and performing one-hot decoding processing on the second encoding matrix to obtain the restored text, wherein the one-hot decoding processing is the inverse process of the one-hot encoding processing.
4. The method according to claim 1, wherein the restoring process is performed on the destruction vector by using an inversion module of the text diffusion model multiple times in succession, so as to obtain a restored vector corresponding to the original text, and the restoring vector is converted into a text format, so that before the restored text corresponding to the original text is obtained, the method further includes:
acquiring a training data set, and training a target model by using the training data set, so that the target model can determine and execute the restoration processing corresponding to the destruction processing of training texts in the training data set by the forward module, wherein the target model is a U-Net model, a ResNet model, or a Transformer model;
and the trained target model is followed by an algorithm for converting the vector into text, so that the reverse module is obtained.
5. The method of claim 1, wherein after the acquiring the text data set to be data enhanced, the method further comprises:
carrying out destruction processing and noise addition on the text vector for a plurality of times by utilizing the forward module to obtain the destruction vector;
determining the destruction processing and the added noise of the text vector by the forward module each time by utilizing a target model in the reverse module, and continuously performing the recovery processing and the noise removal corresponding to the destruction processing on the destruction vector for a plurality of times to obtain a recovery vector corresponding to the original text;
the object model has been trained to determine and perform the corruption process and added noise of the text vector by the forward module.
6. The method of claim 5, wherein the following losses are calculated when training the target model:
calculating noise loss between a plurality of noise added by the forward module and a plurality of noise determined by the target model, calculating damage loss between the damage processing performed by the forward module and the recovery processing performed by the target model, and calculating text loss between the original text and the recovery text;
and updating model parameters of the target model according to the noise loss, the damage loss and the text loss so as to complete training of the target model.
7. The method according to claim 6, comprising:
calculating the text loss between the original text and the restored text by the following loss function:

L_text = −Σ_{i=1}^{N} log P(w_i = ŵ_i)

where N is the total number of words in the original text; w_i is the original word-list distribution at the position of the i-th word in said original text; ŵ_i is the predicted word-list distribution at the position of the i-th word in said restored text; and P(w_i = ŵ_i) is the probability that w_i and ŵ_i are the same;
calculating noise losses between the plurality of noise added by the forward module and the plurality of noise determined by the target model by using a mean square error loss function;
a cross entropy loss function is used to calculate a loss of corruption between the corruption process by the forward module and the restoration process by the target model.
8. A text data enhancement device based on a destruction process of a text, comprising:
an acquisition module configured to acquire a text data set to be data enhanced;
a building module configured to build a text diffusion model, wherein the text diffusion model comprises a forward module and a reverse module, the forward module is built according to various damage processes, the reverse module is obtained by performing model training, and the forward module and the reverse module realize opposite operations;
the destruction module is configured to convert the original text in the text data set into a text vector by utilizing the forward module of the text diffusion model, and obtain a destruction vector corresponding to the original text by carrying out destruction processing on the text vector continuously for a plurality of times;
the restoring module is configured to utilize the reversing module of the text diffusion model to continuously restore the damaged vector for a plurality of times to obtain a restored vector corresponding to the original text, and convert the restored vector into a text format to obtain a restored text corresponding to the original text, wherein each restoring process corresponds to one damage process, and the restoring process and the damage process corresponding to each other are inverse processes;
a generation module configured to generate the data-enhanced text dataset using the original text and the restored text.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.
CN202310364625.5A 2023-04-07 2023-04-07 Text data enhancement method and device based on destruction processing of text Active CN116127925B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310364625.5A CN116127925B (en) 2023-04-07 2023-04-07 Text data enhancement method and device based on destruction processing of text


Publications (2)

Publication Number Publication Date
CN116127925A true CN116127925A (en) 2023-05-16
CN116127925B CN116127925B (en) 2023-08-29

Family

ID=86310312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310364625.5A Active CN116127925B (en) 2023-04-07 2023-04-07 Text data enhancement method and device based on destruction processing of text

Country Status (1)

Country Link
CN (1) CN116127925B (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114398890A (en) * 2022-01-19 2022-04-26 中国平安人寿保险股份有限公司 Text enhancement method, device, equipment and storage medium
CN114417794A (en) * 2022-03-29 2022-04-29 北京大学 Training method and device for scale problem generation model and computer equipment
CN114861600A (en) * 2022-07-07 2022-08-05 之江实验室 NER-oriented Chinese clinical text data enhancement method and device
CN115563335A (en) * 2022-09-28 2023-01-03 深圳市欢太科技有限公司 Model training method, image-text data processing device, image-text data processing equipment and image-text data processing medium
CN115563281A (en) * 2022-10-13 2023-01-03 深圳须弥云图空间科技有限公司 Text classification method and device based on text data enhancement
US20230067841A1 (en) * 2021-08-02 2023-03-02 Google Llc Image Enhancement via Iterative Refinement based on Machine Learning Models


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI, Jinghui; SUN, Lina; LI, Jing: "Research on Sentiment Analysis Method of Review Text Based on LSTM", Microcomputer Applications, no. 05, pages 5 - 8 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117312833A (en) * 2023-11-29 2023-12-29 北京冠群信息技术股份有限公司 Data identification method and system applied to digital asset environment
CN117312833B (en) * 2023-11-29 2024-02-27 北京冠群信息技术股份有限公司 Data identification method and system applied to digital asset environment

Also Published As

Publication number Publication date
CN116127925B (en) 2023-08-29

Similar Documents

Publication Publication Date Title
US10650102B2 (en) Method and apparatus for generating parallel text in same language
CN110377740B (en) Emotion polarity analysis method and device, electronic equipment and storage medium
CN109829164B (en) Method and device for generating text
CN111950692B (en) Robust output coding based on hamming distance for improved generalization
CN112270200B (en) Text information translation method and device, electronic equipment and storage medium
CN113947095B (en) Multilingual text translation method, multilingual text translation device, computer equipment and storage medium
CN111368551A (en) Method and device for determining event subject
CN116127925B (en) Text data enhancement method and device based on destruction processing of text
CN111915086A (en) Abnormal user prediction method and equipment
CN113779277A (en) Method and device for generating text
CN114358023B (en) Intelligent question-answer recall method, intelligent question-answer recall device, computer equipment and storage medium
CN117114063A (en) Method for training a generative large language model and for processing image tasks
CN113408507B (en) Named entity identification method and device based on resume file and electronic equipment
CN114780701A (en) Automatic question-answer matching method, device, computer equipment and storage medium
CN112307738B (en) Method and device for processing text
CN110852057A (en) Method and device for calculating text similarity
CN116108810A (en) Text data enhancement method and device
CN117252250A (en) Large model pre-training method and device
CN110929512A (en) Data enhancement method and device
CN111611420B (en) Method and device for generating image description information
CN111709784B (en) Method, apparatus, device and medium for generating user retention time
CN111784377B (en) Method and device for generating information
CN114792086A (en) Information extraction method, device, equipment and medium supporting text cross coverage
CN107609645B (en) Method and apparatus for training convolutional neural network
CN111475618A (en) Method and apparatus for generating information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant