CN113886885A

CN113886885A - Data desensitization method, data desensitization device, equipment and storage medium

Info

Publication number: CN113886885A
Application number: CN202111229481.XA
Authority: CN
Inventors: 郑旭如; 赵盟盟; 王磊
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2021-10-21
Filing date: 2021-10-21
Publication date: 2022-01-04
Also published as: WO2023065632A1

Abstract

The application relates to the field of artificial intelligence, in particular to a data desensitization method, a data desensitization device, equipment and a storage medium, wherein the method comprises the following steps: acquiring user data, and performing information identification on the user data based on a pre-trained key information identification model to obtain key information; preprocessing the key information to obtain discrete variables corresponding to the key information, wherein the preprocessing comprises data discretization processing or data normalization processing; based on a conditional loss function, carrying out conditional random sampling processing on the discrete variable to obtain a conditional embedded vector and an implicit vector, and splicing the conditional embedded vector and the implicit vector to obtain a spliced vector; and inputting the splicing vector into a pre-trained generator for desensitization treatment to obtain desensitization data. Therefore, desensitization data cannot be reversely cracked easily, privacy data are prevented from being leaked, and safety of the desensitization data is improved.

Description

Data desensitization method, data desensitization device, equipment and storage medium

Technical Field

The present application relates to the field of artificial intelligence, and in particular, to a data desensitization method, a data desensitization apparatus, a computer device, and a storage medium.

Background

In the big data era, the frequency of attacks on data is increasing day by day, and the attack modes are becoming rich. Data desensitization is an effective approach to address data security issues and risks. Data desensitization refers to data deformation of key information or personal information according to preset rules or transformation, so that the key information cannot be identified or hidden by the personal identity. The desensitization mode of the structured data which is common at present is based on an anonymization technology or a scrambling technology.

In the structural data desensitization method based on the anonymization technology or the scrambling technology, the desensitized data and the original data have a one-to-one mapping relationship, so that the desensitized data is easy to reverse crack, the original data is easily restored, and further, the privacy information in the original data is leaked, and the data security is poor.

Disclosure of Invention

The application provides a data desensitization method, a data desensitization device, computer equipment and a storage medium, and aims to solve the problem that privacy information is easily revealed due to the fact that an existing desensitization mode is easily reverse cracked.

To achieve the above object, the present application provides a data desensitization method, including:

acquiring user data, and performing information identification on the user data based on a pre-trained key information identification model to obtain key information;

preprocessing the key information to obtain discrete variables corresponding to the key information, wherein the preprocessing comprises data discretization processing or data normalization processing;

based on a conditional loss function, carrying out conditional random sampling processing on the discrete variable to obtain a conditional embedded vector and an implicit vector, and splicing the conditional embedded vector and the implicit vector to obtain a spliced vector;

and inputting the splicing vector into a pre-trained generator for desensitization treatment to obtain desensitization data.

To achieve the above object, the present application also provides a data desensitization apparatus, including:

the key information extraction module is used for acquiring user data and carrying out information identification on the user data based on a pre-trained key information identification model to obtain key information;

the information processing module is used for preprocessing the key information to obtain a discrete variable corresponding to the key information, and the preprocessing comprises data discretization processing or data normalization processing;

the vector splicing module is used for carrying out conditional random sampling processing on the discrete variable based on a conditional loss function to obtain a conditional embedded vector and an implicit vector, and splicing the conditional embedded vector and the implicit vector to obtain a spliced vector;

and the data desensitization module is used for inputting the splicing vector into a pre-trained generator for desensitization treatment to obtain desensitization data.

In addition, to achieve the above object, the present application also provides a computer device comprising a memory and a processor; the memory for storing a computer program; the processor is configured to execute the computer program and implement the data desensitization method according to any one of the embodiments of the present application when executing the computer program.

In addition, to achieve the above object, the present application also provides a computer-readable storage medium storing a computer program, which when executed by a processor causes the processor to implement the data desensitization method according to any one of the embodiments of the present application.

The data desensitization method, the data desensitization device, the equipment and the storage medium disclosed by the embodiment of the application generate the splicing vector by extracting the key information of the user data and the discrete variable of the key information, perform desensitization processing on the splicing vector by using the pre-trained generator to obtain desensitization data, so that the desensitization data cannot be easily cracked reversely, the privacy data are prevented from being leaked, and the safety of the desensitization data is improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

Fig. 1 is a schematic view of a data desensitization method according to an embodiment of the present application;

fig. 2 is a schematic flow chart of a data desensitization method according to an embodiment of the present application;

FIG. 3 is a schematic block diagram of a data desensitization apparatus provided in an embodiment of the present application;

fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation. In addition, although the division of the functional blocks is made in the device diagram, in some cases, it may be divided in blocks different from those in the device diagram.

The term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

Data desensitization is a data processing technique that can reduce or remove the sensitivity of data by processing the data. By adopting a data desensitization technology, the risk and harm of data leakage can be reduced, and the privacy of user data is effectively protected. In the internet or medical field, users can store, check and share personal medical health data through a personal digital space, but the personal medical data can face the risk of leakage of user medical sensitive information in the processes of online medical observation, online medicine purchase, clinic appointment and the like, and the data of the users in the medical industry has extremely high authenticity and sensitivity, and once the personal sensitive information of the users is leaked, potential life threat can be caused to the users. With data desensitization, information in the personal digital space can be used for business related analysis and processing while avoiding leakage of user data.

The desensitization mode of the structured data which is common at present is based on an anonymization technology or a scrambling technology. Common anonymization technologies include k-anonymization, l-diversity, t-closeness and the like, which achieve the desensitization effect by generalizing a quasi-identifier of a single record so that the data cannot be distinguished in the whole data set. The scrambling technique is based on adding noise to the recordings, for example adding additive or multiplicative noise to successive values, to achieve a desensitizing effect.

In the structural data desensitization method based on the anonymization technology or the scrambling technology, the desensitized data and the original data have a one-to-one mapping relationship, so that the desensitized data has a risk of being reversed, and the desensitized data is usually different from the original data greatly and loses research value.

In order to solve the problems, the data desensitization method can be applied to a server, specifically applied to multiple fields such as finance and medical treatment, and comprises the steps of continuously carrying out iterative updating on generator parameters to obtain a pre-trained generator, extracting sensitive information of user data, carrying out desensitization processing on the sensitive information by using the pre-trained generator to obtain desensitization data, so that the desensitization data cannot be easily and reversely cracked, privacy data are guaranteed not to be leaked, and safety of the desensitization data is improved.

The server may be, for example, a single server or a server cluster. However, for ease of understanding, the following embodiments will be described in detail with respect to a data desensitization method applied to a server.

Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.

As shown in fig. 1, the data desensitization method provided in the embodiment of the present application may be applied to an application environment shown in fig. 1. The application environment includes a terminal device 110 and a server 120, wherein the terminal device 110 can communicate with the server 120 through a network. Specifically, the server 120 obtains the user data sent by the terminal device 110, and the server 120 performs key information extraction, information processing, and desensitization processing on the user data to generate desensitization data, and sends the desensitization data to the terminal device 110, so as to implement data desensitization processing. The server 120 may be an independent server, or may be a cloud server that provides basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, Network service, cloud communication, middleware service, domain name service, security service, Content Delivery Network (CDN), big data and an artificial intelligence platform. The terminal device 110 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.

Referring to fig. 2, fig. 2 is a schematic flow chart of a data desensitization method according to an embodiment of the present application. The data desensitization method can be applied to a server, so that desensitization data cannot be easily and reversely cracked, privacy data are prevented from being leaked, and safety of the desensitization data is improved.

As shown in fig. 2, the data desensitization method includes steps S101 to S104.

S101, obtaining user data, and carrying out information recognition on the user data based on a pre-trained key information recognition model to obtain key information.

The user data is data containing key information, and may specifically include medical data such as medical record data, financial data such as bank account data, and the like. The key information identification model may be a pre-trained BERT-CRF model based on an attention mechanism for extracting key information in user data. The key information is information that the user needs to desensitize, and is generally privacy information of the user, for example, the key information may be height and weight information in medical record data, or account balance information and investment information in bank account data. It should be noted that all sensitive information or private information can be used as the key information.

The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.

The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.

In some embodiments, the user data is subjected to word segmentation processing to obtain a plurality of words; extracting features of each word segmentation to obtain embedded features of each word segmentation; performing word sense prediction according to the embedding characteristics of each participle to obtain a word sense corresponding to each participle; and screening the multiple participles according to the word senses corresponding to the participles to obtain key information. Therefore, the key information can be accurately extracted, and the accuracy and safety of desensitization data generation are improved.

Wherein the embedding features are word embedding features, position embedding features and segmentation embedding features. The word embedding features are vector representations of each word segmentation, the position embedding is vector representation of each word segmentation position, and the segmentation embedding features are used for distinguishing two different words.

Specifically, the word segmentation process may be performed on the user data based on a word segmentation algorithm to obtain a plurality of words, where the word segmentation algorithm may be an algorithm such as a forward maximum matching method, a reverse maximum matching method, a word segmentation algorithm based on a hidden markov model, or a word segmentation algorithm based on a conditional random field.

For example, a segmentation algorithm based on a hidden markov model can perform segmentation on a medical record text of user data such as medical record data to obtain a plurality of corresponding segments such as frequency of urination, hungry tendency, anxiety, tremor and the like, wherein the medical record text of the patient data indicates that the patient has symptoms such as frequent urination, hungry tendency, anxiety, tremor and the like, and is suspected to be diabetic.

Specifically, feature extraction may be performed on each of the segmented words to obtain an embedding feature of each of the segmented words, word sense prediction may be performed on each of the segmented words according to the embedding feature of each of the segmented words based on a word sense prediction model to obtain a word sense prediction result corresponding to each of the segmented words, and the plurality of segmented words may be screened based on the word sense prediction result corresponding to each of the segmented words to obtain the key information. Therefore, text features can be mined to the maximum extent, richer Word expressions are extracted, and the defects that context information cannot be dynamically expressed and Word ambiguity cannot be solved in traditional Word vectors such as Word2vec, Glove and the like are overcome. Therefore, the similarity between each participle and the preset standard sensitive participle can be quickly obtained, and the corresponding key information can be quickly obtained.

The word meaning prediction model is used for predicting the similarity degree of each participle and a preset standard sensitive participle, the word meaning prediction model is obtained by training a semantic matching model and a standard sensitive participle database, the semantic prediction model can comprise models such as an LSTM matching model, an MV-DSSM model and an ESIM model, and the word meaning prediction result is the similarity degree of each participle and the standard sensitive participle in the standard sensitive participle database.

For example, the participles include participles such as account balance of account information and participles such as stock trend information, feature extraction may be performed on each of the participles to obtain a word embedding feature, a position embedding feature and a segmentation embedding feature of each of the participles, word sense prediction may be performed on each of the participles according to the word embedding feature, the position embedding feature and the segmentation embedding feature based on an LSTM matching model to obtain a word sense prediction result corresponding to each of the participles, and the participles corresponding to the stock trend information may be screened out based on the word sense prediction result corresponding to each of the participles to obtain key information.

S102, preprocessing the key information to obtain discrete variables corresponding to the key information, wherein the preprocessing comprises data discretization processing or data normalization processing.

Since the key information is generally continuous data, the representation conversion, i.e. the data preprocessing operation, is required to be performed on the continuous data and the discrete data, which is a key step of the input and output of the neural network.

Illustratively, when the key information is information such as height and weight, the key information is continuous data, and when the key information is information such as the number of investment enterprises, the key information is discrete data.

The discrete variable refers to variables whose values can be listed in a certain order, and is usually a whole number of variables, such as the number of workers, the number of factories, the number of machines, and so on. Specifically, the data normalization processing may include maximum and minimum normalization processing and normalization processing according to a gaussian mixture model; the data discretization process may include a K-bins discretization process and a regression tree discretization process.

In some embodiments, the maximum and minimum normalization processing is performed on the key information to obtain a discrete variable corresponding to the key information; or, performing normalization processing on the key information through a Gaussian mixture model to obtain discrete variables corresponding to the key information; or performing K-bins discretization processing on the key information to obtain discrete variables corresponding to the key information; or performing discretization processing on the key information to obtain discrete variables corresponding to the key information.

Specifically, if the key information is continuous data, the key information may be mapped into a range of [0,1] through a maximum-minimum linear transformation, so that the continuous value may be represented by using a tanh activation function, and a discrete variable corresponding to the key information is obtained.

Specifically, if the key information is continuous data, the key information may be fitted through a gaussian mixture model, a gaussian component is sampled according to the probability of the key information in the gaussian component of the mixture model, and the sampled gaussian component is used to perform normalized representation on the key information in the record. Then, the key information may be composed of the normalized representation and the one-hot encoding of the gaussian component, so as to obtain the discrete variable corresponding to the key information.

Specifically, if the key information is continuous data, K-bins discretization processing may be performed on the key information to obtain a discrete variable corresponding to the key information. The discretization, which may also be referred to as binning, is to divide the key information into each interval according to a certain rule, and express each interval by using a unique hot code, so as to fit the key information by using a piecewise function including four intervals, thereby obtaining a discrete variable corresponding to the key information.

Specifically, if the key information is continuous data, the CART regression tree may be used to discretize the key information to obtain a discrete variable corresponding to the key information. The CART regression tree can predict continuous data, and leaf nodes of the CART regression tree represent a predicted value. The key information can be converted into discrete values by representing a series of leaf nodes of a regression tree or a regression tree set of the key information through one-hot coding.

If the key information is discrete data, it is not necessary to perform data discretization processing or data normalization processing.

S103, carrying out conditional random sampling processing on the discrete variable based on a conditional loss function to obtain a conditional embedded vector and an implicit vector, and splicing the conditional embedded vector and the implicit vector to obtain a spliced vector.

The conditional loss function is a conditional loss function based on a countermeasure generation network, data items of the conditional loss function based on the countermeasure generation network are generated based on conditional probability, and the purpose is to enable data to be generated conditionally, so that the distribution of data to be desensitized of the same type and the generated desensitized data is consistent as much as possible. However, since the condition sampled every time may be a different variable, it is difficult to sufficiently train the data based on the condition variable, and it is observed that the value of the corresponding variable in the generated data is inconsistent with the value of the condition variable. The training process can be constrained by predicting the condition variables, so that the values of the condition variables are consistent with the values of the corresponding variables in the generated data, and the effect of data generation can be further optimized.

Specifically, the conditional embedding vector may randomly and equiprobabilistically select a discrete variable meeting a preset condition from a plurality of discrete variables corresponding to the key information, the hidden vector may be obtained from white noise samples corresponding to the key information, and the concatenation vector is obtained by concatenating the conditional embedding vector and the hidden vector and is used as an input of the generator. By adding the hidden vector, the one-to-one mapping relation between desensitized data and original data is changed, so that the desensitized data is not easy to reverse-decompose, and privacy information is obtained.

Specifically, a distributed representation of the discrete variable can be obtained by constructing a probability mass distribution function of each value under the discrete variable, and conditional random sampling processing is performed on the distributed representation of the discrete variable to obtain a conditional embedded vector and a hidden vector.

Illustratively, white noise corresponding to a discrete variable may be converted by a deep neural network to generate a hidden vector from the distributed representation of the discrete variable.

In some embodiments, the conditional embedded vector is transformed to obtain a one-hot code; and splicing the one-hot code and the hidden vector to obtain a spliced vector. The One-Hot coding, i.e., One-Hot coding, also called active coding, uses an N-bit status register to code N states, each state being represented by its own independent register bit and only One of which is active at any time. The condition embedded vector is converted into the one-hot code, so that the problem that the attribute data is not well processed by the discriminator can be solved, and the function of expanding the vector characteristics is achieved to a certain extent.

Specifically, the conditional embedded vector can be converted through a deep neural network to obtain the one-hot code, and the one-hot code and the hidden vector are spliced to obtain the spliced vector. Therefore, the splicing vector meeting the input requirement of the generator can be obtained.

And S104, inputting the splicing vector into a pre-trained generator for desensitization treatment to obtain desensitization data.

The pre-trained generator is generated based on confrontation generation network training, and the desensitization data is data obtained after desensitization is carried out on key information in the data to be desensitized.

In some embodiments, a splicing vector corresponding to training data is obtained, and the splicing vector is input to a first generator for desensitization treatment to obtain desensitized data; training a preset discriminator based on the desensitized data and the training data to obtain a pre-trained discriminator; and according to a preset learning rate and the parameters of the pre-trained discriminator, carrying out repeated iteration updating on the parameters of the first generator to obtain a second generator, and using the second generator as the pre-trained generator. Therefore, the parameters of the first generator can be updated repeatedly for many times through the pre-trained discriminator and desensitized data, and very real desensitized data can be generated. The reason that the pre-trained discriminator needs to be trained first and then the generator needs to be trained is that the generator needs to be trained first, so that the parameters of the generator can be updated more accurately after the data to be desensitized and the generated desensitized data can be distinguished well.

The training data is a data set to be desensitized for training generator parameters, the first generator is a preset generator which is not trained, and the second generator is generated by the first generator through multiple iterative updates. Wherein the parameters of the first generator and the second generator are different. The prior probability of the discrete variable can be obtained by a distributed representation of the discrete variable, and the parameter is sampled from the prior probability as the parameter of the first generator. Specifically, the generator and the discriminator may be trained by a random gradient hamiltonian monte carlo method to obtain a pre-trained generator and a pre-trained discriminator.

Specifically, a preset discriminator is trained based on the desensitized data and the training data, the pretrained discriminator is obtained by splicing the condition embedding vector with the desensitized data and the training data respectively to obtain first splicing data and second splicing data, the similarity of the first splicing data and the second splicing data is calculated, a loss function is optimized according to the similarity of the first splicing data and the second splicing data, and gradient cutting is performed on the discriminator through the loss function to obtain the pretrained discriminator.

Illustratively, the parameters of the discriminator can be trained through the first generator and preset parameters of the discriminator, and desensitization data is discriminated as false as possible, so that the parameters of the discriminator are adjusted, and the discrimination capability of the discriminator on the data to be desensitized is improved.

Illustratively, the posterior probability of the second generator can be calculated through the prior probability of the first generator and the pre-trained discriminator parameter, so that the desensitization data can be judged as the data to be desensitized by the discriminator as much as possible, thereby adjusting the generator parameter and generating the real desensitization data.

In some embodiments, after obtaining the second generator, performing noise enhancement on the second generator based on a loss function of statistical information to obtain a pre-trained generator, where parameters of the first generator, the generator with updated parameters, and the pre-trained generator are different. The quality of the desensitization data generation and the degree of desensitization can thereby be controlled.

Wherein the statistical information-based loss function may include a mean-based loss function, a variance-based loss function, and the like.

Specifically, gaussian noise, which is an error conforming to a gaussian normal distribution, may be added to the parameters of the second generator, whereby polynomial fitting of a sinusoidal curve may be achieved. And specific values of gaussian noise can be obtained through experiments.

For example, an error term may be introduced into the parameters of the second generator, so as to modify the parameters of the second generator, thereby obtaining a pre-trained generator. Due to the existence of the error term, the generated desensitization data has a certain difference with the original data, but the difference is not large, so that the problem that the desensitization data is usually greatly different from the original data to lose research value can be avoided, and meanwhile, the data cannot be easily reversed.

In some embodiments, after desensitization data is obtained, random sampling processing is performed on discrete variables of the desensitization data to obtain target discrete variables; predicting the target discrete variable according to the rest discrete variables of the desensitization data based on a logistic regression model to obtain a prediction result of the target discrete variable; adjusting parameters of the pre-trained generator based on the prediction of the target discrete variable. Therefore, parameters of the generator can be adjusted by predicting the discrete variables, and a better desensitization effect is achieved. The better desensitization effect means that the desensitized data can be prevented from being reversely cracked, and meanwhile, the correlation with the original data can be maintained.

The target discrete variable is obtained by randomly sampling from a plurality of discrete variables of desensitization data, and meanwhile, in order to correlate the desensitization data with the original data, the target discrete variable is generally considered to be unchanged, and the desensitization data has a small difference from the original data, so that the research value is not lost, and therefore, the consistency of the target discrete variable needs to be ensured. The logistic regression model is used for predicting discrete variables.

Specifically, the cross entropy loss function may be utilized to determine whether the generated prediction result of the target discrete variable is consistent with the target discrete variable, thereby determining the generation quality of the desensitization data. If the prediction result of the target discrete variable is consistent with the target discrete variable, the parameters of the pre-trained generator do not need to be adjusted; and if the prediction result of the target discrete variable is inconsistent with the target discrete variable, determining the difference value between the prediction result of the target discrete variable and the target discrete variable, and adjusting the parameters of the pre-trained generator according to the difference value. Therefore, the accuracy of the target discrete variable can be determined, and the phenomenon that the generated desensitization data cause overlarge difference of original data is avoided. Since most of the discrete variables of the desensitization data and the original data are identical, one of the discrete variables is removed and can be accurately predicted from the remaining discrete variables.

Illustratively, if the target discrete variable of the desensitization data is the shoe size of 43 yards, based on a logistic regression model, the target discrete variable can be predicted through the rest discrete variables of the desensitization data, such as height, weight and the like, to obtain a prediction result of the shoe size, and whether the generated prediction result of the shoe size is consistent with the shoe size of the desensitization data is judged, for example, if the generated prediction result of the shoe size is the shoe size of 40 yards, a difference value of 1 yard is determined, and parameters of the pre-trained generator are continuously updated iteratively according to the difference value; if the prediction result of the generated shoe size is that the shoe size is 43, the parameters of the pre-trained generator do not need to be adjusted.

In some embodiments, the server may further send prompt information to the terminal device for prompting the user that desensitization data has been generated.

The prompting information may specifically include an application program (APP) or email, a short message, and a chat tool, such as WeChat, qq, and the like.

Illustratively, when desensitization data has been generated, the server may send a prompt message to the terminal device to remind the user that desensitization data has been generated.

Referring to fig. 3, fig. 3 is a schematic block diagram of a data desensitization apparatus according to an embodiment of the present application, where the data desensitization apparatus may be configured in a server for executing the data desensitization method.

As shown in fig. 3, the data desensitization apparatus 200 includes: a key information extraction module 201, an information processing module 202, a vector splicing module 203 and a data desensitization module 204.

The key information extraction module 201 is configured to obtain user data, and perform information recognition on the user data based on a pre-trained key information recognition model to obtain key information;

the information processing module 202 is configured to perform preprocessing on the key information to obtain a discrete variable corresponding to the key information, where the preprocessing includes data discretization processing or data normalization processing;

the vector splicing module 203 is configured to perform conditional random sampling processing on the discrete variable based on a conditional loss function to obtain a conditional embedded vector and an implicit vector, and splice the conditional embedded vector and the implicit vector to obtain a spliced vector;

the data desensitization module 204 is used for inputting the splicing vector into a pre-trained generator for desensitization treatment to obtain desensitization data;

the feature extraction module 201 is further configured to perform word segmentation processing on the user data to obtain a plurality of words; extracting features of each word segmentation to obtain embedded features of each word segmentation; performing word sense prediction according to the embedding characteristics of each participle to obtain a word sense corresponding to each participle; and screening the multiple participles according to the word senses corresponding to the participles to obtain key information.

The information processing module 202 is further configured to perform maximum and minimum normalization processing on the key information to obtain a discrete variable corresponding to the key information; or, performing normalization processing on the key information through a Gaussian mixture model to obtain discrete variables corresponding to the key information; or performing K-bins discretization processing on the key information to obtain discrete variables corresponding to the key information; or performing discretization processing on the key information to obtain discrete variables corresponding to the key information.

The vector splicing module 203 is further configured to perform conversion processing on the conditional embedded vector to obtain a one-hot code; and splicing the one-hot code and the hidden vector to obtain a spliced vector.

The generator training module 205 is configured to obtain a splicing vector corresponding to training data, and input the splicing vector to the first generator for desensitization processing to obtain desensitized data; training a preset discriminator based on the desensitized data and the training data to obtain a pre-trained discriminator; and according to a preset learning rate and the parameters of the pre-trained discriminator, carrying out repeated iteration updating on the parameters of the first generator to obtain a second generator, and using the second generator as the pre-trained generator.

The generator training module 205 is further configured to perform noise enhancement processing on the second generator based on a loss function of statistical information to obtain a pre-trained generator, where parameters of the first generator, the second generator, and the pre-trained generator are different.

The generator training module 205 is further configured to perform random sampling processing on the discrete variable of the desensitization data to obtain a target discrete variable; predicting the target discrete variable according to the rest discrete variables of the desensitization data based on a logistic regression model to obtain a prediction result of the target discrete variable; adjusting parameters of the pre-trained generator based on the prediction of the target discrete variable.

It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the apparatus, the modules and the units described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The methods, apparatus, and devices of the present application are operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

For example, the method and apparatus described above may be implemented in the form of a computer program that can be run on a computer device as shown in fig. 4.

Referring to fig. 4, fig. 4 is a schematic diagram of a computer device according to an embodiment of the present disclosure. The computer device may be a server.

As shown in fig. 4, the computer device includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a volatile storage medium, a non-volatile storage medium, and an internal memory.

The non-volatile storage medium may store an operating system and a computer program. The computer program comprises program instructions that, when executed, cause a processor to perform any of the data desensitization methods.

The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.

The internal memory provides an environment for the execution of a computer program on a non-volatile storage medium, which when executed by the processor, causes the processor to perform any of a variety of data desensitization methods.

The network interface is used for network communication, such as sending assigned tasks and the like. Those skilled in the art will appreciate that the configuration of the computer apparatus is merely a block diagram of a portion of the configuration associated with aspects of the present application and is not intended to limit the computer apparatus to which aspects of the present application may be applied, and that a particular computer apparatus may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

It should be understood that the Processor may be a Central Processing Unit (CPU), and the Processor may be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, etc. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

Wherein, in some embodiments, the processor is configured to execute a computer program stored in the memory to implement the steps of: acquiring user data, and performing information identification on the user data based on a pre-trained key information identification model to obtain key information; preprocessing the key information to obtain discrete variables corresponding to the key information, wherein the preprocessing comprises data discretization processing or data normalization processing; based on a conditional loss function, carrying out conditional random sampling processing on the discrete variable to obtain a conditional embedded vector and an implicit vector, and splicing the conditional embedded vector and the implicit vector to obtain a spliced vector; and inputting the splicing vector into a pre-trained generator for desensitization treatment to obtain desensitization data.

In some embodiments, the processor is further configured to: performing word segmentation processing on the user data to obtain a plurality of words; extracting features of each word segmentation to obtain embedded features of each word segmentation; performing word sense prediction according to the embedding characteristics of each participle to obtain a word sense corresponding to each participle; and screening the multiple participles according to the word senses corresponding to the participles to obtain key information.

In some embodiments, the processor is further configured to: performing maximum and minimum normalization processing on the key information to obtain a discrete variable corresponding to the key information; or, performing normalization processing on the key information through a Gaussian mixture model to obtain discrete variables corresponding to the key information; or performing K-bins discretization processing on the key information to obtain discrete variables corresponding to the key information; or performing discretization processing on the key information to obtain discrete variables corresponding to the key information.

In some embodiments, the processor is further configured to: converting the condition embedded vector to obtain a one-hot code; and splicing the one-hot code and the hidden vector to obtain a spliced vector.

In some embodiments, the processor is further configured to: acquiring a splicing vector corresponding to training data, and inputting the splicing vector into a first generator for desensitization treatment to obtain desensitized data; training a preset discriminator based on the desensitized data and the training data to obtain a pre-trained discriminator; and according to a preset learning rate and the parameters of the pre-trained discriminator, carrying out repeated iteration updating on the parameters of the first generator to obtain a second generator, and using the second generator as the pre-trained generator.

In some embodiments, the processor is further configured to: and based on a loss function of statistical information, performing noise increasing processing on the second generator to obtain a pre-trained generator, wherein the parameters of the first generator, the second generator and the pre-trained generator are different.

In some embodiments, the processor is further configured to: carrying out random sampling processing on the discrete variable of the desensitization data to obtain a target discrete variable; predicting the target discrete variable according to the rest discrete variables of the desensitization data based on a logistic regression model to obtain a prediction result of the target discrete variable; adjusting parameters of the pre-trained generator based on the prediction of the target discrete variable.

The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, where the computer program includes program instructions, and the program instructions, when executed, implement any one of the data desensitization methods provided in the embodiment of the present application.

The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device.

Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.

The invention relates to a novel application mode of computer technologies such as storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like of a block chain language model. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.

While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method of data desensitization, the method comprising:

2. The method of claim 1, wherein the performing information recognition on the user data based on a pre-trained key information recognition model to obtain key information comprises:

performing word segmentation processing on the user data to obtain a plurality of words;

extracting features of each word segmentation to obtain embedded features of each word segmentation;

performing word sense prediction according to the embedding characteristics of each participle to obtain a word sense corresponding to each participle;

and screening the multiple participles according to the word senses corresponding to the participles to obtain key information.

3. The method according to claim 1, wherein the preprocessing the key information to obtain a discrete variable corresponding to the key information comprises:

performing maximum and minimum normalization processing on the key information to obtain a discrete variable corresponding to the key information; or the like, or, alternatively,

normalizing the key information through a Gaussian mixture model to obtain a discrete variable corresponding to the key information; or the like, or, alternatively,

performing K-bins discretization processing on the key information to obtain discrete variables corresponding to the key information; or the like, or, alternatively,

and performing discretization processing on the key information to obtain discrete variables corresponding to the key information.

4. The method of claim 1, wherein the stitching the conditional embedded vector with the hidden vector to obtain a stitched vector comprises:

converting the condition embedded vector to obtain a one-hot code;

and splicing the one-hot code and the hidden vector to obtain a spliced vector.

5. The method of claim 1, further comprising:

acquiring a splicing vector corresponding to training data, and inputting the splicing vector into a first generator for desensitization treatment to obtain desensitized data;

training a preset discriminator based on the desensitized data and the training data to obtain a pre-trained discriminator;

and according to a preset learning rate and the parameters of the pre-trained discriminator, carrying out repeated iteration updating on the parameters of the first generator to obtain a second generator, and using the second generator as the pre-trained generator.

6. The method of claim 5, wherein after obtaining the second generator, the method further comprises:

and based on a loss function of statistical information, performing noise increasing processing on the second generator to obtain a pre-trained generator, wherein the parameters of the first generator, the second generator and the pre-trained generator are different.

7. The method of claim 1, wherein after obtaining desensitization data, the method further comprises:

carrying out random sampling processing on the discrete variable of the desensitization data to obtain a target discrete variable;

predicting the target discrete variable according to the rest discrete variables of the desensitization data based on a logistic regression model to obtain a prediction result of the target discrete variable;

adjusting parameters of the pre-trained generator based on the prediction of the target discrete variable.

8. A data desensitization apparatus, comprising:

9. A computer device, wherein the computer device comprises a memory and a processor;

the memory for storing a computer program;

the processor is used for executing the computer program and realizing the following when the computer program is executed:

a method of data desensitization according to any of claims 1-7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out a data desensitization method according to any one of claims 1 to 7.