CN112269841A - Data generation method and device for data generation

Info

Publication number: CN112269841A
Application number: CN202011018144.1A
Authority: CN
Inventors: 郝天一; 陈琨
Original assignee: Huakong Tsingjiao Information Technology Beijing Co Ltd
Current assignee: Huakong Tsingjiao Information Technology Beijing Co Ltd
Priority date: 2020-09-24
Filing date: 2020-09-24
Publication date: 2021-01-26

Abstract

Embodiments of the invention provide a data generation method, a data generation apparatus, and an apparatus for data generation. In the method, original data to be processed is obtained from an original data set, and adjacent original data of the original data is obtained according to the other original data in the original data set. For first subdata of any data type contained in the original data, simulation subdata corresponding to that data type is generated according to the first subdata and second subdata in the adjacent original data, and simulation data of the original data is generated according to the simulation subdata corresponding to the data type. This reduces, to a certain extent, the probability that the simulation data still reflects the content of a single piece of original data, and thus avoids information leakage to a certain extent. At the same time, the method ensures, to a certain extent, that the association distribution among the finally generated simulation data is similar to the association distribution among the original data in the original data set, thereby improving the data generation effect.

Description

Data generation method and device for data generation

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a data generation method and apparatus, and an apparatus for data generation.

Background

At present, a series of operations, such as training a model, detecting model performance, performing multi-party security calculations, etc., are often performed using raw data in a raw data set. To ensure data security, it is often necessary to generate a simulated data set from the original data set, using the simulated data set for calculations.

In the conventional method, simulation data is often generated by random resampling of the subdata in the multiple dimensions contained in the original data. Simulation data generated in this way may still reflect the content of the original data, so the original data may be leaked and the data generation effect is poor.

Disclosure of Invention

Embodiments of the present invention provide a data generation method, a data generation apparatus, and an apparatus for data generation, so that the generated simulation data can avoid information leakage, and so that the association distribution among the finally generated simulation data is, to a certain extent, similar to the association distribution among the original data in the original data set, thereby improving the data generation effect.

In order to solve the above problem, an embodiment of the present invention discloses a data generation method, where the method includes:

acquiring original data to be processed from an original data set;

acquiring adjacent original data of the original data according to other original data in the original data set;

for first subdata of any data type contained in the original data, generating analog subdata corresponding to the data type according to the first subdata and second subdata in the adjacent original data; the data type of the second subdata is matched with the data type of the first subdata;

and generating the simulation data of the original data according to the simulation subdata corresponding to the data type.

On the other hand, the embodiment of the invention discloses a data generation device, which comprises:

the first acquisition module is used for acquiring original data to be processed from an original data set;

a second obtaining module, configured to obtain, according to other raw data in the raw data set, raw data adjacent to the raw data;

a first generation module, configured to generate, for first subdata of any data type included in the original data, analog subdata corresponding to the data type according to the first subdata and second subdata in the adjacent original data; the data type of the second subdata is matched with the data type of the first subdata;

and the second generation module is used for generating the simulation data of the original data according to the simulation subdata corresponding to the data type.

In yet another aspect, an embodiment of the present invention discloses an apparatus for data generation, including a memory and one or more programs, where the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs including instructions for:

acquiring original data to be processed from an original data set;

In yet another aspect, embodiments of the invention disclose a machine-readable medium having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform a data generation method as described in one or more of the preceding.

The embodiment of the invention has the following advantages:

The embodiment of the invention acquires original data to be processed from an original data set, acquires adjacent original data of the original data according to other original data in the original data set, and, for first subdata of any data type contained in the original data, generates simulation subdata corresponding to the data type according to the first subdata and second subdata in the adjacent original data, wherein the data type of the second subdata matches the data type of the first subdata. Finally, the simulation data of the original data is generated according to the simulation subdata corresponding to the data type. Compared with generating the simulation data directly from the content of the original data, the embodiment of the invention generates the simulation data by combining the selected original data with its adjacent original data, which reduces, to a certain extent, the probability that the simulation data still reflects the content of a single piece of original data, and thus avoids information leakage to a certain extent. At the same time, generating the simulation data by combining the original data and the adjacent original data ensures, to a certain extent, that the association distribution among the finally generated simulation data is similar to the association distribution among the original data in the original data set, thereby improving the data generation effect.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.

FIG. 1 is a flow chart of the steps of one embodiment of a data generation method of the present invention;

FIG. 2 is a block diagram of an embodiment of a data generating apparatus according to the present invention;

FIG. 3 is a block diagram of an apparatus 800 for data generation of the present invention; and

FIG. 4 is a schematic diagram of a server in some embodiments of the invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Method embodiment

Referring to FIG. 1, a flow chart of the steps of an embodiment of a data generation method of the present invention is shown. The method comprises the following steps:

Step 101, acquiring original data to be processed from an original data set.

Step 102, acquiring adjacent original data of the original data according to other original data in the original data set.

Step 103, for first subdata of any data type contained in the original data, generating simulation subdata corresponding to the data type according to the first subdata and second subdata in the adjacent original data; the data type of the second subdata matches the data type of the first subdata.

Step 104, generating the simulation data of the original data according to the simulation subdata corresponding to the data type.

In the data generation method of the embodiment of the invention, original data to be processed is acquired from an original data set, adjacent original data of the original data is acquired according to other original data in the original data set, and, for first subdata of any data type contained in the original data, simulation subdata corresponding to the data type is generated according to the first subdata and second subdata in the adjacent original data, wherein the data type of the second subdata matches the data type of the first subdata. Finally, the simulation data of the original data is generated according to the simulation subdata corresponding to the data type. Compared with generating the simulation data directly from the content of the original data, the embodiment of the invention generates the simulation data by combining the selected original data with its adjacent original data, which reduces, to a certain extent, the probability that the simulation data still reflects the content of a single piece of original data, and thus avoids information leakage to a certain extent. At the same time, generating the simulation data by combining the original data and the adjacent original data ensures, to a certain extent, that the association distribution among the finally generated simulation data is similar to the association distribution among the original data in the original data set, thereby improving the data generation effect.

Furthermore, according to the data type, the original data and the adjacent original data are combined to generate the simulation data, so that the data type of the finally generated simulation data can be ensured to be matched with the original data type, and the data generation effect is further ensured.

In an optional implementation manner of the embodiment of the present invention, the operation of obtaining neighboring original data of the original data according to other original data in the original data set in step 102 may include:

Step S11, determining a distance between the original data and each of the other original data.

Step S12, determining the number r to be selected according to the total number n of the original data contained in the original data set; r is positively correlated with n.

Step S13, selecting the r pieces of other original data with the shortest distances as the adjacent original data.

Here, the other original data refers to the original data in the original data set other than the original data to be processed. The distance between the original data and a piece of other original data characterizes how similar the two are. Further, the larger n is, the larger r may be, and the smaller n is, the smaller r may be. When the adjacent original data is selected, the other original data may be sorted by distance in ascending order, and the first r pieces of other original data in the sorted result are determined as the adjacent original data. Alternatively, the other original data may be sorted by distance in descending order, and the last r pieces in the sorted result are determined as the adjacent original data. It should be noted that the adjacent original data may also be selected based on the SMOTE algorithm, so that original data with different labels appears with approximately the same frequency.

In the embodiment of the invention, the number r of adjacent original data to be selected is determined according to the total number n of the original data contained in the original data set, and r pieces of adjacent original data are selected accordingly. In this way, the proportion of adjacent original data that is selected can be effectively controlled. When the original data set is small, that is, when it contains few pieces of original data, this avoids selecting too much adjacent original data, which would make the adjacent original data less representative and degrade the subsequent data generation effect. Controlling the quantity of selected adjacent original data also avoids excessive computation in subsequent processing, which improves the data generation efficiency to a certain extent.
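Steps S11 to S13 can be sketched as follows. This is a minimal illustration that assumes the distances from the current record to every record are already available as a list; the function name `select_neighbors` and its parameters are hypothetical and do not come from the original text. The sample row is the first five entries of the first row of the distance matrix in the specific example.

```python
from typing import List, Sequence


def select_neighbors(distances: Sequence[float], k: int, r: int) -> List[int]:
    """Return the indices of the r records closest to record k.

    distances[j] is the distance from record k to record j.
    Record k itself is excluded, because only "other" original data qualifies.
    """
    others = [j for j in range(len(distances)) if j != k]
    # Ascending sort puts the nearest records first.
    others.sort(key=lambda j: distances[j])
    return others[:r]


row = [0.00, 2.00, 2.84, 2.61, 5.27]
nearest = select_neighbors(row, k=0, r=2)
print(nearest)  # [1, 3]
```

If r exceeds the number of other records, the slice simply returns all of them.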

In an optional implementation manner of the embodiment of the present invention, the operation of determining the distance between the original data and each of the other original data in step S11 may include:

Step S11a, for any of the other original data, calculating a standard deviation and a difference between the subdata of the first data type in the original data and the subdata of the first data type in the other original data; the first data type is an integer type or a floating point type.

Step S11b, calculating a distance between the original data and each of the other original data according to a standard deviation and a difference between the original data and each of the other original data; wherein the distance is inversely related to the standard deviation and the distance is positively related to the absolute value of the difference.

In specific implementation, the subdata under each data dimension of the subdata of the first data type in the original data and the subdata under each data dimension of the subdata of the first data type in the other original data can be used as the input of a preset standard deviation calculation formula, and then the output of the preset standard deviation calculation formula is determined as the standard deviation between the original data and the other original data. Or, respectively taking the subdata under each data dimension of the subdata of the first data type in the original data and the subdata under each data dimension of the subdata of the first data type in the other original data as the input of a preset standard deviation calculation formula, calculating the standard deviation under each data dimension, and then determining the sum of the standard deviations under each data dimension as the standard deviation between the original data and the other original data. Further, the difference between the subdata of the first data type in the original data in each data dimension and the subdata of the first data type in other original data in each data dimension may be calculated, and then the sum of the differences may be calculated, thereby obtaining the difference between the subdata of the first data type in the original data and the subdata of the first data type in other original data. It should be noted that, in the embodiment of the present invention, the distance between every two original data in the original data set may also be calculated in advance according to the above calculation method, and the distance is stored in the matrix. Accordingly, the matrix can be directly searched to obtain the distance between the original data and other original data.
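The precomputed distance matrix mentioned above can be sketched as follows. The sketch assumes that the distance is symmetric and zero between a record and itself, so only the upper triangle is computed; the function name `pairwise_distance_matrix` and the one-dimensional absolute-difference distance used in the usage lines are placeholders for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

Record = Tuple


def pairwise_distance_matrix(
    records: Sequence[Record],
    distance: Callable[[Record, Record], float],
) -> List[List[float]]:
    """Compute all pairwise distances once so later lookups are O(1)."""
    n = len(records)
    matrix = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for j in range(k + 1, n):
            d = distance(records[k], records[j])
            matrix[k][j] = d  # assumes distance(a, b) == distance(b, a)
            matrix[j][k] = d
    return matrix


# Placeholder distance on one numeric feature, for illustration only.
points = [(0.0,), (3.0,), (7.0,)]
matrix = pairwise_distance_matrix(points, lambda a, b: abs(a[0] - b[0]))
print(matrix[0][2])  # 7.0
```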

In the embodiment of the invention, the standard deviation and the difference value are calculated according to the subdata of the first data type in the original data and the subdata of the first data type in other original data, so that the subdata of the data type which cannot be calculated can be prevented from being introduced, and the calculation operation can be ensured to be carried out smoothly. Further, compared with a mode of directly determining the distance according to the difference, the distance is calculated according to the principle that the distance is in negative correlation with the standard deviation and in positive correlation with the absolute value of the difference by combining the standard deviation and the difference in the embodiment of the invention, the difference can be balanced through the standard deviation, and the accuracy of the distance is further improved to a certain extent.

In an optional implementation manner of the embodiment of the present invention, step S11b may include:

Step S11b1, for any of the other original data, performing a target operation on the ratio of the difference to the standard deviation to obtain a target ratio; the target ratio is not less than 0.

Step S11b2, determining the number of occurrences of subdata with different values in the subdata of the second data type in the original data and the subdata of the second data type in the other original data; the second data type comprises an enumerated type.

Step S11b3, determining the sum of the occurrence times and the target ratio as the distance between the original data and the other original data.

The target operation may be any operation capable of converting the ratio of the difference to the standard deviation into a non-negative number. For example, the target operation may take the absolute value of the difference or calculate the square of the difference. Accordingly, the target ratio obtained through the target operation is a number not less than 0. Further, taking the case where the subdata of the second data type includes feature values corresponding to features in different data dimensions as an example, the size of the set of features whose feature values differ between the subdata of the second data type in the original data and the subdata of the second data type in the other original data can be determined, which gives the number of occurrences. Finally, the sum of the number of occurrences and the target ratio may be calculated to obtain the distance between the original data and the other original data.

For example, the subdata in different data dimensions may characterize features in different dimensions, and the subdata in a data dimension may be the feature value of the feature in that data dimension. Assuming that the original data includes feature values corresponding to features in m data dimensions, the original data may be represented as x_k = (x_k1, x_k2, …, x_km). The sets of floating-point-type features, integer-type features, and enumeration-type features are defined as I₁, I₂, and I₃, respectively. In the embodiment of the invention, the distance after feature normalization can be calculated according to the data dimensions and the feature values in each data dimension. Specifically, for a feature i ∈ I₁ ∪ I₂ of the integer type or floating point type, the standard deviation can be calculated as σ_i. When there are a plurality of data dimensions, a plurality of σ_i corresponding to the plurality of data dimensions can be obtained. For a feature i ∈ I₃, the number of occurrences can be determined according to step S11b2 above. In one implementation, the distance d_kj between the original data x_k and the other original data x_j can be expressed as the sum of the target ratios obtained from (x_ki − x_ji)/σ_i over the features i ∈ I₁ ∪ I₂, plus the number of features i ∈ I₃ for which x_ki ≠ x_ji.

In the embodiment of the invention, when the distance is calculated, the difference between feature values is divided by the standard deviation of the feature in that data dimension, so that each data dimension has approximately the same influence on the distance calculation. The distance calculation is therefore, to a certain extent, not affected by the different feature scales of different data dimensions, and the accuracy of the calculated distance is improved. Of course, other methods for calculating the distance may also be used, and the embodiment of the present invention is not limited thereto.
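A minimal sketch of such a normalized mixed-type distance is shown below. It rests on several assumptions that are not fixed by the text: the standard deviation σ_i is taken as the population standard deviation of feature i over the whole original data set, the target operation is the absolute value, and a feature with zero standard deviation is skipped (all of its values are equal, so it cannot separate records). The names `feature_stds` and `mixed_distance` are hypothetical.

```python
import statistics
from typing import Dict, Sequence, Tuple

Record = Tuple


def feature_stds(records: Sequence[Record], numeric_dims: Sequence[int]) -> Dict[int, float]:
    """Population standard deviation of each numeric feature over the data set."""
    return {i: statistics.pstdev([rec[i] for rec in records]) for i in numeric_dims}


def mixed_distance(
    a: Record,
    b: Record,
    stds: Dict[int, float],
    enum_dims: Sequence[int],
) -> float:
    """Sum of |difference| / std over numeric features plus enum mismatches."""
    numeric_part = 0.0
    for i, sigma in stds.items():
        if sigma > 0:  # assumption: constant features contribute nothing
            numeric_part += abs(a[i] - b[i]) / sigma
    # Number of enumeration-type features whose values differ.
    mismatches = sum(1 for i in enum_dims if a[i] != b[i])
    return numeric_part + mismatches


records = [(1.0, 10, "A"), (3.0, 10, "B"), (5.0, 10, "A")]
stds = feature_stds(records, numeric_dims=[0, 1])
d01 = mixed_distance(records[0], records[1], stds, enum_dims=[2])
print(round(d01, 4))  # 2.2247
```

Replacing `abs(...) / sigma` with a squared ratio gives the other target operation mentioned in the text.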

In an optional implementation manner of the embodiment of the present invention, the determining, in step S12, the number r to be selected according to the total number n of the original data included in the original data set may include:

Step S12a, calculating a ratio of the total number n of the original data to a first preset value, and calculating a logarithm of the total number n of the original data relative to a second preset value.

Step S12b, determining the minimum of the ratio and the logarithm as the number r to be selected.

The first preset value and the second preset value may be preset according to actual conditions. For example, the first preset value may be 2, and the second preset value may be 10. Accordingly, the calculated ratio can be expressed as n/2, and the calculated logarithm may be represented as 10 ln n. In specific implementation, the minimum value may be rounded down, to avoid the finally determined number r being a non-integer, which would prevent the corresponding number of adjacent original data from being selected normally. The process of determining r can be expressed as:

r = ⌊min(n/2, 10 ln n)⌋

In the embodiment of the invention, r can be accurately determined according to the scale of the original data set by calculating the ratio of the total number n of the original data to the first preset value, calculating the logarithm of the total number n of the original data relative to the second preset value, and finally determining the minimum of the ratio and the logarithm as the number to be selected, thereby ensuring the accuracy of the determination operation to a certain extent.
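The determination of r can be illustrated with a short sketch. It uses the example preset values from the text (2 and 10) and reads the logarithm term as 10 ln n, as the text writes it; the function name `neighbor_count` and its parameter names are hypothetical.

```python
import math


def neighbor_count(n: int, first_preset: float = 2, second_preset: float = 10) -> int:
    """Number of neighbors to select: floor(min(n / first, second * ln n)).

    Requires n >= 1 so that the natural logarithm is defined.
    """
    ratio = n / first_preset
    log_term = second_preset * math.log(n)
    return math.floor(min(ratio, log_term))


print(neighbor_count(1309))  # 71
print(neighbor_count(4))     # 2
```

For small data sets the ratio term dominates, and for large data sets the logarithm term caps the growth of r.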

In an optional implementation manner of the embodiment of the present invention, the operation of generating the analog sub-data corresponding to the data type according to the first sub-data and the second sub-data in the adjacent original data in step 103 may include:

Step S21, if the data type of the first subdata is a first data type, obtaining a fusion weight for the first subdata under any data dimension in the first subdata; calculating a weighted sum between the first subdata under the data dimension and the second subdata under the data dimension according to the fusion weight; and generating the simulation subdata according to the weighted sum.

Step S22, if the data type of the first subdata is a second data type, obtaining a numerical value of which the occurrence times of the first subdata under the data dimension and the second subdata under the data dimension meet a preset condition for the first subdata under any data dimension in the first subdata; and generating the analog subdata according to the numerical value of which the occurrence times meet the preset condition.

In a specific implementation, the first data type may be an integer type or a floating point type, and the second data type may be an enumeration type. When the fusion weight is obtained, a value can be randomly selected from a preset value range to serve as the fusion weight. The preset value range may be set according to actual requirements; for example, the preset value range may be [0, 1]. By randomly choosing the fusion weight at each calculation, the randomness of the finally generated simulation data is increased while the original data and the adjacent original data are still combined, which helps prevent sensitive information from being leaked. Further, the first subdata in the data dimension may be the feature value of the feature in that data dimension in the original data, and the second subdata in the data dimension may be the feature value of the feature in that data dimension in the adjacent original data. With s representing the fusion weight, the calculated weighted sum may be represented as:

Here, x_k represents the current original data, x_ki represents the first subdata in dimension i of the original data, x_p represents the adjacent original data, and x_pi represents the second subdata in dimension i of the adjacent original data. It should be noted that, when there are a plurality of adjacent original data, one of them may be randomly selected as x_p, so as to increase the randomness of the calculation result and further improve the confidentiality of the finally generated simulation data. Further, the calculated weighted sum can be used directly as the simulation subdata in the data dimension.

Optionally, in the embodiment of the present invention, when the first data type is an integer type, before generating the analog sub-data according to the weighted sum, whether the data type of the weighted sum is the integer type may be detected; if the data type of the weighted sum is not an integer type, then a rounding operation will be performed on the weighted sum. For example, when performing the rounding operation, the weighted sum may be adjusted to be of an integer type according to a rounding algorithm, or a fractional part of the weighted sum may be directly discarded to obtain a weighted sum of an integer type. In this way, when the first data type is an integer type, the rounding operation is performed on the calculated weighted sum, so that the finally generated analog subdata can be ensured to be the same as the data type of the original data, and the data generation effect can be further ensured. Further, the preset condition may be set according to actual requirements, and for example, the preset condition may be that the occurrence number is the largest. Correspondingly, for the enumerated type data, the value with the largest occurrence number in the first subdata and the second subdata can be used as the simulation subdata under the data dimension. According to the method and the device, the characteristic values under each data dimension are generated by integrating the characteristics between the original data and the adjacent original data, so that sensitive information in the original data is prevented from being leaked while a set formed by finally generated simulation data is ensured to have data distribution similar to that of an original data set to a certain extent.

Optionally, after generating the simulation subdata under each data dimension, all the simulation subdata may be combined to obtain the simulation data of the original data. In the embodiment of the invention, different processing is carried out according to the data types to obtain the analog subdata under different data dimensions, and then the analog subdata is combined to obtain the analog data of the original data, so that the data type of the finally generated analog data is ensured to be the same as the data type of the original data, and the analog effect of the analog data on the original data is further improved.
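The per-type generation in steps S21 and S22 and the final combination can be sketched as below. The sketch makes assumptions that the text leaves open: the weighted sum is taken as the convex combination s·x_ki + (1 − s)·x_pi with s drawn uniformly from [0, 1], one neighbor x_p is drawn once per record, and the preset condition for enumeration types is "most frequent value". The function name `simulate_record` and its parameters are hypothetical.

```python
import random
from collections import Counter
from typing import Sequence, Tuple

Record = Tuple


def simulate_record(
    original: Record,
    neighbors: Sequence[Record],
    int_dims: Sequence[int],
    float_dims: Sequence[int],
    enum_dims: Sequence[int],
    rng: random.Random,
) -> Record:
    """Build one simulated record from a record and its neighbors."""
    partner = rng.choice(neighbors)  # randomly chosen x_p
    simulated = list(original)
    for i in list(int_dims) + list(float_dims):
        s = rng.uniform(0.0, 1.0)  # fusion weight, redrawn per dimension
        value = s * original[i] + (1 - s) * partner[i]
        # Integer features are rounded so the type matches the original.
        # Python's round() rounds halves to the nearest even integer.
        simulated[i] = round(value) if i in int_dims else value
    for i in enum_dims:
        # Most frequent value among the record and all of its neighbors.
        counts = Counter([original[i]] + [nb[i] for nb in neighbors])
        simulated[i] = counts.most_common(1)[0][0]
    return tuple(simulated)


original = (30, 1.5, "A")
neighbors = [(40, 2.5, "B"), (50, 3.5, "B")]
print(simulate_record(original, neighbors, [0], [1], [2], random.Random(0)))
```

Truncating the fractional part instead of rounding, which the text also allows, would replace `round(value)` with `int(value)`.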

Optionally, the raw data set in the embodiment of the present invention may be a training data set for model training. In one prior implementation, a separate distribution of each feature in the raw data is separately counted, and each piece of simulation data is then separately randomly generated according to the distribution. However, in this generation method, it is assumed that the distributions of the features of each data dimension are mutually independent, and the correlation between the features cannot be reflected, and further, the generated simulation data cannot reasonably reflect the association between the label of the data and the feature combination when used for model training. In another prior implementation, joint distributions among features are considered, the joint distributions are determined by a joint distribution model, and sampling is performed from the joint distributions to generate simulation data. However, due to the limitation of the joint distribution model on the description capability of the data distribution, the association between the labels and the feature combinations of the data cannot be reasonably reflected in the mode, and the effect of model training using the simulation data is poor. In the embodiment of the invention, the simulated subdata under each data dimension is generated by combining the original data and the data under the same data dimension in a plurality of adjacent original data, namely, the characteristics of the same data dimension are integrated, so that the simulated subdata can embody the association distribution among the characteristics under the data dimension to a certain extent. Accordingly, when model training is subsequently performed by using finally generated simulation data, the effect of model training can be ensured to a certain extent.

Of course, the original data set in the embodiment of the present invention may also be another kind of data set, for example a test data set for model testing, which is used to evaluate the performance, accuracy metrics, and convergence of an algorithm model at run time. However, the original data of the algorithm model may not be suitable for being handed directly to an algorithm developer for testing, because the data volume is too large, the data contains sensitive information, the data is subject to confidentiality requirements, and so on. In the embodiment of the invention, a simulation data set is generated from the test data set, so that an algorithm developer can conveniently carry out the test.

In an optional implementation manner of the embodiment of the present invention, after the simulation data of the original data is generated, it may be further detected whether data matching the simulation data exists in the original data set; if such data exists, the simulation data is discarded. In a specific embodiment, the simulation data may be compared with each piece of original data in the original data set, and if an identical piece of original data is found, it is determined that data matching the simulation data exists. Accordingly, after the simulation data is discarded, or when no data matching the simulation data is detected, the step of determining the original data to be processed may be executed again to continue generating new simulation data. In the embodiment of the invention, discarding simulation data that duplicates original data prevents leakage of the original data to a greater extent, thereby improving the security of the data.
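The matching check can be sketched as a simple membership test. This sketch assumes records are hashable tuples and that "matching" means exact equality, which is the comparison the specific embodiment describes; the function name `drop_matches` is hypothetical.

```python
from typing import Iterable, List, Tuple

Record = Tuple


def drop_matches(candidates: Iterable[Record], originals: Iterable[Record]) -> List[Record]:
    """Keep only simulated records that do not equal any original record."""
    original_lookup = set(originals)  # set membership avoids a linear scan per candidate
    return [c for c in candidates if c not in original_lookup]


originals = [(30, 1.5, "A"), (40, 2.5, "B")]
candidates = [(35, 2.0, "B"), (40, 2.5, "B")]
print(drop_matches(candidates, originals))  # [(35, 2.0, 'B')]
```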

Further, in an optional implementation manner of the embodiment of the present invention, after analog data of original data is generated each time, whether the total amount of the analog data reaches a preset amount threshold may be detected; if the total quantity of the analog data reaches the preset quantity threshold, stopping the operation of acquiring the original data to be processed from the original data set; wherein the preset quantity threshold is the total quantity of the original data contained in the original data set.

The preset number threshold may be a parameter for controlling the size of the simulation data set formed by the finally generated simulation data. The larger the preset number threshold is, the larger the scale of the finally generated simulation data set is, and the smaller the preset number threshold is, the smaller the scale of the finally generated simulation data set is. In specific implementation, the preset number threshold may also be set to other values, which is not limited in the embodiment of the present invention. Furthermore, in the embodiment of the present invention, by setting the preset number threshold to the total number of the original data included in the original data set, it can be ensured that the scale of the finally generated simulation data set is the same as that of the original data set, so that the similarity between the simulation data set and the original data set can be increased while avoiding leakage of data, and the simulation effect can be further improved to a certain extent.

In an optional implementation of the embodiment of the present invention, the operation of acquiring original data to be processed from the original data set in step 101 may include: randomly selecting one piece of data from the original data set as the original data to be processed. Specifically, the original data in the embodiment of the present invention may be personal information of a user, feature information of a picture, and the like. If the original data set were instead traversed in a fixed order to select the original data to be processed, the arrangement of the data in the original data set would be leaked to a certain extent, which would increase the risk of data leakage through the generated simulation data. In the embodiment of the invention, randomly selecting the original data to be processed each time reduces this risk.
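The random selection and the quantity threshold can be sketched together as a generation loop. This is an illustrative outline only: `generate_simulated_set`, `synthesize`, and `jitter` are hypothetical names, and `jitter` merely stands in for the per-record synthesis described elsewhere in the specification.

```python
import random
from typing import Callable, List, Optional, Sequence


def generate_simulated_set(
    originals: List[Sequence],
    synthesize: Callable[[Sequence, random.Random], Sequence],
    threshold: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[Sequence]:
    """Generate simulated records until the preset quantity threshold is reached."""
    rng = random.Random(seed)
    # By default the threshold equals the size of the original data set.
    limit = len(originals) if threshold is None else threshold
    simulated: List[Sequence] = []
    while len(simulated) < limit:
        # Random selection instead of an ordered traversal.
        record = rng.choice(originals)
        simulated.append(synthesize(record, rng))
    return simulated


def jitter(record, rng):
    """Placeholder synthesis step (assumption): perturb each numeric field."""
    return [value + rng.uniform(-0.5, 0.5) for value in record]


originals = [[1.0], [2.0], [3.0]]
result = generate_simulated_set(originals, jitter, seed=0)
```

Passing a different `threshold` changes only the size of the resulting simulated data set, which mirrors the role of the preset quantity threshold in the text.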

A specific example is described below. Assume that the original data set contains 1309 pieces of original data, where each piece of original data is personal information of a user and contains features in 14 data dimensions, including floating point type features, integer type features, and enumeration type features. First, the distance between every two pieces of original data among the 1309 pieces may be calculated to obtain a 1309 × 1309 distance matrix. Taking the first 10 rows and the first 10 columns of the distance matrix as an example, part of the distance matrix can be expressed as:

[0.00, 2.00, 2.84, 2.61, 5.27, 5.65, 5.28, 5.39, 6.47, 5.07]
[2.00, 0.00, 2.60, 1.88, 5.41, 5.59, 5.15, 5.34, 6.33, 4.92]
[2.84, 2.60, 0.00, 1.77, 4.55, 4.54, 4.50, 4.54, 4.93, 3.86]
[2.61, 1.88, 1.77, 0.00, 4.66, 4.51, 4.57, 4.45, 5.23, 4.10]
[5.27, 5.41, 4.55, 4.66, 0.00, 3.00, 2.58, 3.17, 3.28, 4.81]
[5.65, 5.59, 4.54, 4.51, 3.00, 0.00, 3.60, 2.58, 3.25, 4.32]
[5.28, 5.15, 4.50, 4.57, 2.58, 3.60, 0.00, 3.55, 3.44, 5.17]
[5.39, 5.34, 4.54, 4.45, 3.17, 2.58, 3.55, 0.00, 3.77, 4.66]
[6.47, 6.33, 4.93, 5.23, 3.28, 3.25, 3.44, 3.77, 0.00, 4.54]
[5.07, 4.92, 3.86, 4.10, 4.81, 4.32, 5.17, 4.66, 4.54, 0.00]

The elements on the diagonal of the matrix represent the distance between each piece of original data and itself and are therefore 0. For each generation, one piece of original data may be randomly selected from the original data set. For example, the randomly selected original data may be represented as: ['1', '1', 'Cavendish, Mrs. Tyrell William (Julia Florence Siegel)', 'fe', 76.0, 1, 0, '19877', 78.85, 'C46', 'S', '6', 'Little on Hall, Staffs']. In the original data, the 5th and 9th dimensions are floating point type features, the 6th and 7th dimensions are integer type features, and the other dimensions are enumeration type features.

Then, the number r to be selected may be calculated from the total number n = 1309 of original data contained in the original data set; as an example, r may be 71. Next, the 71 pieces of adjacent original data of the original data are determined from the distance matrix. Some of the adjacent original data can be expressed as:

['1', '1', 'Andrews, Miss. Kornelia Theodosia', 'female', 63.0, 1, 0, '13502', 77.9583, 'D7', 'S', '10', '', 'Hudson, NY']

['1', '1', 'Stone, Mrs. George Nelson (Martha Evelyn)', 'female', 62.0, 0, 0, '113572', 80.0, 'B28', '', '6', '', 'Cincinatti, OH']

['1', '1', 'Warren, Mrs. Frank Manley (Anna Sophia Atkinson)', 'female', 60.0, 1, 0, '110813', 75.25, 'D37', 'C', '5', '', 'Portland, OR']
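Selecting the adjacent original data from a precomputed distance matrix can be sketched as below. The example uses the upper-left 4 × 4 corner of the distance matrix excerpt given earlier; the function name `nearest_neighbors` and the choice r = 2 are assumptions made only to keep the illustration small.

```python
import random
from typing import List, Sequence


def nearest_neighbors(distance_matrix: Sequence[Sequence[float]], index: int, r: int) -> List[int]:
    """Return the indices of the r records closest to record `index`, excluding itself."""
    row = distance_matrix[index]
    others = [j for j in range(len(row)) if j != index]
    others.sort(key=lambda j: row[j])  # ascending distance
    return others[:r]


# Upper-left 4 x 4 corner of the distance matrix excerpt from the example.
D = [
    [0.00, 2.00, 2.84, 2.61],
    [2.00, 0.00, 2.60, 1.88],
    [2.84, 2.60, 0.00, 1.77],
    [2.61, 1.88, 1.77, 0.00],
]

neighbors = nearest_neighbors(D, index=0, r=2)
# One neighbor is then drawn at random for the numeric features.
chosen = random.Random(0).choice(neighbors)
```

With this corner of the matrix, the two records closest to record 0 are records 1 and 3, at distances 2.00 and 2.61.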

Further, one piece of adjacent original data may be randomly selected from the 71 pieces of adjacent original data. For example, the selected adjacent original data is: ['2', '0', 'Howard, Mrs. Benjamin (Ellen blue Arman)', 'fe', 60.0, 1, 0, '24065', 26.0, '', 'S', '', '', 'Swindon, England'].

Further, the simulation subdata corresponding to the floating point type and the integer type can be generated from the original data and the randomly selected adjacent original data. For example, this simulation subdata may be represented as: [None, 63.0503, 1, 0, None, 78.6783, None], where "None" stands for the enumeration type feature values that remain to be calculated.
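A weighted fusion of one numeric feature can be sketched as follows. The convex form w·a + (1 − w)·b and the weight range [0, 1] are assumptions; the specification only states that a fusion weight is drawn at random from a preset value range and that a weighted sum is computed. The name `fuse_numeric` is hypothetical.

```python
import random


def fuse_numeric(first: float, second: float, rng: random.Random,
                 low: float = 0.0, high: float = 1.0) -> float:
    """Blend one numeric feature of the original record with that of a neighbor."""
    w = rng.uniform(low, high)  # fusion weight, drawn per data dimension
    return w * first + (1.0 - w) * second


rng = random.Random(0)
# 5th dimension of the example: 76.0 in the original data, 60.0 in the neighbor.
fused_5th = fuse_numeric(76.0, 60.0, rng)
# 9th dimension of the example: 78.85 in the original data, 26.0 in the neighbor.
fused_9th = fuse_numeric(78.85, 26.0, rng)
```

Under this assumption every fused value lies between the two inputs, which is consistent with the example values 63.0503 (between 60.0 and 76.0) and 78.6783 (between 26.0 and 78.85).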

Further, the simulation subdata corresponding to the enumeration type can be generated from the original data and all of the adjacent original data. All of the simulation subdata is then combined to obtain the simulation data corresponding to the original data. By way of example, the simulation data may be represented as: ['1', '1', 'Stengel, Mr. Charles oil Henry', 'male', 63.0503, 1, 0, '33638', 78.6783, '', 'S', '', '', 'New York, NY'].
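The enumeration type step and the final combination can be sketched as below. The preset condition on the number of occurrences is assumed here to be "most frequent value", which the source does not state explicitly; `vote_enum` and `combine` are hypothetical names, and the toy values are illustrative.

```python
from collections import Counter
from typing import List, Optional, Sequence


def vote_enum(original_value, neighbor_values: Sequence):
    """Pick the most frequent value among the original record and all its neighbors."""
    counts = Counter([original_value, *neighbor_values])
    return counts.most_common(1)[0][0]


def combine(numeric_part: List[Optional[object]], enum_part: List[Optional[object]]) -> list:
    """Merge two partial records; None marks a slot filled by the other part."""
    return [e if n is None else n for n, e in zip(numeric_part, enum_part)]


# One enumeration dimension: value in the original data plus values in three neighbors.
port = vote_enum("S", ["S", "", "C"])

numeric_part = [None, 63.0503, 1, None]
enum_part = ["1", None, None, port]
record = combine(numeric_part, enum_part)
```

The `None` placeholders play the same role as in the partial record shown in the example, where numeric slots are filled first and enumeration slots are filled afterwards.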

Finally, the above steps can be repeated 1309 times to obtain a simulation data set of the same size as the original data set. For example, the resulting simulation data set may have 1309 rows and 14 columns, the same as the original data set. In the embodiment of the invention, synthesizing the original data with its adjacent original data avoids leaking sensitive information to a certain extent, while producing a simulation data set of the same scale and a similar distribution for machine learning training.

Device embodiment

Referring to fig. 2, a block diagram of a data generating apparatus according to an embodiment of the present invention is shown, where the apparatus may specifically include:

a first obtaining module 201, configured to obtain raw data to be processed from a raw data set;

a second obtaining module 202, configured to obtain, according to other raw data in the raw data set, adjacent raw data of the raw data;

a first generating module 203, configured to generate, for first subdata of any data type included in the original data, analog subdata corresponding to the data type according to the first subdata and second subdata in the adjacent original data; the data type of the second subdata is matched with the data type of the first subdata;

the second generating module 204 is configured to generate analog data of the original data according to the analog sub-data corresponding to the data type.

Optionally, the second obtaining module 202 is specifically configured to:

determining a distance between the raw data and each of the other raw data;

determining the number r to be selected according to the total number n of the original data contained in the original data set; said r is positively correlated with said n;

and selecting the first r pieces of other original data with the nearest distance as the adjacent original data.

Optionally, the second obtaining module 202 is further specifically configured to:

calculating the ratio of the total number n of the original data to a first preset value, and calculating the logarithm of the total number n of the original data relative to a second preset value;

and determining the minimum value of the ratio and the logarithm as the number r to be selected.
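The computation of the number r to be selected can be sketched as follows. The preset values 10 and 2 are placeholders, since the source does not state them; rounding down and clamping to at least 1 are also assumptions. The name `number_to_select` is hypothetical.

```python
import math


def number_to_select(n: int, first_preset: float, second_preset: float) -> int:
    """Compute r as the smaller of n / first_preset and log base second_preset of n."""
    ratio = n / first_preset
    logarithm = math.log(n, second_preset)
    # Rounding down and clamping to 1 are assumptions, not taken from the source.
    return max(1, math.floor(min(ratio, logarithm)))


# Placeholder presets; other values give a different r for the same n.
r_large = number_to_select(1309, first_preset=10, second_preset=2)
r_small = number_to_select(16, first_preset=10, second_preset=2)
```

Both the ratio and the logarithm grow with n, so r remains positively correlated with n, while the logarithm keeps r from growing linearly for large data sets.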

Optionally, the second obtaining module 202 is further specifically configured to: for any other original data, calculating a standard deviation and a difference value between subdata of a first data type in the original data and subdata of the first data type in the other original data; the first data type is an integer type or a floating point type;

calculating the distance between the original data and each other original data according to the standard deviation and the difference between the original data and each other original data; wherein the distance is inversely related to the standard deviation and the distance is positively related to the absolute value of the difference.

Optionally, the second obtaining module 202 is further specifically configured to: for any other original data, performing target operation on the ratio between the standard deviation and the difference value to obtain a target ratio; the target ratio is not less than 0;

determining the number of occurrences of subdata whose values differ between the subdata of a second data type in the original data and the subdata of the second data type in the other original data; the second data type comprises an enumerated type;

and determining the sum of the occurrence times and the target ratio as the distance between the original data and the other original data.
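A mixed-type distance of this kind can be sketched as below. Several details are assumptions: the standard deviation is taken per numeric dimension over the whole data set, the target operation is the absolute value, and the numeric terms are summed across dimensions. The names `column_stds` and `distance` are hypothetical.

```python
import statistics
from typing import Dict, List, Sequence


def column_stds(records: List[Sequence], numeric_idx: Sequence[int]) -> Dict[int, float]:
    """Population standard deviation of each numeric dimension over the data set."""
    return {i: statistics.pstdev([r[i] for r in records]) for i in numeric_idx}


def distance(a: Sequence, b: Sequence, stds: Dict[int, float],
             numeric_idx: Sequence[int], enum_idx: Sequence[int]) -> float:
    """Sum of |difference| / std over numeric dimensions plus the count of differing enum values."""
    numeric = sum(abs(a[i] - b[i]) / stds[i] for i in numeric_idx if stds[i] > 0)
    mismatches = sum(1 for i in enum_idx if a[i] != b[i])
    return numeric + mismatches


# Toy records: one enumeration dimension (index 0) and one numeric dimension (index 1).
records = [["x", 1.0], ["y", 3.0], ["x", 5.0]]
stds = column_stds(records, numeric_idx=[1])
d01 = distance(records[0], records[1], stds, numeric_idx=[1], enum_idx=[0])
d00 = distance(records[0], records[0], stds, numeric_idx=[1], enum_idx=[0])
```

Dividing by the standard deviation makes the distance shrink as the spread of a dimension grows and grow with the absolute difference, matching the inverse and positive relations stated above.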

Optionally, the first generating module 203 is specifically configured to:

if the data type of the first subdata is a first data type, acquiring a fusion weight for the first subdata under any data dimension in the first subdata; calculating a weighted sum between the first subdata under the data dimension and the second subdata under the data dimension according to the fusion weight; generating the simulation subdata according to the weighted sum;

if the data type of the first subdata is a second data type, for the first subdata under any data dimension, acquiring, from among the values of the first subdata under the data dimension and the second subdata under the data dimension, a value whose number of occurrences meets a preset condition; and generating the analog subdata according to the value whose number of occurrences meets the preset condition.

Optionally, the first generating module 203 is further specifically configured to: and randomly selecting a numerical value from a preset value range as the fusion weight.

Optionally, the first generating module 203 is further specifically configured to:

detecting whether the data type of the weighted sum is an integer type or not under the condition that the first data type is the integer type;

and if the data type of the weighted sum is not an integer type, rounding the weighted sum.
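The rounding step for integer type features can be sketched as follows. The rounding mode is not specified in the source; this sketch uses Python's built-in `round`, which rounds halves to the nearest even integer. The name `finalize_integer` is hypothetical.

```python
def finalize_integer(weighted_sum):
    """Return the weighted sum as an integer, rounding only when it is not one already."""
    if isinstance(weighted_sum, int):
        return weighted_sum
    # Built-in round() with one argument returns an int and rounds halves to even.
    return round(weighted_sum)


rounded_up = finalize_integer(0.62)
unchanged = finalize_integer(3)
half_case = finalize_integer(2.5)
```

A different rounding mode, such as always rounding down, would also keep the feature an integer; the choice affects only how the fused value is distributed between the two inputs.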

Optionally, the apparatus further comprises:

the first detection module is used for detecting whether data matched with the simulation data exist in the original data set or not;

and the discarding module is used for discarding the analog data if the data matched with the analog data exists.

Optionally, the first obtaining module 201 is specifically configured to:

and randomly selecting one piece of data from the original data set as the original data to be processed.

Optionally, the apparatus further comprises:

the second detection module is used for detecting whether the total quantity of the analog data reaches a preset quantity threshold value;

the stopping module is used for stopping the operation of acquiring the original data to be processed from the original data set if the total amount of the analog data reaches the preset number threshold; wherein the preset quantity threshold is the total quantity of the original data contained in the original data set.

Optionally, the second generating module 204 is specifically configured to:

combining all the analog subdata to obtain the analog data of the original data.

Since the device embodiment is basically similar to the method embodiment, it is described only briefly; for relevant details, refer to the corresponding description of the method embodiment.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

An embodiment of the present invention provides an apparatus for data generation, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs configured to be executed by one or more processors include instructions for: acquiring original data to be processed from an original data set; acquiring adjacent original data of the original data according to other original data in the original data set; for first subdata of any data type contained in the original data, generating analog subdata corresponding to the data type according to the first subdata and second subdata in the adjacent original data; the data type of the second subdata is matched with the data type of the first subdata; and generating the simulation data of the original data according to the simulation subdata corresponding to the data type.

Fig. 3 is a block diagram illustrating an apparatus 800 for data generation according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.

Referring to fig. 3, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.

The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.

The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice information processing mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the apparatus 800. The sensor assembly 814 may also detect a change in position of the apparatus 800 or a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

Fig. 4 is a schematic diagram of a server in some embodiments of the invention. The server 1900 may vary widely by configuration or performance and may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instructions operating on a server. Still further, a central processor 1922 may be provided in communication with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.

The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input-output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.

A non-transitory computer-readable storage medium in which instructions, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform the data generation method shown in fig. 1.

A non-transitory computer readable storage medium in which instructions, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform a data generation method, the method comprising: acquiring original data to be processed from an original data set; acquiring adjacent original data of the original data according to other original data in the original data set; for first subdata of any data type contained in the original data, generating analog subdata corresponding to the data type according to the first subdata and second subdata in the adjacent original data; the data type of the second subdata is matched with the data type of the first subdata; and generating the simulation data of the original data according to the simulation subdata corresponding to the data type.

The embodiment of the invention discloses A1, a data generation method, the method comprising:

acquiring original data to be processed from an original data set;

A2, the method of claim A1, the obtaining neighboring raw data of the raw data from other raw data in the raw data set, comprising:

determining a distance between the raw data and each of the other raw data;

A3, the method according to claim A2, wherein the determining a number r to be selected according to the total number n of raw data contained in the raw data set comprises:

A4, the method of claim A2, the determining distances between the raw data and the respective other raw data comprising:

for any other original data, calculating a standard deviation and a difference value between subdata of a first data type in the original data and subdata of the first data type in the other original data; the first data type is an integer type or a floating point type;

A5, the method of claim A4, the calculating the distance between the original data and each of the other original data according to the standard deviation and the difference between the original data and each of the other original data, comprising:

for any other original data, performing target operation on the ratio between the standard deviation and the difference value to obtain a target ratio; the target ratio is not less than 0;

A6, the method according to any one of claims A1 to A5, wherein the generating the analog sub-data corresponding to the data type according to the first sub-data and the second sub-data in the adjacent original data includes:

A7, the method of claim A6, the obtaining fusion weights comprising:

and randomly selecting a numerical value from a preset value range as the fusion weight.

A8, the method of claim A6, wherein before the generating the simulated sub-data according to the weighted sum, the method further comprises:

A9, the method according to any one of claims A1-A5, wherein after generating the simulated data of the original data according to the simulated subdata corresponding to the data type, the method further includes:

detecting whether data matched with the simulation data exist in the original data set;

and if the data matched with the simulation data exist, discarding the simulation data.

A10, the method according to any one of claims A1-A5, wherein the obtaining raw data to be processed from a raw data set comprises:

A11, the method according to any one of claims A1-A5, wherein after generating the simulated data of the original data according to the simulated subdata corresponding to the data type, the method further includes:

detecting whether the total quantity of the analog data reaches a preset quantity threshold value;

if the total quantity of the analog data reaches the preset quantity threshold, stopping the operation of acquiring the original data to be processed from the original data set;

wherein the preset quantity threshold is the total quantity of the original data contained in the original data set.

A12, the method according to any one of claims A1-A5, wherein the generating the simulated data of the original data according to the simulated subdata corresponding to the data type includes:

The embodiment of the invention discloses B13, a data generation device, wherein the device comprises:

The embodiment of the invention discloses C14, an apparatus for data generation, the apparatus comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs configured to be executed by the one or more processors comprise instructions for:

acquiring original data to be processed from an original data set;

Embodiments of the present invention disclose D15, a machine-readable medium having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform a data generation method as described in one or more of A1-A12.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

The data generating method, the data generating device and the device for generating data provided by the present invention are described in detail above, and specific examples are applied in the text to explain the principle and the implementation of the present invention, and the description of the above embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. A method of data generation, the method comprising:

acquiring original data to be processed from an original data set;

2. The method of claim 1, wherein the obtaining neighboring raw data of the raw data from other raw data in the raw data set comprises:

determining a distance between the raw data and each of the other raw data;

3. The method according to claim 2, wherein the determining the number r to be selected according to the total number n of the raw data contained in the raw data set comprises:

4. The method of claim 2, wherein determining the distance between the raw data and each of the other raw data comprises:

5. The method of claim 4, wherein calculating the distance between the original data and each of the other original data according to the standard deviation and the difference between the original data and each of the other original data comprises:

6. The method of any one of claims 1 to 5, wherein the generating, according to the first sub-data and the second sub-data in the adjacent original data, the analog sub-data corresponding to the data type includes:

7. The method of claim 6, wherein the obtaining the fusion weight comprises:

8. An apparatus for generating data, the apparatus comprising:

9. An apparatus for data generation comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for:

acquiring original data to be processed from an original data set;

10. A machine-readable medium having stored thereon instructions, which when executed by one or more processors, cause an apparatus to perform the data generating method of any of claims 1 to 7.