CN111737740A

CN111737740A - Multi-party sequence data publishing method and system satisfying differential privacy

Info

Publication number: CN111737740A
Application number: CN202010541485.0A
Authority: CN
Inventors: 唐朋; 郭山清; 鞠雷; 刘高源
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2020-06-15
Filing date: 2020-06-15
Publication date: 2020-10-02
Anticipated expiration: 2040-06-15
Also published as: CN111737740B

Abstract

The disclosure provides a multi-party sequence data publishing method and system satisfying differential privacy, belonging to the technical field of data processing. Each data owner preprocesses its data: it adds start and end markers to each sequence, truncates over-long sequences, and counts the symbol types. Under differential privacy, the data owners and a third party use a batch processing method: starting from the zeroth layer, they apply a node splitting discrimination protocol to all nodes of each layer and split the nodes whose scores exceed a threshold, until the prediction suffix tree is constructed. The third party then generates a new data set for publication from the constructed prediction suffix tree. The method publishes sequence data sets with higher data utility while satisfying differential privacy, and effectively reduces communication overhead.

Description

Multi-party sequence data publishing method and system satisfying differential privacy

Technical Field

The disclosure relates to the technical field of data processing, in particular to a multi-party sequence data issuing method and system meeting differential privacy.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

Sequence data is a common type of data. Given an alphabet $\mathcal{I}$, a sequence of length $len$ can be represented as $S = x_1 x_2 \dots x_{len}$, where each $x_i$ is a symbol (element) of $\mathcal{I}$. Sequence data common in daily life includes the travel trajectories of citizens, the browsing records of Internet users, and the like.

At present, traditional sequence data publishing methods based on differential privacy mainly address the single-party scenario, in which a single data owner holds all of the sequence data and publishes its sequence data set under differential privacy. One such method is based on a prefix tree model: it constructs a prefix tree from the original sequence data under differential privacy and then generates new sequence data from the model. Another is based on a variable-length n-gram model: it constructs an n-gram model from the original sequence data under differential privacy and then generates new sequence data from the constructed model. For both approaches, however, the utility of the resulting data set degrades if the constructed model is too deep. To solve this problem, researchers proposed PrivTree, an optimized sequence data set publishing method based on a prediction suffix tree model. It exploits the monotonicity of the node statistics in the prediction suffix tree to obtain a new Laplace mechanism under which the magnitude of the Laplace noise added to non-leaf nodes is independent of the depth of the tree. This markedly reduces the noise and improves the utility of the published sequence data set.

However, the inventors of the present disclosure found that in a multi-party scenario the data belongs to multiple data owners. When several data owners jointly publish their local data sets, the published data as a whole is prone to reveal sensitive individual information from each local data set, and each data owner may also leak sensitive individual information from its own local data set to the other data owners.

Disclosure of Invention

To overcome the shortcomings of the prior art, the present disclosure provides a multi-party sequence data publishing method and system satisfying differential privacy: the data owners and the data publisher jointly construct a prediction suffix tree under differential privacy, and the data publisher then generates a new data set for publication from the constructed prediction suffix tree.

In order to achieve the purpose, the following technical scheme is adopted in the disclosure:

the first aspect of the disclosure provides a multi-party sequence data publishing method meeting differential privacy.

A multi-party sequence data publishing method satisfying differential privacy, applied to a first terminal, comprises the following steps:

preprocessing the held data sequence;

receiving a predicted suffix tree and a node queue which are sent by a second terminal and only comprise root nodes, judging whether nodes in the node queue need to be split or not by adopting a batch processing mode under the condition of meeting differential privacy, and sending a judgment result to the second terminal so that the second terminal obtains the final structure of the predicted suffix tree;

and, under differential privacy, calculating the prediction histograms of the nodes to obtain the parameters of the prediction suffix tree, and sending the parameters to the second terminal, so that the second terminal generates a new overall sequence data set according to the structure and the parameters of the prediction suffix tree.

A second aspect of the present disclosure provides a data providing apparatus.

A data providing device comprising a processor communicatively coupled to an external second terminal, the processor configured to:

preprocessing the held data sequence;

The third aspect of the disclosure provides a multi-party sequence data publishing method meeting differential privacy.

A multi-party sequence data publishing method satisfying differential privacy, applied to a second terminal, comprises the following steps:

initializing a prediction suffix tree only containing a root node, initializing a node queue for storing nodes which are not traversed, and inserting the root node into the queue;

receiving a node splitting judgment result sent by a first terminal, and obtaining a final structure of the prediction suffix tree when all nodes are split;

and receiving a prediction histogram of a node sent by the first terminal to obtain parameters of a prediction suffix tree, and generating a group of new overall sequence data sets according to the structure and the parameters of the prediction suffix tree.

A fourth aspect of the present disclosure provides a multi-party sequence data issuing apparatus satisfying differential privacy.

A multi-party sequence data dissemination device satisfying differential privacy comprising a processor communicatively coupled to a first terminal, the processor configured to:

The fifth aspect of the present disclosure provides a multiparty sequence data distribution method satisfying differential privacy.

A multi-party sequence data publishing method meeting differential privacy comprises the following steps:

each first terminal preprocesses the held data sequence and keeps the preprocessed data sequence at the first terminal;

the second terminal initializes a prediction suffix tree containing only a root node, initializes a node queue for storing untraversed nodes, and inserts the root node into the queue;

the second terminal, together with the first terminals, judges in a batch processing manner and under differential privacy whether the nodes in the node queue need to be split, and the first terminals send the judgment result to the second terminal;

when all the nodes in the node queue have been processed, the second terminal obtains the final structure of the prediction suffix tree;

the second terminal, together with the first terminals, calculates the prediction histograms of the nodes under differential privacy to obtain the parameters of the prediction suffix tree, and the parameters are sent to the second terminal;

the second terminal generates a new set of overall sequence data based on the structure and parameters of the predicted suffix tree.

A sixth aspect of the present disclosure provides a multiparty sequence data distribution system satisfying differential privacy.

A multi-party sequence data issuing system meeting differential privacy comprises at least two first terminals and at least one second terminal, wherein each first terminal is in communication connection with the second terminal;

the second terminal initializes a prediction suffix tree only containing a root node, initializes a node queue for storing nodes which are not traversed, and inserts the root node into the queue;

Compared with the prior art, the beneficial effect of this disclosure is:

1. In the method, device or system of the present disclosure, the first terminals and the second terminal first jointly construct a prediction suffix tree under differential privacy, and the second terminal then generates a new data set for publication from the constructed tree. This solves the problem in traditional data publishing methods that a data collector can steal and leak users' sensitive information, and it also effectively reduces communication overhead.

2. Compared with the traditional sequence data issuing method meeting the differential privacy, the method, the device or the system expands the traditional single-party scene to the multi-party scene, ensures that the personal privacy data in the local data set of each participant can not be obtained by other participants, and effectively solves the problem of personal sensitive information leakage in the multi-party data fusion process.

Drawings

The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.

Fig. 1 is a flowchart of a publishing method provided in embodiment 1 of the present disclosure.

Fig. 2 is a schematic diagram of a predicted suffix tree provided in embodiment 1 of the present disclosure.

Fig. 3 is an example of a node task ranking result provided in embodiment 1 of the present disclosure.

Fig. 4 is an example of generating blocks in batch processing provided in embodiment 1 of the present disclosure.

Fig. 5 is a comparison graph of the data utility of the method provided in embodiment 1 of the present disclosure and of three methods, Independent, PrivTree and NoPrivacy, under different privacy budgets.

Fig. 6 is a comparison graph of data utility of the method provided in embodiment 1 of the present disclosure and the Independent and PrivTree methods under different numbers of participants.

Fig. 7 is an effect diagram of an improved node splitting discrimination protocol provided in embodiment 1 of the present disclosure.

Fig. 8 is a comparison diagram of the running time of the batch-based method and the node-based method for node splitting according to embodiment 1 of the present disclosure.

Fig. 9 is a schematic structural diagram of a distribution system provided in embodiment 6 of the present disclosure.

Detailed Description

The present disclosure is further described with reference to the following drawings and examples.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.

Example 1:

as shown in fig. 1, an embodiment 1 of the present disclosure provides a multi-party sequence data distribution method satisfying differential privacy, applied to a first terminal (data owner), including the following steps:

preprocessing the held data sequence, and reserving the preprocessed data sequence at the first terminal;

The detailed steps are as follows:

Step S1: given a reasonable sequence length threshold $len_{max}$, the data owner first adds a start marker $\texttt{\$}$ and an end marker $\texttt{\&}$ to the beginning and the end of each original sequence, and then truncates every sequence whose length exceeds $len_{max}$.

For example, an original sequence $S = x_1 x_2 \dots x_n$ is converted into $S' = \texttt{\$}\, x_1 x_2 \dots x_n\, \texttt{\&}$. The length $n + 2$ is then compared with $len_{max}$: if $n + 2 > len_{max}$, the sequence is truncated to length $len_{max}$; otherwise, the sequence is kept unchanged.

In differentially private sequence data publishing, the amount of injected noise is proportional to the length of the longest sequence in the original data, and too much noise reduces the utility of the published data. The sequence length of the original data set is therefore bounded in order to reduce the noise, with the bound chosen so that the information lost through truncation is as small as possible.
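Step S1 can be sketched as follows. This is a minimal illustration rather than the filing's implementation: the function name `preprocess` is hypothetical, and the truncation rule (keep the first `len_max` symbols of the marked sequence, so that an over-long sequence loses its end marker) is an assumption, because the text above does not spell out the truncated form.

```python
START, END = "$", "&"  # start and end markers named in step S1


def preprocess(seq, len_max):
    """Wrap a sequence with markers, then cap it at len_max symbols."""
    marked = [START, *seq, END]
    # Assumption: truncation keeps the leading len_max symbols.
    return marked[:len_max]


if __name__ == "__main__":
    print(preprocess(list("abc"), 6))      # short sequence, kept whole
    print(preprocess(list("abcdefg"), 6))  # n + 2 > len_max, truncated
```

A smaller `len_max` lowers the noise needed later but discards more of each long sequence, which is the trade-off described above.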

Step S2: based on the information provided by the K data owners, the third party initializes a prediction suffix tree $\tau$ containing only the root node $v_1$, initializes a node queue Q for storing untraversed nodes, and inserts $v_1$ into Q.

Step S3: check the queue Q. If Q is empty, the construction of the structure of the prediction suffix tree $\tau$ is finished and execution continues with step S5; otherwise there are still nodes to be judged, and execution continues with step S4.

Step S4: the third party and the K data owners judge, in a batch processing manner and under differential privacy, whether the nodes in Q need to be split. Specifically, the third party takes a certain number of nodes out of Q; the third party and the K data owners jointly execute the node splitting discrimination protocol, and the discrimination result is sent to the third party; the third party splits the nodes that need splitting and puts the resulting new nodes into Q; execution then returns to step S3.

Step S5: parameters are filled into the prediction suffix tree structure generated in step S4. That is, for each node $v$, the data owners and the third party jointly compute, under differential privacy, the prediction histogram $hist(v)$ of $v$. The prediction histogram of a node is a vector of length $|\mathcal{I}|$, where $\mathcal{I}$ is the alphabet, and each element of the vector corresponds to one symbol of $\mathcal{I}$.

Step S6: the third party generates a new data set D' for publication from the constructed prediction suffix tree $\tau$.

For step S2, the prediction suffix tree (PST) is a Markov model commonly used to characterize the statistics of sequence data. In a PST, each node $v$ is associated with a prediction string $dom(v)$ and a prediction histogram $hist(v)$. Specifically, $dom(v)$ is a string of symbols from the alphabet $\mathcal{I}$, and $hist(v)$ is a vector of length $|\mathcal{I}|$ in which each element corresponds to one symbol $x$ of $\mathcal{I}$ and is denoted $hist(v)[x]$.

For example, FIG. 2 depicts a PST constructed on a data set D. In a PST, prepending one symbol of the alphabet $\mathcal{I}$ to the prediction string of a node $v$ yields a child node $v'$ of $v$. Thus, for each child node $v'$ of $v$, $dom(v)$ is a suffix of $dom(v')$.

The node with $dom(v_1) = \emptyset$ is the root node $v_1$. Starting from $v_1$, the data owners and the third party split nodes iteratively. Specifically, they determine whether the score of the current node is greater than a given threshold. If it is, the node is split into $|\mathcal{I}|$ child nodes, and the prediction string of each child node is obtained by inserting one symbol of the alphabet $\mathcal{I}$ at the front of the prediction string of the split node; otherwise, the node is regarded as a leaf node.
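The queue-driven construction of steps S2 to S4 and the suffix property of the prediction strings can be sketched as below. This is a single-process illustration under stated assumptions: `Node`, `build_structure` and the callback `should_split` are hypothetical names, and `should_split` merely stands in for the joint node splitting discrimination protocol, which in the method is executed by the data owners and the third party under differential privacy.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    dom: tuple                                   # prediction string; () for the root
    children: list = field(default_factory=list)


def build_structure(alphabet, should_split, batch_size=4):
    """Breadth-first construction of the tree structure, batch by batch.

    should_split(batch) is a stand-in for the joint protocol and must
    return one bool per node in the batch.
    """
    root = Node(dom=())
    queue = deque([root])
    while queue:                                 # step S3: stop when Q is empty
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        for node, split in zip(batch, should_split(batch)):
            if not split:
                continue                         # node stays a leaf
            for x in alphabet:
                child = Node(dom=(x, *node.dom))  # prepend one symbol
                node.children.append(child)
                queue.append(child)
    return root


if __name__ == "__main__":
    # Toy rule: split every node whose prediction string is shorter than 2.
    tree = build_structure("ab", lambda b: [len(n.dom) < 2 for n in b])
    print([c.dom for c in tree.children])
```

Because a child is created by prepending a symbol, `child.dom[1:] == node.dom` always holds, which is the suffix relation between $dom(v)$ and $dom(v')$.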

For step S4, taking the splitting of a node $v$ as an example, the details are as follows:

Step S4.1: each of the K data owners $P_k$ computes, from its own local data set $D_k$, the noisy local values $sum'_{ik}$ and $\theta'_k$, where

$sum'_{ik} = \sum_{n \neq i} hist(v)[x_n] - depth(v) \cdot \delta / K + Lap(\lambda)_k$, $\theta'_k = (\theta - \delta)/K + Lap(\lambda)_k$.

Here $\varepsilon_1$ is the privacy budget allocated to the PST structure construction step, and $Lap(\lambda)_k = g_k - h_k$, where $g_k$ and $h_k$ are two independent random variables generated by $P_k$ that follow a Gamma distribution.

To meet the differential privacy protection requirement, the privacy budget $\varepsilon$ is divided equally into $\varepsilon_1$ and $\varepsilon_2$, i.e. $\varepsilon_1 = \varepsilon_2 = \varepsilon/2$; $\varepsilon_1$ is allocated to the PST structure construction step and $\varepsilon_2$ to the PST parameter acquisition step;

Step S4.2: the K data owners and the third party repeatedly invoke the minimum value protocol to obtain the minimum of the sums. To meet the privacy protection requirement, this result is split into a sum of K terms: data owner $P_K$ holds the remaining term, and every other data owner $P_i$ holds $s_i = -r_i$, where $r_i$ is a random number generated locally by $P_i$ and known only to $P_i$;

Step S4.3: the K data owners and the third party invoke the maximum value protocol; the result is likewise split into a sum of K terms, which are distributed to the K data owners;

Step S4.4: data owner $P_k$ generates a new random number $r_k$ and encrypts it as $E(r_k)$;

Step S4.5: each data owner computing together

Step S4.6: data owner P_kComputing

Step S4.7: each data owner computing together

and jointly decrypt the result. If the decrypted value meets the splitting condition, the node is split; otherwise, the node does not need to be split.
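The per-owner noise term $Lap(\lambda)_k = g_k - h_k$ of step S4.1 can be illustrated with the standard decomposition of a Laplace variable into Gamma differences. The sketch below assumes that $g_k$ and $h_k$ have shape $1/K$ and scale $\lambda$, which is the usual choice for this construction but is not stated in the text above; the function names are hypothetical. Under that assumption the K shares sum to a single $Lap(\lambda)$ sample, so the aggregate carries one unit of Laplace noise and not K units.

```python
import random


def noise_share(K, lam, rng):
    """One owner's share g_k - h_k.

    Assumption: g_k and h_k are Gamma(shape=1/K, scale=lam) variables.
    """
    g = rng.gammavariate(1.0 / K, lam)
    h = rng.gammavariate(1.0 / K, lam)
    return g - h


def joint_noise(K, lam, rng):
    """Sum of the K owners' shares; distributed as Laplace with scale lam."""
    return sum(noise_share(K, lam, rng) for _ in range(K))


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [joint_noise(4, 2.0, rng) for _ in range(20000)]
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    # A Laplace variable with scale lam has variance 2 * lam**2 (8 here).
    print(round(mean, 2), round(var, 2))
```

No single owner ever holds the full noise value, which is why the shares can be added to local statistics before any aggregation takes place.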

For the minimum value protocol of step S4.2, specifically:

Step S4.2.1: initialize the variables $s_1 = c_{11}, \dots, s_K = c_{1K}$;

Step S4.2.2: data owner P_kRegenerating the random number r_kAnd performing encryption E (r)_k)；P_kComputing

Wherein

From P_k-1Is sent to P_kUp to P_KTo obtain

Step S4.2.3: data owner P_kComputing

And sent to the third party and calculated by the third party

Through common decryption to obtain

Step S4.2.4: if it is

compute $E(g_1) = 1 - |r|/r$ and $E(g_2) = 1 + |r|/r$; otherwise, compute $E(g_1) = 1 + |r|/r$ and $E(g_2) = 1 - |r|/r$;

Step S4.2.5: data owner P_kComputing

and sends it to the third party, where $u_k$ is a newly generated random number;

step S4.2.6: third party computing

All parties decrypt together to obtain

Step S4.2.7: update $s_1 = -u_1, \dots, s_{K-1} = -u_{K-1}, s_K = temp/2 - u_K$;
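The way an intermediate result is held as a sum of K terms, with the first K-1 owners holding $-r_i$ (or $-u_i$ after the update) and the last owner holding the remainder, can be sketched in the clear as follows. This is only a plaintext illustration: in the protocol the values are handled in encrypted form $E(\cdot)$ and no party sees the reconstructed value. The helper name `additive_shares` and the `bound` on the random numbers are assumptions made for the example.

```python
import random


def additive_shares(value, K, rng, bound=10**6):
    """Split value into K shares that add up to value.

    Owners 1..K-1 hold s_i = -r_i for a private random r_i;
    owner K holds the remainder value + sum(r_i).
    """
    r = [rng.randrange(bound) for _ in range(K - 1)]
    return [-ri for ri in r] + [value + sum(r)]


if __name__ == "__main__":
    shares = additive_shares(42, 5, random.Random(0))
    print(shares, sum(shares))
```

Each individual share looks random on its own; only the sum of all K shares reveals the value.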

For step S5, specifically:

Step S5.1: for a leaf node $v$, the data owners and the third party jointly compute, under differential privacy, the prediction histogram $hist(v)$ of $v$, which is a vector of length $|\mathcal{I}|$ with one element for each symbol of the alphabet $\mathcal{I}$. To satisfy the differential privacy requirement, each element of $hist(v)$ is injected with noise $\eta$ drawn from a Laplace distribution with scale $1/\varepsilon_2$, i.e. $\eta = Lap(1/\varepsilon_2)$. The histogram after noise injection is referred to as the noisy histogram of $v$.

Step S5.2: for a non-leaf node $v'$, the prediction histogram of $v'$ is computed from the noisy histograms of its child nodes, where $v$ denotes any child node of $v'$.
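Steps S5.1 and S5.2 can be sketched in a single process as follows. Two assumptions are made and should be read as such: the noise is drawn locally here, whereas the method computes the histograms jointly, and a non-leaf histogram is taken to be the element-wise sum of its children's noisy histograms, which is the natural reading of step S5.2 but is not given explicitly above. All function names are hypothetical.

```python
import random


def laplace(scale, rng):
    # The difference of two i.i.d. exponential variables is Laplace-distributed.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_leaf_hist(counts, eps2, rng):
    """Add Lap(1/eps2) noise to every element of a leaf histogram."""
    return {x: c + laplace(1.0 / eps2, rng) for x, c in counts.items()}


def parent_hist(child_hists):
    """Assumed aggregation: element-wise sum over the children."""
    total = {}
    for hist in child_hists:
        for x, c in hist.items():
            total[x] = total.get(x, 0.0) + c
    return total


if __name__ == "__main__":
    rng = random.Random(1)
    leaves = [noisy_leaf_hist({"a": 3, "&": 1}, 0.5, rng),
              noisy_leaf_hist({"a": 0, "&": 4}, 0.5, rng)]
    print(parent_hist(leaves))
```

Deriving the non-leaf histograms from already noisy leaves costs no additional privacy budget, since it only post-processes values that are already protected.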

For step S6, specifically:

Step S6.1: compute the total number of sequences to be generated and set a counter to this number;

Step S6.2: the third party first initializes a sequence $s_0 = \texttt{\$}$ and then inserts symbols into $s_0$ iteratively;

Step S6.3: in the $i$-th iteration, the third party has already obtained the subsequence $s_{i-1}$. It determines the node $v$ in $\tau$ whose prediction string corresponds to $s_{i-1}$, selects a symbol $x_i$ from the alphabet $\mathcal{I}$ according to the probability distribution $\Pr[x_i = x] = hist(v)[x] / \|hist(v)\|_1$, and appends $x_i$ to $s_{i-1}$, thereby obtaining the sequence $s_i$;

Step S6.4: if $x_i$ is $\texttt{\&}$, $s_i$ is regarded as a complete sequence and its generation ends; otherwise, step S6.3 is repeated.

Step S6.5: check the counter; if fewer sequences than required have been generated, steps S6.2 to S6.4 are executed again; otherwise, the generation of the published data set D' is finished.
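The sampling loop of steps S6.2 to S6.4 can be sketched as follows. The tree is represented as a plain dictionary from prediction strings to histograms, which is a simplification chosen for the example. Three further assumptions are labeled in the code: the node is located by longest-suffix match, negative noisy counts are clamped to zero before sampling, and a `max_len` guard stops runaway sequences. None of these details is fixed by the text above.

```python
import random

START, END = "$", "&"


def lookup(hists, prefix):
    """Return the histogram of the node whose prediction string is the
    longest suffix of prefix (assumed matching rule)."""
    for start in range(len(prefix) + 1):
        key = tuple(prefix[start:])
        if key in hists:
            return hists[key]
    raise KeyError("tree has no root entry ()")


def generate(hists, rng, max_len=50):
    """Generate one sequence by repeated sampling from the tree."""
    seq = [START]
    while seq[-1] != END and len(seq) < max_len:   # max_len: assumed guard
        hist = lookup(hists, seq)
        symbols = list(hist)
        weights = [max(hist[x], 0.0) for x in symbols]  # clamp noisy negatives
        seq.append(rng.choices(symbols, weights=weights)[0])
    return seq


if __name__ == "__main__":
    toy = {(): {"a": 1.0, "&": 1.0},
           ("$",): {"a": 1.0, "&": 0.0},
           ("a",): {"a": 1.0, "&": 3.0}}
    rng = random.Random(7)
    print([generate(toy, rng) for _ in range(3)])
```

Step S6.5 corresponds to calling `generate` once per sequence until the counter from step S6.1 is reached.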

The batch processing scheme for node splitting judgment in this embodiment, which addresses the problem of excessive communication overhead, is described further below.

In the traditional method, the splitting judgment is performed one node at a time, and this sequential manner incurs excessive communication and computation cost. Specifically, if the fan-out of the PST is $l$, the number of nodes in the PST can reach $(l^h - 1)/(l - 1)$, where $h$ is the height of the PST. Judging whether one node should be split requires $l + 1$ communication rounds. The total number of rounds needed to construct a complete PST is therefore as high as $(l^h - 1)/(l - 1) \cdot (l + 1) \approx l^h$, which results in significant communication and computation cost. This embodiment addresses the problem from two aspects, as follows:

First, the node splitting judgment needs to select the minimum of $l$ sums. To this end, the sums are first divided into pairs, and the smaller sum of every pair is selected simultaneously. The selected sums are then again divided into pairs and the smaller value of each pair is selected, and so on until the minimum is obtained. In this way the number of communication rounds per node is reduced from linear in $l$ to logarithmic in $l$.

Taking fig. 3 as an example, the minimum value is obtained according to the above-described scheme.

Assume that the leaves are $l = 8$ sums. In the first round, the 4 smaller sums are selected from 4 pairs. In the second round, the two smaller sums are selected from two pairs. In the last round (round 3), the minimum sum is selected. The results of the selections are denoted $C_1, \dots, C_7$. Furthermore, nodes in different subtrees, i.e. nodes at the same level, can be judged simultaneously. Taking the example shown in FIG. 2 and assuming that the tree is a PST, after the splitting judgment on $C_7$ yields $C_5$ and $C_6$, the splitting judgments on $C_5$ and $C_6$ can be carried out simultaneously.
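The pairwise selection described above, together with the round count of the sequential approach, can be checked with a short plaintext sketch. The function names are hypothetical, and the comparison is done in the clear here, whereas the protocol compares encrypted shares.

```python
def tournament_min(values):
    """Pairwise-elimination minimum; returns (minimum, rounds, comparisons)."""
    current, rounds, comparisons = list(values), 0, 0
    while len(current) > 1:
        nxt = [min(a, b) for a, b in zip(current[0::2], current[1::2])]
        comparisons += len(nxt)
        if len(current) % 2:            # an odd element advances unopposed
            nxt.append(current[-1])
        current, rounds = nxt, rounds + 1
    return current[0], rounds, comparisons


def sequential_rounds(l, h):
    """Round count quoted in the text: (l**h - 1) / (l - 1) * (l + 1)."""
    return (l**h - 1) // (l - 1) * (l + 1)


if __name__ == "__main__":
    print(tournament_min([5, 3, 8, 1, 9, 2, 7, 4]))  # 8 sums as in FIG. 3
    print(sequential_rounds(10, 4))                  # fan-out 10, height 4
```

For 8 sums the selection finishes in 3 rounds using 7 comparisons, which matches the 7 selection results $C_1, \dots, C_7$; with fan-out 10 and height 4 the sequential scheme needs 12221 rounds, the same order of magnitude as $l^h = 10000$.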

Based on the above discussion, a batch-based construction method is proposed. Judging whether a node needs to be split requires multiple interactions (a series of minimum and maximum computations); the computation to be completed in one interaction is called a task, so the interactions can be viewed as a series of ordered tasks. The batch processing scheme must, on the one hand, keep the number of tasks in each batch as equal as possible and, on the other hand, keep the required marking information (for example, which node each task belongs to and how many tasks that node has) as small as possible. This embodiment addresses these two requirements as follows:

To keep the number of tasks in each batch as equal as possible, the concept of a "block" is introduced and a "splicing" method is proposed: each block contains several tasks that come from different nodes; the number of tasks contributed by each node differs, but the total number of tasks in a block is fixed. The details are as follows.

First, the tasks of each node are ordered. As shown in FIG. 3, a node contains 8 values $m_1, m_2, \dots, m_8$; selecting the minimum of the 8 values requires 7 comparisons $T_1, T_2, \dots, T_7$, so the node has 7 tasks. The 8 values and 7 tasks form a tree of depth 4 (that is, $\log_2 8 + 1$), in which the leaf layer holds the values and the non-leaf layers hold the tasks. Let the layer of the leaves be layer 3 and the layer of the root be layer 0. For any path from a leaf to the root, a task at layer $i$ on the path must wait for the task at layer $i + 1$ to complete. Tasks from different nodes are then spliced together to form a block. Specifically, as shown in the left diagram of FIG. 4, a block contains the layer-2 tasks of the $i$-th node, the layer-1 tasks of the $(i-1)$-th node, and the layer-0 task of the $(i-2)$-th node. The entire block thus contains 7 tasks from 3 nodes.

To reduce the required marking information, a "sliding" method is proposed, which is described in detail below.

For tasks from a certain node, their positions in the blocks are constantly sliding downward so that the task at the lowest end of any one block is the last task of a certain node. As shown in the right diagram of fig. 4, for consecutive 3 blocks, the task from the ith node is located at the top three positions in the 1 st block, two positions in the middle in the second block, and one position at the lowest end in the 3 rd block. Thus, by the number of the block and the position of the task in the block, it can be determined from which node each task is, the number of tasks of the node, thereby reducing the required marking information.
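The splicing rule can be made concrete with a small sketch that lists which tasks fall into a given block for the 8-value example (three task layers holding 4, 2 and 1 tasks). The function `block_tasks` and the `(node, layer, index)` encoding are hypothetical; the ordering inside the returned list only mirrors the idea that a task's position identifies its node, and is not taken from FIG. 4.

```python
def block_tasks(block, task_layers=3):
    """List (node, layer, index) triples for one block.

    Block b holds the deepest task layer of node b, the next layer of
    node b-1, and so on, so that every task finds its prerequisites
    completed in the previous block.
    """
    tasks = []
    for offset in range(task_layers):
        node, layer = block - offset, task_layers - 1 - offset
        if node < 0:
            continue                    # warm-up: earlier nodes do not exist
        tasks += [(node, layer, i) for i in range(2 ** layer)]
    return tasks


if __name__ == "__main__":
    for b in (0, 1, 2, 5):
        print(b, len(block_tasks(b)), sorted({n for n, _, _ in block_tasks(b)}))
```

Once the pipeline is full, every block holds 7 tasks from 3 consecutive nodes, so all batches have the same size and the block number plus the position inside the block is enough to identify a task's node.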

Through formal analysis, the multi-party sequence data distribution method (DPST) satisfying differential privacy in the present embodiment can provide higher utility of distributed data and lower communication overhead while satisfying differential privacy protection.

Comparison methods were set up for the experiments. Analysis of the experimental results shows that the data published by the proposed DPST method has better utility and that the communication overhead of the publishing process is lower. The experimental methods are listed in Table 1.

Table 1: experimental methods

To better illustrate the advantages of the algorithm of this embodiment, data utility is measured with two widely used metrics: the precision of top-k frequent sequence mining results and the sequence length distribution error (total variation distance). Algorithm performance, i.e. communication overhead, is assessed by comparing the running times of the algorithms under the same inputs. For each set of experiments, every algorithm was run 100 times and the average of the results was recorded.
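The two utility metrics can be sketched with their common textbook definitions. The exact formulas used in the experiments are not given here, so the definitions below are assumptions: precision is the fraction of the true top-k frequent sequences that also appear in the top-k of the published data, and the length error is the total variation distance between the two length distributions. Function names are hypothetical.

```python
from collections import Counter


def topk_precision(true_counts, published_counts, k):
    """Share of the true top-k patterns recovered in the published top-k."""
    top_true = {s for s, _ in Counter(true_counts).most_common(k)}
    top_pub = {s for s, _ in Counter(published_counts).most_common(k)}
    return len(top_true & top_pub) / k


def length_tvd(original, published):
    """Total variation distance between sequence length distributions."""
    p, q = Counter(map(len, original)), Counter(map(len, published))
    n, m = len(original), len(published)
    return 0.5 * sum(abs(p[L] / n - q[L] / m) for L in set(p) | set(q))


if __name__ == "__main__":
    print(topk_precision({"ab": 5, "bc": 4, "cd": 1},
                         {"ab": 6, "cd": 3, "bc": 2}, k=2))
    print(length_tvd(["ab", "abc"], ["ab", "ab"]))
```

Higher precision and lower total variation distance both indicate that the published data set stays closer to the original one.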

The data used in the experiment were from two real sets of data, and the specific characteristics of each database data are shown in table 2 below.

Table 2: data characteristics in the database.

The usability of the DPST algorithm is illustrated by analyzing experimental data below.

First, the utility of the data published by each algorithm is analyzed experimentally, as shown in Fig. 5 and Fig. 6.

The data utility of the method of this embodiment (the DPST algorithm) is compared with that of the Independent, PrivTree and NoPrivacy algorithms under different privacy budgets, where the privacy parameter is set to each of {0.1, 0.2, 0.4, 0.8, 1.0, 1.6} and the number of data owners is fixed at 2.

Fig. 5(a) to Fig. 5(d) show the precision of the top-k frequent sequence mining results on the data published by each algorithm; Fig. 5(e) and Fig. 5(f) show the sequence length distribution error (total variation distance) of the data published by each algorithm. The precision and total variation distance of NoPrivacy do not depend on the privacy budget and represent the best results that can be achieved.

In all experiments DPST achieves the same utility as PrivTree, because with the node splitting discrimination protocol DPST can construct the same prediction suffix tree as PrivTree, and the amount of injected noise is the same for both methods. DPST also outperforms Independent in all cases, because Independent requires each party to inject its own share of noise so that its local data set satisfies differential privacy, which leads to poor utility of the final integrated data set.

The method of this example (i.e., DPST algorithm) was compared to the data utility of both Independent and PrivTree algorithms at different numbers of participants, where: the number of data owners is set to {2,3,4,6,8,10}, and the privacy parameter is fixed to 0.4.

Fig. 6(a) to Fig. 6(d) show the precision of the top-k frequent sequence mining results on the data published by each algorithm; Fig. 6(e) and Fig. 6(f) show the sequence length distribution error (total variation distance) of the data published by each algorithm. DPST achieves the same utility as PrivTree in all experiments, and changing the number of participants has no effect on either method. DPST is also superior to Independent in all cases. The results show that Independent is sensitive to the number of participants: the amount of noise it injects into a node grows with the number of participants, so its utility is worse than that of DPST.

Secondly, the performance effect of the improved node splitting discrimination protocol IMSP is evaluated by comparing the running time of the IMSP with the running time of the BMSP, as shown in fig. 7. Fig. 7 (a) shows the runtime (in seconds) of IMSP and BMSP at different fan-outs of PST, where the number of participants is set to 2; fig. 7 (b) shows the run times (in seconds) of the IMSP and BMSP in the scenario where the number of participants is different, with the fan-out of the PST set to 10. It can be observed that in all cases, the runtime of IMSPs is much less than BMSPs, and this difference becomes more pronounced as PST fan-out or number of participants increases. Meanwhile, the runtime of IMSP tends to scale linearly with the number of fanouts, while the runtime of BMSP grows quadratically with the number of fanouts.

The effectiveness of node splitting based on batch processing is then evaluated by comparing the run times of the BBM method and the NBM method on the data set, as shown in fig. 8. Fig. 8 (a) and fig. 8 (b) show the run times (in minutes) of BBM and NBM at different privacy budgets, with the number of participants set to 2; fig. 8 (c) and fig. 8 (d) show the run time (in minutes) of BBM and NBM at different numbers of participants, with the privacy budget set to 0.4. It can be observed that as the privacy budget increases, the runtime of both BBM and NBM becomes longer. This is because the PSTs constructed by BBM and NBM have larger heights and more nodes under a larger privacy budget. Furthermore, it can be observed that in all cases the runtime of the BBM is much smaller than NBM, and the difference becomes more pronounced as the privacy budget or number of participants increases. The reason is that the communication time between the parties is remarkably reduced by performing node splitting judgment by using batch processing; by utilizing parallel computing, the computing time of the parties is significantly reduced.

Example 2:

Embodiment 2 of the present disclosure provides a data providing device including a processor, where the processor is communicatively connected to an external second terminal and is configured to:

preprocessing the held data sequence;

receiving a predicted suffix tree and a node queue which are sent by a second terminal and only comprise root nodes, judging whether the nodes in the node queue need to be split or not in a batch processing mode under the condition that differential privacy is met, and sending a judgment result to the second terminal so that the second terminal can obtain a final predicted suffix tree structure;

The working mode of the device is the same as that of the method in embodiment 1, and the description is omitted here.

Example 3:

Embodiment 3 of the present disclosure provides a multi-party sequence data issuing method satisfying differential privacy, which is applied to a second terminal and includes the following steps:

The detailed method is the same as that in example 1, and is not described herein again.

Example 4:

Embodiment 4 of the present disclosure provides a multi-party sequence data issuing device satisfying differential privacy, including a processor that is communicatively connected to a first terminal, the processor being configured to:

Example 5:

Embodiment 5 of the present disclosure provides a multi-party sequence data issuing method satisfying differential privacy, including the following steps:

Example 6:

As shown in Fig. 9, embodiment 6 of the present disclosure provides a multi-party sequence data issuing system satisfying differential privacy, including at least two first terminals and at least one second terminal, where each first terminal is communicatively connected to the second terminal;

The working method of the system is the same as that in embodiment 1, and is not described again here.

As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims

1. A multi-party sequence data issuing method satisfying differential privacy, applied to a first terminal, comprising the following steps:

preprocessing the held data sequence;

receiving a predicted suffix tree and a node queue which are sent by a second terminal and only comprise root nodes, judging whether the nodes in the node queue need to be split or not by adopting a batch processing mode according to a preprocessed data sequence under the condition of meeting differential privacy, and sending a judgment result to the second terminal so that the second terminal obtains the final structure of the predicted suffix tree;

2. The multi-party sequence data issuing method satisfying differential privacy as claimed in claim 1, wherein the preprocessing specifically is: adding a start symbol and an end symbol for the data sequence, and truncating the data sequence with the length larger than a preset threshold value;

or the first terminal and the second terminal carry out data interaction, and jointly execute a node splitting discrimination protocol to judge whether each node needs to be split or not;

or, the batch processing mode specifically includes: dividing tasks in a data block mode, wherein each block comprises a plurality of tasks, the tasks in each block are from different nodes, the number of the tasks from the different nodes is different, and the total number of the tasks in each block is fixed;

for tasks from a certain node, their positions in the blocks are continuously slid downwards, so that the task at the lowest end of any one block is the last task of a certain node;

or, calculating a prediction histogram of the node, specifically:

calculating suffix histograms of all leaf nodes, and injecting Laplace noise into each dimension of the suffix histograms, wherein each noise injection satisfies differential privacy;

for all non-leaf nodes, the suffix histogram is the sum of suffix histograms of all leaf nodes in a subtree taking the node as a root node;

or, the method for generating each sequence of the new overall sequence data set specifically comprises:

initializing the sequence s₀ = $, and then inserting characters one by one at the end of the sequence;

the i-th insertion process is: for the currently generated sequence s_{i-1} = $x₁x₂…x_{i-1}, finding in τ the node whose prediction sequence equals the currently generated sequence, selecting a symbol x_i according to a preset probability distribution, and inserting x_i at the end of s_{i-1}, thereby generating the new subsequence s_i;

if x_i ≠ &, continuing to execute the insertion process; otherwise, the sequence generation ends, where $ is the start symbol and & is the end symbol.
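The preprocessing and sequence generation steps recited in claim 2 can be illustrated with a minimal sketch, which is not part of the claims. It makes several assumptions: the predicted suffix tree is represented as a dictionary from a context string to a next-symbol distribution, the matching node is the longest context that is a suffix of the generated sequence, truncation is applied before the markers are added, and a `max_steps` guard bounds the loop. All names are hypothetical.

```python
import random

START, END = "$", "&"  # start and end symbols as named in the claim


def preprocess(seq, max_len):
    """Truncate to max_len symbols, then add the start and end markers.

    Assumption: truncation happens before the markers are added.
    """
    return START + seq[:max_len] + END


def find_node(tree, generated):
    """Return the distribution of the longest context in `tree` that is a
    suffix of `generated`. The empty context "" plays the role of the root.
    """
    for start in range(len(generated) + 1):
        context = generated[start:]
        if context in tree:
            return tree[context]
    raise KeyError("tree must contain the empty (root) context")


def generate(tree, rng, max_steps=1000):
    """Grow one sequence from START until END is drawn."""
    seq = START
    for _ in range(max_steps):  # guard against trees that never emit END
        dist = find_node(tree, seq)
        symbols = list(dist)
        weights = [dist[s] for s in symbols]
        x = rng.choices(symbols, weights=weights, k=1)[0]
        seq += x
        if x == END:
            break
    return seq


# Toy tree with deterministic transitions, for illustration only.
toy_tree = {
    "": {END: 1.0},
    "$": {"a": 1.0},
    "a": {"b": 1.0},
    "b": {END: 1.0},
}
print(generate(toy_tree, random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible; in practice the per-node distributions would come from the noisy prediction histograms.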

3. A data providing apparatus, comprising a processor communicatively coupled to an external second terminal, the processor configured to:

preprocessing the held data sequence;

4. The data providing device according to claim 3, wherein the preprocessing is specifically: adding a start symbol and an end symbol for the data sequence, and truncating the data sequence with the length larger than a preset threshold value;

or the processor and the second terminal perform data interaction, jointly execute a node splitting discrimination protocol and judge whether each node needs to be split or not;

or, the batch processing mode specifically includes: the task division is carried out in a data block mode, each block comprises a plurality of tasks, the tasks in each block are from different nodes, the number of the tasks from the different nodes is different, the total number of the tasks in each block is fixed, and for the tasks from a certain node, the positions of the tasks in the block continuously slide downwards, so that the task at the lowest end of any one block is the last task of the certain node;

or, calculating a prediction histogram of the node, specifically:

5. A multi-party sequence data issuing method satisfying differential privacy, applied to a second terminal, comprising the following steps:

6. The multi-party sequence data issuing method satisfying differential privacy as claimed in claim 5, wherein the preprocessing specifically includes: adding a start symbol and an end symbol for the data sequence, and truncating the data sequence with the length larger than a preset threshold value;

or the second terminal is combined with the first terminal to jointly execute a node splitting discrimination protocol and judge whether each node needs to be split or not;

or, calculating a prediction histogram of the node, specifically:

the i-th insertion process is: for the currently generated sequence s_{i-1} = $x₁x₂…x_{i-1}, finding in τ the node whose prediction sequence equals the currently generated sequence, selecting a symbol x_i according to a preset probability distribution, and inserting x_i at the end of s_{i-1}, thereby generating the new subsequence s_i;

7. A multi-party sequence data distribution device satisfying differential privacy, comprising a processor communicatively coupled to a first terminal, the processor configured to:

8. The multi-party sequence data distribution device satisfying differential privacy as claimed in claim 7, wherein the preprocessing specifically comprises: adding a start symbol and an end symbol for the data sequence, and truncating the data sequence with the length larger than a preset threshold value;

or, calculating a prediction histogram of the node, specifically:

9. A multi-party sequence data issuing method satisfying differential privacy, characterized by comprising the following steps:

each first terminal preprocesses the held data sequence;

the second terminal is combined with the first terminal, whether the nodes in the node queue need to be split or not is judged by adopting a batch processing mode according to the preprocessed data sequence under the condition that differential privacy is met, and the first terminal sends a judgment result to the second terminal;

10. The system is characterized by comprising at least two first terminals and at least one second terminal, wherein each first terminal is in communication connection with the second terminal;