CN111585751A

CN111585751A - Data sharing method based on block chain

Info

Publication number: CN111585751A
Application number: CN202010284363.8A
Authority: CN
Inventors: 郭兵; 沈艳; 董详千
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2020-04-10
Filing date: 2020-04-10
Publication date: 2020-08-25

Abstract

The invention discloses a data sharing method based on a block chain. The method addresses data exchange both within a single blockchain and across different blockchains. A prototype implementation demonstrates its feasibility and lays a research and application foundation for the further development of the value internet. Distributed storage scatters data across the nodes of the network, which improves the security and redundancy of the system: if the data on a node is modified, deleted, or forged, the error can be detected and the data recovered from the data held by other nodes. Experimental results show that the method provides a feasible and efficient sharing mechanism.

Description

Data sharing method based on block chain

Technical Field

The invention relates to the technical field of blockchain data sharing, and in particular to a data sharing method based on a block chain.

Background

Data sharing is a prerequisite for mining the potential value of data assets. Traditional data management approaches, such as the data market, follow this pattern: the data provider uploads data to a data market, and the data demander downloads and analyzes the data to extract its value. This model has significant disadvantages. First, data search is limited to keyword retrieval or browsing, so useful data cannot be found efficiently. Second, the data owner loses control of the data, so data ownership and data security cannot be guaranteed. Finally, data transactions lack transparency, so fraud such as collusion among transaction participants cannot be effectively detected.

The method establishes a new data management model based on blockchain technology. Its core idea is that the original data never leaves the control of the data provider: data analysis is completed by the data provider, and only the analysis results are sent. To this end, the method addresses the following problems: 1) to discover data effectively, a mechanism for building data provider indexes, comprising a metadata extraction mechanism and a domain index establishing mechanism; 2) blockchain-based data transactions, including the transaction record format and the consensus mechanism, which make transactions transparent and guard against collusion and data fraud; 3) privacy-preserving secure computation among data providers, carried out through smart contracts according to the computing requirements of the data demander. Experimental results show that the method provides a feasible and secure sharing mechanism.

Disclosure of Invention

The invention aims to provide a data sharing method based on a block chain.

The technical scheme adopted by the invention for solving the technical problem is as follows:

A data sharing method based on a block chain comprises a data set index establishing step, a data set retrieval step, a data requirement contract writing step, a data transaction step, and a secure data computation step, specifically as follows:

1) Data set index establishment: following the domain index establishing mechanism, the data set is partitioned by domain, a domain index is built after confirming, limiting, or expanding the domains and their values, and the index is optimized and stored.

Metadata extraction: the domains (metadata or attributes) of the data set are extracted according to specific rules, the domain values are determined, and the range of the domain values is limited or expanded according to the domain size.

The method uses Jaccard similarity to judge whether two data sets are the same. If X and Y are the sets of metadata items of two data sets, their Jaccard similarity is:

$$J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} \quad (1)$$

The domains are logically divided into groups, and each domain index is stored on the node whose hash value is adjacent to the LSH value of the domain, according to the adjacency between the LSH values of the domains and the hash values of all nodes.
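The index-building step can be illustrated with a minimal Python sketch. It computes the Jaccard similarity of two metadata sets and places a domain index on the node whose hash is nearest to the domain's hash. The SHA-256 based hash `ring_hash`, the 32-bit ring, and the attribute names are assumptions made for illustration; the source does not specify the hash functions, and a real LSH value would replace `ring_hash` for domains.

```python
import hashlib
from typing import List, Set


def jaccard(x: Set[str], y: Set[str]) -> float:
    """Jaccard similarity |X n Y| / |X u Y|; two empty sets count as identical."""
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def ring_hash(value: str) -> int:
    """Hypothetical 32-bit hash, a stand-in for the LSH value / node hash."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:4], "big")


def nearest_node(domain_key: str, nodes: List[str]) -> str:
    """Pick the node whose hash value is closest to the domain's hash value."""
    target = ring_hash(domain_key)
    return min(nodes, key=lambda node: abs(ring_hash(node) - target))


# Hypothetical metadata items of two data sets
education = {"personal_id", "course_id", "gender", "school"}
tax = {"personal_id", "year", "month", "dividend_income"}

similarity = jaccard(education, tax)  # 1 shared item out of 7 distinct items
home = nearest_node("personal_id", ["node1", "node2", "node3"])
print(similarity, home)
```

Storing each index near its hash value means that a query only needs to contact the nodes adjacent to the query domain's hash rather than every node.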

2) Data set retrieval: a query domain is formed from the query requirement of the data demander, and the required data set is retrieved from the domain index of step 1).

Using the domain index retrieval mechanism, a domain similarity search queries the degree of overlap between the query terms and the index to obtain the retrieved data set.

If Q is the query domain and I is the index domain, the domain similarity t(Q, I) is given by formula (2).
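Formula (2) is not reproduced in this text, so the following Python sketch is only one plausible reading: it assumes the domain similarity is set containment, |Q ∩ I| / |Q|, the measure commonly used in LSH Ensemble style domain search. The function names, index contents, and threshold are hypothetical.

```python
from typing import Dict, List, Set


def containment(query: Set[str], indexed: Set[str]) -> float:
    """Fraction of the query domain's values that also occur in the indexed domain."""
    if not query:
        return 0.0
    return len(query & indexed) / len(query)


def search(query: Set[str], index: Dict[str, Set[str]], threshold: float) -> List[str]:
    """Return the names of indexed domains whose score reaches the threshold."""
    return sorted(name for name, domain in index.items()
                  if containment(query, domain) >= threshold)


# Hypothetical domain index: domain name -> set of domain values
index = {
    "education.personal_id": {"p1", "p2", "p3", "p4"},
    "tax.personal_id": {"p2", "p3", "p9"},
    "noise.code": {"z1", "z2"},
}
hits = search({"p2", "p3"}, index, threshold=0.8)
print(hits)
```

This exhaustive scan is for clarity only; an LSH-based index exists precisely to avoid comparing the query against every indexed domain.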

3) Data requirement contract writing: a data ordering contract is written according to the characteristics of the data set obtained in step 2) and the requirements of the data demander.

4) Data transaction and evaluation mechanism: the data demander pays a fee to the system, set according to the value strategy and the equilibrium price requirement of the provider of the data set obtained in step 2), to compensate the data provider and the related participants. A transaction refers to the processing logic of the data, and the consensus mechanism of the method is an improved delegated Byzantine fault-tolerant algorithm (dBFT). Let the number of nodes in the network be N; the participating nodes are numbered from 0 to N-1 and sorted in descending order of trust, and the top n nodes are taken as consensus nodes. Let the height of the current consensus block be h and the transaction number be v. The positive evaluation number p and the negative evaluation number f of the two transaction parties are calculated by formulas (3) and (4):

$p = (h - v) \bmod n$ (3)

The credit of node n in the i-th transaction is then calculated from the positive and negative evaluation numbers according to formula (5).
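The node selection and formula (3) can be sketched in Python as follows. The sketch assumes that the consensus set is the n most trusted of the N nodes and that ties are broken by node number; the trust values are made up. Formulas (4) and (5) are not reproduced in this text and are therefore left out.

```python
from typing import Dict, List


def select_consensus_nodes(trust: Dict[int, float], n: int) -> List[int]:
    """Sort nodes by trust, highest first, and keep the top n (ties broken by node number)."""
    ranked = sorted(trust, key=lambda node: (-trust[node], node))
    return ranked[:n]


def formula_3(h: int, v: int, n: int) -> int:
    """p = (h - v) mod n. Python's % is non-negative for n > 0, even when h < v."""
    return (h - v) % n


# Hypothetical trust values for N = 5 nodes numbered 0..4
trust = {0: 0.91, 1: 0.40, 2: 0.77, 3: 0.85, 4: 0.62}
consensus = select_consensus_nodes(trust, n=4)
p = formula_3(h=10, v=3, n=len(consensus))
print(consensus, p)
```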

5) Secure data computation and privacy protection mechanism: once the system payment of step 4) is complete, the system nodes carry out the secure computation on the data and obtain an output that meets the privacy requirements.

Data joining and sharing: records with the same key in different data sets are merged and secret-shared among the parties participating in the multi-party computation. Let $\pi_s$ be a family of pseudo-random permutations in which the secret key s uniquely determines a particular permutation. The algorithm is as follows:

Input: secret-shared data tables $T_i$, where $k_i$ denotes the primary key column of table $T_i$

Output: the secret-shared equi-join table $T^*$ of the input tables

a. Each computing party randomly shuffles each database table $T_i$; $T_i^*$ denotes the shuffled table and $k_i^*$ its shuffled primary key column

b. The parties select a random permutation function $\pi_s$ using a secret-shared key s

c. Each party in turn evaluates the shuffled primary key column with the permutation function $\pi_s$ and transmits the resulting values $\pi_s(k_i^*)$ to the next computing party in sequence; each computing party joins its data with the result sent by the previous party, and the result table $T^*$ is finally generated
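The flow of steps a to c can be illustrated with a simplified Python sketch. It runs in the clear in a single process, so it shows only the shuffle, keyed tagging, and join, not the secret sharing between parties. HMAC-SHA256 is used as a stand-in for $\pi_s$; it is a pseudo-random function rather than a true permutation, and the table contents and key are hypothetical.

```python
import hashlib
import hmac
import random
from typing import Dict, List


def prf(key: bytes, value: str) -> str:
    """Keyed tag for a primary key (HMAC-SHA256 as an assumed stand-in for pi_s)."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()


def shuffle_and_tag(table: List[Dict], key_col: str, s: bytes,
                    rng: random.Random) -> List[Dict]:
    """Step a: shuffle the rows. Step c: replace each primary key by its keyed tag."""
    rows = list(table)
    rng.shuffle(rows)
    return [{**row, key_col: prf(s, row[key_col])} for row in rows]


def equi_join(left: List[Dict], right: List[Dict], key_col: str) -> List[Dict]:
    """Join two tagged tables on the tagged key column."""
    by_tag: Dict[str, List[Dict]] = {}
    for row in right:
        by_tag.setdefault(row[key_col], []).append(row)
    return [{**l, **r} for l in left for r in by_tag.get(l[key_col], [])]


rng = random.Random(7)
s = b"shared-secret-key"  # step b: key agreed between the parties (hypothetical)
education = [{"id": "p1", "degree": "master"}, {"id": "p2", "degree": "bachelor"}]
tax = [{"id": "p2", "dividend": 120}, {"id": "p3", "dividend": 40}]
joined = equi_join(shuffle_and_tag(education, "id", s, rng),
                   shuffle_and_tag(tax, "id", s, rng), "id")
print(joined)
```

Because equal keys map to equal tags under the same key s, the join can match rows without any party seeing the other party's plaintext primary keys.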

Sorting: the order of the vectors is determined from the generated result table $T^*$; in essence this is a random ordering. Let the n computing parties share the secret vector $x_1, x_2, \ldots, x_n$. Without loss of generality, $[\cdot]$ denotes a secret-shared value, so the shared vector is written $[x_1], [x_2], \ldots, [x_n]$.
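The bracket notation can be made concrete with a small Python sketch. The source does not name a sharing scheme, so additive secret sharing modulo a prime is assumed here, and the modulus, vector, and ordering are hypothetical. Each party reorders its own shares with the same ordering, and the reordered vector is recovered only when all shares are combined. In this sketch the ordering is known to every party, which a real protocol would avoid.

```python
import secrets
from typing import List

PRIME = 2**61 - 1  # hypothetical modulus for additive sharing


def share(value: int, parties: int) -> List[int]:
    """Split value into additive shares that sum to value modulo PRIME."""
    parts = [secrets.randbelow(PRIME) for _ in range(parties - 1)]
    parts.append((value - sum(parts)) % PRIME)
    return parts


def reconstruct(parts) -> int:
    """Recombine the shares of one value."""
    return sum(parts) % PRIME


def share_vector(xs: List[int], parties: int) -> List[List[int]]:
    """Return one share vector per party, i.e. [x_1], ..., [x_n] as seen by each party."""
    per_element = [share(x, parties) for x in xs]
    return [[element[p] for element in per_element] for p in range(parties)]


def apply_order(vector: List[int], order: List[int]) -> List[int]:
    """Reorder a party's local share vector with a common ordering."""
    return [vector[i] for i in order]


xs = [42, 7, 19]
shares = share_vector(xs, parties=3)
order = [2, 0, 1]  # hypothetical ordering derived from the result table
shuffled = [apply_order(v, order) for v in shares]
recovered = [reconstruct(column) for column in zip(*shuffled)]
print(recovered)
```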

Drawings

FIG. 1 Data set retrieval precision

FIG. 2 Data set retrieval recall

FIG. 3 Statistics of internet access volume over one day

FIG. 4 Comparison of time consumption

Detailed Description

To evaluate the system performance, a prototype system was constructed. The experiments evaluate the core modules of the system: index performance, transaction blockchain network performance, and secure computation performance.

1) Description of the Experimental Environment

The experiment was performed on 5 Ubuntu cloud servers, each configured as follows: dual Intel(R) Xeon(R) 2.0 GHz CPUs; 2 GB of memory; Ubuntu 14.04.5 LTS Server with kernel version 4.4.0-31-generic. The 5 servers are named node1, node2, ..., node5. node4 and node5 store the original data sets and act as data providers and participants in the secure computation. Any of node1 to node3 can act as the data demander. All nodes are candidate consensus nodes for block production.

The software implementation builds on open source software: the index part is built on MinHash LSH Ensemble; the blockchain uses Ethereum, with a modified block structure and consensus mechanism; the multi-party computation draws on the Obliv-C technology and programming language.

2) Description of the Experimental samples

The method uses Python to generate two main test data sets (an education data set and a tax data set) and several interference data sets (for measuring index and retrieval performance). The attributes of the education data set are: personal ID, course ID, gender, year and month of birth, education level (bachelor, master, doctorate, higher vocational education), prescribed length of the program, school attended, date of admission, learning status (in progress, withdrawn, graduated), and graduation or withdrawal date. The attributes of the tax data set are: personal ID, year, month, social security fee paid, dividend income, and whether the employer is from ICT (or ITL).
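A generator for such test data might look like the following Python sketch. It is not the generator used in the experiments: the field names, value ranges, seed, and record counts are all hypothetical, and only a subset of the listed attributes is produced.

```python
import random
from typing import Dict, List


def make_education(rng: random.Random, person_ids: List[str]) -> List[Dict]:
    """One synthetic education record per person (hypothetical field names)."""
    levels = ["bachelor", "master", "doctorate", "higher vocational"]
    statuses = ["in progress", "withdrawn", "graduated"]
    return [{
        "personal_id": pid,
        "course_id": f"C{rng.randint(100, 999)}",
        "gender": rng.choice(["F", "M"]),
        "education_level": rng.choice(levels),
        "status": rng.choice(statuses),
    } for pid in person_ids]


def make_tax(rng: random.Random, person_ids: List[str], year: int) -> List[Dict]:
    """One synthetic tax record per person and month (hypothetical value ranges)."""
    return [{
        "personal_id": pid,
        "year": year,
        "month": month,
        "social_security_paid": round(rng.uniform(200, 2000), 2),
        "dividend_income": round(rng.uniform(0, 5000), 2),
    } for pid in person_ids for month in range(1, 13)]


rng = random.Random(2020)  # fixed seed so the data sets are reproducible
ids = [f"P{i:04d}" for i in range(5)]
education = make_education(rng, ids)
tax = make_tax(rng, ids, year=2019)
print(len(education), len(tax))
```

Sharing the personal ID values between the two data sets is what makes them joinable in the secure computation experiments.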

3) Index performance

Recall and precision are two key metrics of a search engine; a third, the F value, is derived from them and is also an important measure of the system. FIG. 1 and FIG. 2 compare the precision and recall of data set retrieval against the locality-sensitive hashing ensemble (LSH Ensemble) algorithm for different similarity thresholds t. As FIG. 1 and FIG. 2 show, the algorithm of the present disclosure adds prefix information to the index, which improves precision and recall to some extent.
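For reference, the three metrics can be computed as in the Python sketch below. It assumes the F value is the balanced F1 score, the harmonic mean of precision and recall, since the source does not state a weighting; the data set identifiers are made up.

```python
from typing import Set, Tuple


def precision_recall_f1(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 of a retrieval result against the relevant set."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


retrieved = {"d1", "d2", "d3", "d4"}  # data sets returned by a query
relevant = {"d2", "d3", "d5"}         # data sets that should have been returned
p, r, f = precision_recall_f1(retrieved, relevant)
print(p, r, f)
```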

4) Blockchain network performance

The block production rate is a hard limit for most blockchain networks; for example, Bitcoin produces about one block every 10 minutes. For a fixed block size (1 MB on the Bitcoin network, for example), the block production rate determines the number of transactions that can be processed per unit time, so it is an important factor in both the real-time performance of the system and the growth rate of the network. In the dBFT consensus algorithm, the time interval, the upper bound on the number of blocks, and the number of network nodes are all related to the block size.

Because the number of transactions varies randomly across time periods, the transaction volume of the system is simulated using internet access probabilities; FIG. 3 shows access statistics by time period for a certain internet service. As FIG. 3 shows, using the transaction time interval alone as the trigger for consensus is often unreasonable. A limit on the number of transactions in a block is therefore added to the consensus algorithm to raise the transaction rate at peak times.
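One way to read this change is sketched below in Python: a block is produced when either the time interval has elapsed or the number of pending transactions reaches a limit. This trigger rule, the parameter names, and the arrival pattern are assumptions for illustration; the source does not give the exact rule.

```python
from typing import List, Tuple


def should_seal_block(pending: int, elapsed: int, max_tx: int, interval: int) -> bool:
    """Seal when there is work and either the count limit or the time interval is reached."""
    return pending > 0 and (pending >= max_tx or elapsed >= interval)


def simulate(arrivals: List[int], max_tx: int, interval: int) -> List[Tuple[int, int]]:
    """arrivals[i] is the number of transactions arriving in second i + 1.

    Returns (second, block_size) for every block produced. Assumes max_tx >= 1.
    """
    blocks, pending, last_seal = [], 0, 0
    for second, count in enumerate(arrivals, start=1):
        pending += count
        while should_seal_block(pending, second - last_seal, max_tx, interval):
            size = min(pending, max_tx)
            blocks.append((second, size))
            pending -= size
            last_seal = second
    return blocks


# Quiet period followed by a burst (hypothetical arrival pattern)
blocks = simulate([1, 0, 0, 9, 1, 0], max_tx=4, interval=3)
print(blocks)
```

With a time trigger alone, the burst in second 4 would wait for the next interval; with the count limit, two full blocks are produced immediately.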

5) Secure computing performance

To protect the data, the more efficient join query is replaced by sub-queries, and the effect on performance was verified experimentally, as shown in FIG. 4. As the figure shows, the algorithm proposed by the method increases the running time to some extent in exchange for improved data security.

Claims

1. A data sharing method based on a block chain is characterized by comprising the steps of data set index establishment, data set retrieval, data requirement contract compiling, data transaction and data security calculation, and specifically comprises the following steps:

S1: data set index establishment: following the domain index establishing mechanism, the data set is partitioned by domain, a domain index is built after confirming, limiting, or expanding the domains and their values, and the index is optimized and stored;

S2: data set retrieval: a query domain is formed from the query requirement of the data demander, and the required data set is retrieved from the domain index of step S1;

S3: data requirement contract writing: a data ordering contract is written according to the characteristics of the data set obtained in step S2 and the requirements of the data demander;

S4: data transaction and evaluation mechanism: a fee is paid to the system, according to the value strategy and the equilibrium price requirement of the provider of the data set obtained in step S2, to compensate the data provider and the related participants;

S5: secure data computation and privacy protection mechanism: once the system payment of step S4 is complete, the system nodes carry out the secure computation on the data and obtain an output that meets the privacy requirements.

2. The method according to claim 1, wherein the step S1 domain index establishing mechanism comprises:

S11: metadata extraction: the domains (metadata or attributes) of the data set are extracted according to specific rules, the domain values are determined, and the range of the domain values is limited or expanded according to the domain size;

S12: the domains are logically divided into groups, and each domain index is stored on the node whose hash value is adjacent to the LSH value of the domain, according to the adjacency between the LSH values of the domains and the hash values of all nodes.

3. The method according to claim 1, wherein in the data set retrieval of step S2, a domain index retrieval mechanism is used, and a domain similarity search queries the degree of overlap between the query terms and the index to obtain the retrieved data set;

assuming that Q is the query domain and I is the index domain, the domain similarity can be represented by t (Q, I).

4. The method according to claim 1, wherein in the data transaction and evaluation mechanism of step S4, a transaction refers to the processing logic of the data, and the consensus mechanism of the method is an improved delegated Byzantine fault-tolerant algorithm (dBFT); let the number of nodes in the network be N; the participating nodes are numbered from 0 to N-1 and sorted in descending order of trust, and the top n nodes are taken as consensus nodes; let the height of the current consensus block be h and the transaction number be v; the positive evaluation number p and the negative evaluation number f of the two transaction parties are calculated by formulas (3) and (4):

$p = (h - v) \bmod n$ (3)

and the credit of node n in the i-th transaction is calculated from the positive and negative evaluation numbers according to formula (5).

5. The blockchain-based data sharing method according to claim 1, wherein the step S5 includes a data security calculation and privacy protection mechanism,

S51: data joining and sharing: records with the same key in different data sets are merged and secret-shared among the parties participating in the multi-party computation; let $\pi_s$ be a family of pseudo-random permutations in which the secret key s uniquely determines a particular permutation; each party in turn evaluates the primary key column with $\pi_s$ and transmits the resulting values to the subsequent computing party;

S52: sorting: the order of the vectors is determined from the generated result table $T^*$; in essence this is a random ordering; let the n computing parties share the secret vector $x_1, x_2, \ldots, x_n$; without loss of generality, $[\cdot]$ denotes a secret-shared value, so the shared vector is written $[x_1], [x_2], \ldots, [x_n]$.