CN112836005B

CN112836005B - Ciphertext ranked search method and system based on PCA

Info

Publication number: CN112836005B
Application number: CN201911167134.1A
Authority: CN
Inventors: 刘良桂; 刘政金
Original assignee: Zhejiang Shuren University
Current assignee: Zhejiang Shuren University
Priority date: 2019-11-25
Filing date: 2019-11-25
Publication date: 2022-05-17
Anticipated expiration: 2039-11-25
Also published as: CN112836005A

Abstract

The invention discloses a PCA-based ciphertext ranked search method and system. The PCA algorithm reduces the dimensionality of the keyword index matrix, which in turn reduces the dimensionality of the key and substantially improves data encryption speed and search efficiency. To counter privacy intrusion by unauthorized users and untrusted servers, an invertible-matrix encryption method protects the data; on this basis, the threshold is set randomly so that the dimensionality reduction itself is randomized, further improving data security. In addition, based on an analysis of how the PCA algorithm works, the invention appends an identity matrix to the index matrix before dimensionality reduction, so that the components of the query vector also take part in the dimensionality reduction, which improves security while preserving query precision.

Description

Ciphertext ranked search method and system based on PCA

Technical Field

The invention relates to the technical field of cloud computing and network security, and in particular to a PCA-based ciphertext ranked search method and system.

Background

The rise of cloud storage has taken computer technology a step further: enterprises and users can store very large volumes of data on a cloud server, free of the limits of local storage, while enjoying faster and more convenient services such as saving local storage space, sharing information, and processing information with fast cloud services. However, while cloud storage brings convenience, it also introduces data security risks. For example, users' personal information, confidential enterprise documents, and other sensitive private information can easily be extracted and leaked by the server. Therefore, to protect privacy and security, the data owner must encrypt the data before storing it in the cloud. Encryption prevents attacks by illegal users, unauthorized users, and untrusted cloud servers and thus prevents privacy disclosure, but encrypted data cannot be retrieved as conveniently as plaintext, so searching a large volume of encrypted cloud data becomes much harder and search efficiency drops sharply. Research on ciphertext ranked search schemes is therefore urgently needed.

Disclosure of Invention

In view of this, the embodiments of the present invention provide a ciphertext sorting and searching method based on PCA, which reduces the computation overhead and realizes efficient ciphertext search on a remote server on the premise of protecting the privacy of a user.

The technical scheme adopted by the embodiment of the invention is as follows:

One aspect of the embodiments of the present invention provides a PCA-based ciphertext ranked search method, which includes:

(1) extracting keywords from the documents, calculating normalized word frequencies, building a keyword index, reducing the dimensionality of the index matrix with the PCA algorithm, encrypting the reduced index, and uploading it to a server;

(2) a data user inputs keywords, a query vector is built, its dimensionality is reduced with the PCA algorithm, and a trapdoor is generated;

(3) the data user sends a request to the server, and after receiving the request the cloud server performs the computation and search and returns the ranked results to the user.

Further, in the step (1), the step of extracting the keywords from the document specifically includes:

when a data owner processes data, firstly extracting keywords of each document to generate a document keyword set, then summarizing the keywords of all documents to generate a non-repeated keyword dictionary, wherein the number of the keywords of the dictionary is n.

Further, in step (1), calculating the normalized word frequencies, building the keyword index, reducing the dimensionality of the index matrix with the PCA algorithm, encrypting the reduced index, and uploading it to the server specifically comprises the following steps:

(1.1) creating an initial index

The normalized word frequency of each dictionary keyword in document $i$ is calculated to generate an index vector $D_i = (d_{i1}, d_{i2}, \ldots, d_{ik}, \ldots, d_{in})^T$, where $d_{ik}$ is the normalized word frequency, in the $i$-th document, of the $k$-th keyword of the dictionary. The $m$ documents thus form an $m \times n$ initial index matrix $D = (D_1, D_2, \ldots, D_m)^T$;
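As an illustration of the dictionary and initial index described above, the following minimal sketch builds a duplicate-free keyword dictionary and the matrix D. The function names and the sample documents are hypothetical, and the normalization (keyword count divided by the document's total keyword count) is only an assumption, since the text does not fix a normalization at this point.

```python
from collections import Counter

def build_dictionary(doc_keywords):
    """Merge per-document keyword lists into a sorted, duplicate-free dictionary."""
    return sorted({w for kws in doc_keywords for w in kws})

def build_index(doc_keywords, dictionary):
    """Return the m x n matrix D; row i holds the normalized word frequencies
    of document i (ASSUMED normalization: count / total keyword count)."""
    index = []
    for kws in doc_keywords:
        counts = Counter(kws)
        total = sum(counts.values()) or 1  # guard against empty documents
        index.append([counts[w] / total for w in dictionary])
    return index

# Hypothetical keyword lists already extracted from two documents.
docs = [
    ["cloud", "search", "cloud", "index"],
    ["index", "key", "key", "key"],
]
dictionary = build_dictionary(docs)   # n = 4 keywords
D = build_index(docs, dictionary)     # m = 2 rows, n = 4 columns
```

Each row of `D` corresponds to one vector D_i; the column order is fixed by the dictionary, which the query side must reuse.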

(1.2) introducing an identity matrix

An $n \times n$ identity matrix $E$ is appended after the initial index matrix, and the two are combined into an $(m+n) \times n$ matrix $F = (D_1, D_2, \ldots, D_m, E)^T$.

(1.3) random dimensionality reduction of initial index

A threshold is set randomly within a certain range, and the principal components of the matrix $F$ are analyzed with the PCA algorithm to remove similar keywords, yielding an $n \times k$ principal component matrix $R$. The matrix $F$ is then reduced in dimension as $F' = FR$, giving an $(m+n) \times k$ matrix $F'$. The last $n$ rows are then deleted to obtain an $m \times k$ matrix $D' = (D_1', D_2', \ldots, D_i', \ldots, D_m')^T$, where $D_1', D_2', \ldots, D_i', \ldots, D_m'$ are $k$-dimensional column vectors;
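The text does not say how the threshold determines k. A common reading, assumed here, is that the threshold is a cumulative explained-variance ratio: k is the smallest number of leading eigenvalues whose share of the total reaches the threshold. The sketch below shows only this selection step; the eigenvalues, the function names, and the bounds of the random range are all hypothetical.

```python
import random

def choose_k(eigenvalues, threshold):
    """Smallest k whose leading eigenvalues explain at least `threshold`
    of the total variance (eigenvalues are sorted in descending order first)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(ordered)

def random_threshold(low=0.90, high=0.99, rng=random):
    """Draw the threshold uniformly from [low, high]; the bounds are illustrative."""
    return rng.uniform(low, high)

# Hypothetical eigenvalues of the covariance matrix of F.
eigenvalues = [6.0, 2.5, 1.2, 0.2, 0.1]
k_at_095 = choose_k(eigenvalues, 0.95)  # cumulative ratios: 0.60, 0.85, 0.97, ...
k_at_080 = choose_k(eigenvalues, 0.80)
```

Under this assumption a smaller threshold yields a smaller k, and because the threshold is drawn at random the resulting dimension is not fixed in advance.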

(1.4) generating a Key

The data owner randomly generates two invertible $(k+u+1) \times (k+u+1)$ matrices $M_1, M_2$ as the key, together with a $(k+u+1)$-dimensional vector $S$ as the split indicator, written $S = (s_1, s_2, \ldots, s_j, \ldots, s_{k+u+1})$ with $S \in \{0,1\}^{k+u+1}$, where $k$ is the dimension of each row vector of the reduced matrix and $u+1$ is the number of extended dimensions;

(1.5) index dimension extension

Each vector $D_i'$ in $D'$ is extended from $k$ dimensions to $k+u+1$ dimensions, giving an $m \times (k+u+1)$ matrix $\tilde{D}$. For each column vector, the first $k$ dimensions of $D_i'$ are kept unchanged, the last dimension is set to the constant 1, and dimensions $k+1$ through $k+u$ are set to random numbers $\varepsilon_i$. The extended column vector is written as $\tilde{D}_i = (D_i'^T, \varepsilon_{i1}, \varepsilon_{i2}, \ldots, \varepsilon_{iu}, 1)^T$ and the extended matrix as $\tilde{D} = (\tilde{D}_1, \tilde{D}_2, \ldots, \tilde{D}_m)^T$, where $\varepsilon_{i1}, \varepsilon_{i2}, \ldots, \varepsilon_{iu}$ are random numbers that all follow the same uniform distribution $U(\mu' - c, \mu' + c)$;
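A minimal sketch of the index extension step follows. The helper name `extend_index` and the default values of `mu` and `c` are hypothetical; the text only requires that the u filler values share one uniform distribution.

```python
import random

def extend_index(d_reduced, u, mu=0.0, c=1.0, rng=random):
    """Extend a k-dimensional index vector to k + u + 1 dimensions:
    the original k entries, then u uniform random fillers, then the constant 1."""
    fillers = [rng.uniform(mu - c, mu + c) for _ in range(u)]
    return list(d_reduced) + fillers + [1.0]

# Hypothetical reduced index vector with k = 3, extended with u = 2 fillers.
extended = extend_index([0.2, -0.1, 0.4], u=2, rng=random.Random(7))
```

The trailing constant 1 is the slot that later pairs with the random value t placed in the last dimension of the query vector.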

(1.6) index random partitioning

According to the value of the indicator vector $S$, each extended column vector $\tilde{D}_i$ is split into two vectors $\tilde{D}_i'$ and $\tilde{D}_i''$. The splitting rule is as follows: if $S[j] = 0$, then $\tilde{D}_i'[j] = \tilde{D}_i''[j] = \tilde{D}_i[j]$; if $S[j] = 1$, $\tilde{D}_i'[j]$ and $\tilde{D}_i''[j]$ are set to two unequal, non-zero random numbers whose sum equals $\tilde{D}_i[j]$.
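The index splitting rule can be sketched as below. The sketch assumes the usual secure inner-product convention that positions with indicator 0 are copied unchanged into both shares, while positions with indicator 1 are split into two unequal, non-zero random numbers that sum to the original value. The function name and the range of the random share are hypothetical.

```python
import random

def split_index(vec, S, rng=random):
    """Split `vec` into two shares according to the indicator vector S."""
    first, second = [], []
    for value, bit in zip(vec, S):
        if bit == 0:
            # indicator 0: both shares keep the original component
            first.append(value)
            second.append(value)
        else:
            # indicator 1: two unequal, non-zero shares that sum to `value`
            share = rng.uniform(-1.0, 1.0)  # range chosen arbitrarily
            while share == 0.0 or share == value or share == value - share:
                share = rng.uniform(-1.0, 1.0)
            first.append(share)
            second.append(value - share)
    return first, second

first, second = split_index([0.5, 0.25, 1.0], [0, 1, 1], random.Random(1))
```

Neither share alone reveals the components at the split positions; only their sum does.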

(1.7) index encryption

Using the key $M_1, M_2$, the split index vectors $\tilde{D}_i'$ and $\tilde{D}_i''$ are encrypted to obtain the final encrypted index of the $i$-th document, $I_i = \{M_1^T \tilde{D}_i', M_2^T \tilde{D}_i''\}$.

(1.8) the data owner uploads the encrypted document set $C$ and the encrypted index set $I$ to the cloud server, where $I = (I_1, I_2, \ldots, I_i, \ldots, I_m)$.

Further, the step (2) is specifically as follows:

(2.1) creating a query vector

When querying, the data user first inputs keywords, and the program then expands them with synonyms and near-synonyms to generate a query vector $q$. The $n$ elements of the query vector correspond to the $n$ dictionary keywords, written $q = (q_1, q_2, \ldots, q_i, \ldots, q_n)$: $q_i$ is 1 if an input keyword matches the $i$-th keyword of the keyword dictionary, and 0 otherwise;

(2.2) vector dimensionality reduction

The query vector $q$ is reduced in dimension as $q' = qR$, from $n$ dimensions to $k$ dimensions, written $q' = (q_1', q_2', \ldots, q_i', \ldots, q_k')$;
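The following short derivation is a sketch of why projecting the index and the query with the same matrix R keeps the relevance score meaningful, and of what the appended identity matrix contributes. It assumes that the columns of R are orthonormal, as is usual for PCA; the symbol $R_j$ for the j-th row of R is introduced here only for illustration.

```latex
F' = FR = \begin{pmatrix} D \\ E \end{pmatrix} R
        = \begin{pmatrix} DR \\ R \end{pmatrix},
\qquad
q' = qR = \sum_{j \,:\, q_j = 1} R_j ,
\qquad
q' \cdot D_i' = q R R^{T} D_i \approx q \cdot D_i .
```

The last n rows of F' are the rows of R itself, so every possible 0/1 query component is represented in the data that PCA analyzes, and a reduced query is simply the sum of the reduced rows of its matched keywords. The approximation in the last step is exact only for the parts of q and D_i that lie in the retained principal subspace.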

(2.3) query vector dimension expansion

$q'$ is extended from $k$ dimensions to $k+u+1$ dimensions according to the following rule: among dimensions $k+1$ through $k+u$ of $q'$, $v$ randomly selected dimensions are set to 1 and the remaining ones to 0; the first $k+u$ dimensions are then multiplied by a non-zero random number $r$, and dimension $k+u+1$ is set to a random number $t$. The extended query vector is written as $\tilde{q} = (r q', r\alpha, t)$, where $\alpha \in \{0,1\}^u$ has ones in the $v$ selected positions.
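The query extension rule can be sketched as follows. The helper name `extend_query` and the sampling ranges for r and t are hypothetical; r is drawn strictly positive here on the assumption that the ranking order of the scores should be preserved, whereas the text only requires r to be non-zero.

```python
import random

def extend_query(q_reduced, u, v, rng=random):
    """Extend a k-dimensional reduced query to k + u + 1 dimensions.
    Returns the extended vector, the random values r and t, and the chosen slots."""
    chosen = set(rng.sample(range(u), v))            # v of the u filler slots get 1
    alpha = [1.0 if j in chosen else 0.0 for j in range(u)]
    r = rng.uniform(0.5, 2.0)                        # non-zero scale (assumed positive)
    t = rng.uniform(-1.0, 1.0)                       # random offset in the last slot
    extended = [r * x for x in list(q_reduced) + alpha] + [t]
    return extended, r, t, sorted(chosen)

# Hypothetical reduced query with k = 2, extended with u = 4 and v = 2.
extended_q, r, t, chosen = extend_query([1.0, 2.0], u=4, v=2, rng=random.Random(3))
```

Because r, t, and the chosen slots are redrawn for every query, two trapdoors for the same keywords differ.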

(2.4) query vector random partitioning

According to the value of the indicator vector $S$, the extended query vector $\tilde{q}$ is randomly split into two vectors $\tilde{q}'$ and $\tilde{q}''$. The splitting rule is as follows: if $S[j] = 0$, $\tilde{q}'[j]$ and $\tilde{q}''[j]$ are set to two unequal, non-zero random numbers whose sum equals $\tilde{q}[j]$; if $S[j] = 1$, then $\tilde{q}'[j] = \tilde{q}''[j] = \tilde{q}[j]$.

(2.5) generating trapdoors

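Using the inverses $M_1^{-1}$ and $M_2^{-1}$ of the key matrices, the query vectors $\tilde{q}'$ and $\tilde{q}''$ are encrypted to generate the trapdoor $T = \{M_1^{-1}\tilde{q}', M_2^{-1}\tilde{q}''\}$.

Further, the step (3) is specifically as follows: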

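(3.1) the data user submits the generated trapdoor to the cloud server for the query;

(3.2) after receiving the trapdoor, the cloud server calculates the inner product of each index with the trapdoor, sorts the inner products in descending order, and returns the $k$ highest-scoring encrypted documents to the data user; the inner product is calculated as follows: $I_i \cdot T = \tilde{D}_i' \cdot \tilde{q}' + \tilde{D}_i'' \cdot \tilde{q}'' = \tilde{D}_i \cdot \tilde{q} = r\,(D_i' \cdot q' + \sum_{j \in V} \varepsilon_{ij}) + t$, where $V$ is the set of the $v$ selected extension positions.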
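The sketch below illustrates, in plain Python, why the server can rank documents without the key: the inner product of an encrypted index with a trapdoor equals the inner product of the underlying extended vectors. It assumes the standard construction (index shares multiplied by the transposed key matrices, query shares by the inverse key matrices). All names and sample vectors are hypothetical, and the diagonally dominant key matrices are a convenience for this demo, not a recommendation for real key generation.

```python
import random

def mat_vec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def invert(M):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(M)
    A = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(A[row][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [x / p for x in A[col]]
        for row in range(n):
            if row != col:
                f = A[row][col]
                A[row] = [x - f * y for x, y in zip(A[row], A[col])]
    return [row[n:] for row in A]

def random_invertible(n, rng):
    """Random matrix shifted to be strictly diagonally dominant, hence invertible."""
    return [[rng.uniform(-1, 1) + (n if i == j else 0) for j in range(n)]
            for i in range(n)]

def split(vec, S, split_on, rng):
    """Split positions where S[j] == split_on into two random shares; copy the rest."""
    a, b = [], []
    for x, s in zip(vec, S):
        if s == split_on:
            share = rng.uniform(-1, 1)
            a.append(share)
            b.append(x - share)
        else:
            a.append(x)
            b.append(x)
    return a, b

def encrypt_index(d_ext, S, M1, M2, rng):
    d1, d2 = split(d_ext, S, 1, rng)          # index side splits where S[j] = 1
    return mat_vec(transpose(M1), d1), mat_vec(transpose(M2), d2)

def make_trapdoor(q_ext, S, M1_inv, M2_inv, rng):
    q1, q2 = split(q_ext, S, 0, rng)          # query side splits where S[j] = 0
    return mat_vec(M1_inv, q1), mat_vec(M2_inv, q2)

def score(index, trapdoor):
    return sum(sum(x * y for x, y in zip(i, t)) for i, t in zip(index, trapdoor))

rng = random.Random(0)
dim = 4                                        # k + u + 1 in the text
M1, M2 = random_invertible(dim, rng), random_invertible(dim, rng)
S = [rng.randint(0, 1) for _ in range(dim)]
# Hypothetical, already extended vectors (last index slot = 1, last query slot = t).
docs = {"doc_a": [0.9, 0.1, 0.3, 1.0], "doc_b": [0.2, 0.8, 0.5, 1.0]}
query = [1.0, 0.0, 1.0, 0.25]
indexes = {name: encrypt_index(vec, S, M1, M2, rng) for name, vec in docs.items()}
trapdoor = make_trapdoor(query, S, invert(M1), invert(M2), rng)
ranking = sorted(indexes, key=lambda name: score(indexes[name], trapdoor), reverse=True)
```

The server only ever calls `score` on ciphertexts; `ranking` is the descending order from which the highest-scoring documents are returned.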

A second aspect of the embodiments of the present invention provides a PCA-based ciphertext ranked search system, including:

the index establishing module is used for extracting keywords from the document, calculating the standardized word frequency, establishing a keyword index, reducing the dimension of an index matrix through a PCA algorithm, encrypting the index after dimension reduction, and uploading the index to a server;

the trap door creating module is used for creating a query vector according to the key words input by a data user, reducing the dimension of the query vector through a PCA algorithm and generating a trap door;

and the query module is used for sending a request to the server by a data user, and returning the sequencing result to the user through computing and searching operation after the request is received by the cloud server.

Further, the step of extracting the keywords from the document is specifically as follows:

Further, the index establishing module includes:

an initial index creation unit, used to calculate the normalized word frequency of each dictionary keyword in document $i$ to generate an index vector $D_i = (d_{i1}, d_{i2}, \ldots, d_{ik}, \ldots, d_{in})^T$, where $d_{ik}$ is the normalized word frequency, in the $i$-th document, of the $k$-th keyword of the dictionary, so that the $m$ documents form an $m \times n$ initial index matrix $D = (D_1, D_2, \ldots, D_m)^T$;

an identity matrix introduction unit, used to append an $n \times n$ identity matrix $E$ after the initial index matrix and combine the two into an $(m+n) \times n$ matrix $F = (D_1, D_2, \ldots, D_m, E)^T$;

an initial index random dimension reduction unit, used to set a threshold randomly within a certain range, analyze the principal components of the matrix $F$ with the PCA algorithm to remove similar keywords and obtain an $n \times k$ principal component matrix $R$, reduce the matrix $F$ in dimension as $F' = FR$ to obtain an $(m+n) \times k$ matrix $F'$, and then delete the last $n$ rows to obtain an $m \times k$ matrix $D' = (D_1', D_2', \ldots, D_i', \ldots, D_m')^T$, where $D_1', D_2', \ldots, D_i', \ldots, D_m'$ are $k$-dimensional column vectors;

a key generation unit, used by the data owner to randomly generate two invertible $(k+u+1) \times (k+u+1)$ matrices $M_1, M_2$ as the key, together with a $(k+u+1)$-dimensional vector $S$ as the split indicator, written $S = (s_1, s_2, \ldots, s_j, \ldots, s_{k+u+1})$ with $S \in \{0,1\}^{k+u+1}$, where $k$ is the dimension of each row vector of the reduced matrix and $u+1$ is the number of extended dimensions;

an index dimension extension unit, used to extend each vector $D_i'$ in $D'$ from $k$ dimensions to $k+u+1$ dimensions, giving an $m \times (k+u+1)$ matrix $\tilde{D}$, where for each column vector the first $k$ dimensions of $D_i'$ are kept unchanged, the last dimension is set to the constant 1, and dimensions $k+1$ through $k+u$ are set to random numbers $\varepsilon_i$; the extended column vector is written as $\tilde{D}_i = (D_i'^T, \varepsilon_{i1}, \varepsilon_{i2}, \ldots, \varepsilon_{iu}, 1)^T$ and the extended matrix as $\tilde{D} = (\tilde{D}_1, \tilde{D}_2, \ldots, \tilde{D}_m)^T$;

an index random splitting unit, used to split each extended column vector $\tilde{D}_i$ into two vectors $\tilde{D}_i'$ and $\tilde{D}_i''$ according to the value of the indicator vector $S$, with the following splitting rule: if $S[j] = 0$, then $\tilde{D}_i'[j] = \tilde{D}_i''[j] = \tilde{D}_i[j]$; if $S[j] = 1$, $\tilde{D}_i'[j]$ and $\tilde{D}_i''[j]$ are set to two unequal, non-zero random numbers whose sum equals $\tilde{D}_i[j]$;

an index encryption unit, used to encrypt the split index vectors $\tilde{D}_i'$ and $\tilde{D}_i''$ with the key $M_1, M_2$ to obtain the final encrypted index of the $i$-th document, $I_i = \{M_1^T \tilde{D}_i', M_2^T \tilde{D}_i''\}$;

an uploading unit, used by the data owner to upload the encrypted document set $C$ and the encrypted index set $I$ to the cloud server, where $I = (I_1, I_2, \ldots, I_i, \ldots, I_m)$.

Further, the trapdoor creation module comprises:

a query vector creation unit, used so that when querying, the data user first inputs keywords and the program then expands them with synonyms and near-synonyms to generate a query vector $q$; the $n$ elements of the query vector correspond to the $n$ dictionary keywords, written $q = (q_1, q_2, \ldots, q_i, \ldots, q_n)$: $q_i$ is 1 if an input keyword matches the $i$-th keyword of the keyword dictionary, and 0 otherwise;

a vector dimension reduction unit, used to reduce the query vector $q$ in dimension as $q' = qR$, from $n$ dimensions to $k$ dimensions, written $q' = (q_1', q_2', \ldots, q_i', \ldots, q_k')$;

a query vector dimension extension unit, used to extend $q'$ from $k$ dimensions to $k+u+1$ dimensions according to the following rule: among dimensions $k+1$ through $k+u$ of $q'$, $v$ randomly selected dimensions are set to 1 and the remaining ones to 0; the first $k+u$ dimensions are then multiplied by a non-zero random number $r$, and dimension $k+u+1$ is set to a random number $t$; the extended query vector is written as $\tilde{q} = (r q', r\alpha, t)$, where $\alpha \in \{0,1\}^u$ has ones in the $v$ selected positions;

a query vector random splitting unit, used to randomly split the extended query vector $\tilde{q}$ into two vectors $\tilde{q}'$ and $\tilde{q}''$ according to the value of the indicator vector $S$, with the following splitting rule: if $S[j] = 0$, $\tilde{q}'[j]$ and $\tilde{q}''[j]$ are set to two unequal, non-zero random numbers whose sum equals $\tilde{q}[j]$; if $S[j] = 1$, then $\tilde{q}'[j] = \tilde{q}''[j] = \tilde{q}[j]$;

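a trapdoor generation unit, used to encrypt the query vectors $\tilde{q}'$ and $\tilde{q}''$ with the inverses $M_1^{-1}$ and $M_2^{-1}$ of the key matrices to generate the trapdoor $T = \{M_1^{-1}\tilde{q}', M_2^{-1}\tilde{q}''\}$.

Further, the query module comprises: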

a submitting unit, used by the data user to submit the generated trapdoor to the cloud server for the query;

a query returning unit, used so that after receiving the trapdoor, the cloud server calculates the inner product of each index with the trapdoor, sorts the inner products in descending order, and returns the $k$ highest-scoring encrypted documents to the data user; the inner product is calculated as follows: $I_i \cdot T = \tilde{D}_i' \cdot \tilde{q}' + \tilde{D}_i'' \cdot \tilde{q}'' = \tilde{D}_i \cdot \tilde{q}$.

The embodiments provided by the invention have the following beneficial effects:

The PCA algorithm reduces the dimensionality of the keyword index matrix, which in turn reduces the dimensionality of the key and substantially improves data encryption speed and search efficiency. To counter privacy intrusion by unauthorized users and untrusted servers, an invertible-matrix encryption method protects the data; on this basis, the threshold is set randomly so that the dimensionality reduction itself is randomized, further improving data security. In addition, based on an analysis of how the PCA algorithm works, the invention appends an identity matrix to the index matrix before dimensionality reduction, so that the components of the query vector also take part in the dimensionality reduction, which improves security while preserving query precision.

To improve security, the trapdoor is regenerated for every query. The experiment measured how the trapdoor generation time of the FDRQM scheme varies with the number of documents under different thresholds, as shown in FIG. 2, and compared it with the MRSE scheme, as shown in FIG. 3. From the trapdoor generation formula, the time to generate a trapdoor depends on the dimension of the key and the dimension of the query vector: the smaller the threshold, the greater the dimensionality reduction, the smaller the dimensions of the key and the query vector, and the shorter the trapdoor generation time. As can be seen from FIG. 3, even with a threshold of 0.95 the FDRQM scheme performs much better than the MRSE scheme in trapdoor generation.

In the experiment, the query time of the FDRQM scheme with a threshold of 0.95 was compared with that of the MRSE scheme; the variation of query time with the number of documents is shown in FIG. 4, and the variation with the number of keywords is shown in FIG. 5. Since the dimension of the key grows with the number of documents and the number of keywords, the curves of both schemes rise. However, the query time of the MRSE scheme grows rapidly, whereas increases in the number of documents and keywords have little effect on the query time of the FDRQM scheme.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 shows the model of the ciphertext ranked search system;

FIG. 2 shows the variation of trapdoor generation time with the number of documents for the FDRQM scheme under different thresholds;

FIG. 3 shows the variation of trapdoor generation time with the number of documents for the MRSE scheme and the FDRQM scheme with a threshold of 0.95;

FIG. 4 shows the variation of query time with the number of documents;

FIG. 5 shows the variation of query time with the number of keywords.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The embodiment of the invention provides a ciphertext sequencing search method based on PCA, which comprises the following steps:

(1) extracting keywords from the document, calculating a standardized word frequency, establishing a keyword index, reducing the dimension of an index matrix through a PCA (principal component analysis) algorithm, encrypting the index after dimension reduction, and uploading the index to a server;

First, keywords are extracted from the documents as follows: when processing the data, the data owner first extracts the keywords of each document to generate a per-document keyword set, then merges the keywords of all documents into a duplicate-free keyword dictionary containing n keywords. The embodiment of the invention uses RFC documents as the data set for encrypted queries. Request for Comments (RFC) is a numbered series of documents that collects information about the Internet as well as software documents from the UNIX and Internet communities. RFC documents are currently published under the auspices of the Internet Society (ISOC). The basic Internet communication protocols are specified in RFC documents, which also cover many topics beyond the standards themselves, such as records of the Internet's development and newly developed protocols. Almost all Internet standards are therefore contained in RFC documents.

(1.1) creating an initial index

The normalized word frequency of each dictionary keyword in document $i$ is calculated to generate an index vector $D_i = (d_{i1}, d_{i2}, \ldots, d_{ik}, \ldots, d_{in})^T$, where $d_{ik}$ is the normalized word frequency, in the $i$-th document, of the $k$-th keyword of the dictionary. The $m$ documents thus form an $m \times n$ initial index matrix $D = (D_1, D_2, \ldots, D_m)^T$.

(1.2) introducing an identity matrix

An $n \times n$ identity matrix $E$ is appended after the initial index matrix, and the two are combined into an $(m+n) \times n$ matrix $F = (D_1, D_2, \ldots, D_m, E)^T$.

(1.3) random dimensionality reduction of initial index

A threshold is set randomly within a certain range, and the principal components of the matrix $F$ are analyzed with the PCA algorithm to remove similar keywords, yielding an $n \times k$ principal component matrix $R$. The matrix $F$ is then reduced in dimension as $F' = FR$, giving an $(m+n) \times k$ matrix $F'$. The last $n$ rows are then deleted to obtain an $m \times k$ matrix $D' = (D_1', D_2', \ldots, D_i', \ldots, D_m')^T$, where $D_1', D_2', \ldots, D_i', \ldots, D_m'$ are $k$-dimensional column vectors.

(1.4) generating a Key

The data owner randomly generates two invertible $(k+u+1) \times (k+u+1)$ matrices $M_1, M_2$ as the key, together with a $(k+u+1)$-dimensional vector $S$ as the split indicator, written $S = (s_1, s_2, \ldots, s_j, \ldots, s_{k+u+1})$ with $S \in \{0,1\}^{k+u+1}$, where $k$ is the dimension of each row vector of the reduced matrix and $u+1$ is the number of extended dimensions.

(1.5) index dimension extension

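Each vector $D_i'$ in $D'$ is extended from $k$ dimensions to $k+u+1$ dimensions: the first $k$ dimensions of $D_i'$ are kept unchanged, the last dimension is set to the constant 1, and dimensions $k+1$ through $k+u$ are set to random numbers $\varepsilon_i$. The extended column vector is written as $\tilde{D}_i = (D_i'^T, \varepsilon_{i1}, \varepsilon_{i2}, \ldots, \varepsilon_{iu}, 1)^T$ and the extended matrix as $\tilde{D} = (\tilde{D}_1, \tilde{D}_2, \ldots, \tilde{D}_m)^T$, where $\varepsilon_{i1}, \varepsilon_{i2}, \ldots, \varepsilon_{iu}$ are random numbers that all follow the same uniform distribution $U(\mu' - c, \mu' + c)$.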

(1.6) index random partitioning

According to the value of the indicator vector $S$, each extended column vector $\tilde{D}_i$ is split into two vectors $\tilde{D}_i'$ and $\tilde{D}_i''$. The splitting rule is as follows: if $S[j] = 0$, then $\tilde{D}_i'[j] = \tilde{D}_i''[j] = \tilde{D}_i[j]$; if $S[j] = 1$, $\tilde{D}_i'[j]$ and $\tilde{D}_i''[j]$ are set to two unequal, non-zero random numbers whose sum equals $\tilde{D}_i[j]$.

(1.7) index encryption

Using the key $M_1, M_2$, the split index vectors $\tilde{D}_i'$ and $\tilde{D}_i''$ are encrypted to obtain the final encrypted index of the $i$-th document, $I_i = \{M_1^T \tilde{D}_i', M_2^T \tilde{D}_i''\}$.

(1.8) The data owner uploads the encrypted document set $C$ and the encrypted index set $I$ to the cloud server, where $I = (I_1, I_2, \ldots, I_i, \ldots, I_m)$.

(2) A data user inputs keywords, a query vector is established, dimensionality reduction is carried out on the query vector through a PCA algorithm, and a trapdoor is generated; the method comprises the following specific steps:

(2.1) creating a query vector

When querying, the data user first inputs keywords, and the program then expands them with synonyms and near-synonyms to generate a query vector $q$. The $n$ elements of the query vector correspond to the $n$ dictionary keywords, written $q = (q_1, q_2, \ldots, q_i, \ldots, q_n)$: $q_i$ is 1 if an input keyword matches the $i$-th keyword of the keyword dictionary, and 0 otherwise.

(2.2) vector dimensionality reduction

The query vector $q$ is reduced in dimension as $q' = qR$, from $n$ dimensions to $k$ dimensions, written $q' = (q_1', q_2', \ldots, q_i', \ldots, q_k')$.

(2.3) query vector dimension expansion

$q'$ is extended from $k$ dimensions to $k+u+1$ dimensions according to the following rule: among dimensions $k+1$ through $k+u$ of $q'$, $v$ randomly selected dimensions are set to 1 and the remaining ones to 0; the first $k+u$ dimensions are then multiplied by a non-zero random number $r$, and dimension $k+u+1$ is set to a random number $t$. The extended query vector is written as $\tilde{q} = (r q', r\alpha, t)$, where $\alpha \in \{0,1\}^u$ has ones in the $v$ selected positions.

(2.4) query vector random partitioning

According to the value of the indicator vector $S$, the extended query vector $\tilde{q}$ is randomly split into two vectors $\tilde{q}'$ and $\tilde{q}''$. The splitting rule is as follows: if $S[j] = 0$, $\tilde{q}'[j]$ and $\tilde{q}''[j]$ are set to two unequal, non-zero random numbers whose sum equals $\tilde{q}[j]$; if $S[j] = 1$, then $\tilde{q}'[j] = \tilde{q}''[j] = \tilde{q}[j]$.

(2.5) generating trapdoors

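Using the inverses $M_1^{-1}$ and $M_2^{-1}$ of the key matrices, the query vectors $\tilde{q}'$ and $\tilde{q}''$ are encrypted to generate the trapdoor $T = \{M_1^{-1}\tilde{q}', M_2^{-1}\tilde{q}''\}$.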

(3) The data user sends a request to the server, and the cloud server returns the sequencing result to the user through computing and searching operations after receiving the request, wherein the method specifically comprises the following steps:

the embodiment of the invention also provides a system corresponding to the method, namely a ciphertext sequencing search system based on PCA, which comprises the following steps:

Further, the index establishing module includes:

an initial index random dimension reduction unit, used to set a threshold randomly within a certain range, analyze the principal components of the matrix $F$ with the PCA algorithm to remove similar keywords and obtain an $n \times k$ principal component matrix $R$, reduce the matrix $F$ in dimension as $F' = FR$ to obtain an $(m+n) \times k$ matrix $F'$, and then delete the last $n$ rows to obtain an $m \times k$ matrix $D' = (D_1', D_2', \ldots, D_i', \ldots, D_m')^T$, where $D_1', D_2', \ldots, D_i', \ldots, D_m'$ are $k$-dimensional column vectors;

an index dimension extension unit, used to extend each vector $D_i'$ in $D'$ from $k$ dimensions to $k+u+1$ dimensions, giving an $m \times (k+u+1)$ matrix $\tilde{D}$, where for each column vector the first $k$ dimensions of $D_i'$ are kept unchanged, the last dimension is set to the constant 1, and dimensions $k+1$ through $k+u$ are set to random numbers $\varepsilon_i$; the extended column vector is written as $\tilde{D}_i = (D_i'^T, \varepsilon_{i1}, \varepsilon_{i2}, \ldots, \varepsilon_{iu}, 1)^T$ and the extended matrix as $\tilde{D} = (\tilde{D}_1, \tilde{D}_2, \ldots, \tilde{D}_m)^T$;

an index random splitting unit, used to split each extended column vector $\tilde{D}_i$ into two vectors $\tilde{D}_i'$ and $\tilde{D}_i''$ according to the value of the indicator vector $S$, with the following splitting rule: if $S[j] = 0$, then $\tilde{D}_i'[j] = \tilde{D}_i''[j] = \tilde{D}_i[j]$; if $S[j] = 1$, $\tilde{D}_i'[j]$ and $\tilde{D}_i''[j]$ are set to two unequal, non-zero random numbers whose sum equals $\tilde{D}_i[j]$;

an index encryption unit, used to encrypt the split index vectors $\tilde{D}_i'$ and $\tilde{D}_i''$ with the key $M_1, M_2$ to obtain the final encrypted index of the $i$-th document, $I_i = \{M_1^T \tilde{D}_i', M_2^T \tilde{D}_i''\}$.

Further, the trapdoor creation module comprises:

a query vector creation unit, used so that when querying, the data user first inputs keywords and the program then expands them with synonyms and near-synonyms to generate a query vector $q$; the $n$ elements of the query vector correspond to the $n$ dictionary keywords, written $q = (q_1, q_2, \ldots, q_i, \ldots, q_n)$: $q_i$ is 1 if an input keyword matches the $i$-th keyword of the keyword dictionary, and 0 otherwise;

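a query vector random splitting unit, used to randomly split the extended query vector $\tilde{q}$ into two vectors $\tilde{q}'$ and $\tilde{q}''$ according to the value of the indicator vector $S$, with the following splitting rule: if $S[j] = 0$, $\tilde{q}'[j]$ and $\tilde{q}''[j]$ are set to two unequal, non-zero random numbers whose sum equals $\tilde{q}[j]$; if $S[j] = 1$, then $\tilde{q}'[j] = \tilde{q}''[j] = \tilde{q}[j]$;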

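a trapdoor generation unit, used to encrypt the query vectors $\tilde{q}'$ and $\tilde{q}''$ with the inverses $M_1^{-1}$ and $M_2^{-1}$ of the key matrices to generate the trapdoor $T = \{M_1^{-1}\tilde{q}', M_2^{-1}\tilde{q}''\}$.

Further, the query module comprises: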

The ciphertext ranked search system has three main roles: the data owner, the cloud server, and the data user. The relationship among the three is shown in FIG. 1.

The data owner first extracts the document keywords, creates a keyword index for each document, encrypts the keyword index and the document content with a key, and uploads the encrypted index and the encrypted documents to the cloud server. The cloud server does not know the plaintext of the index and cannot access the content of the encrypted documents. An authorized data user inputs keywords to generate a query vector, obtains a trapdoor through security control, and submits the generated trapdoor to the cloud server. After receiving the search request, the cloud server calculates the secure inner product of the trapdoor with the index vector of every document to obtain each document's keyword score, sorts the documents in descending order of score, and returns the top k encrypted documents to the data user. After receiving the data, the data user obtains the key of the encrypted documents through access control and decrypts them.

In order to better evaluate the reliability of the encryption algorithm, the attack categories of the cloud server are divided into different levels according to the acquired information. The embodiment of the invention adopts the following attack models:

Level 1: the cloud server can observe the encrypted data set C and the encrypted index set I uploaded by the data owner, and the query trapdoor T submitted by the data user.

Level 2: in addition to the Level 1 information, the cloud server can obtain further information. For example, it may judge the relevance of query trapdoors by combining existing trapdoors with query results, or analyze the encryption process and use background information about the encryption to deduce the encryption key.

Different keywords have different relevance to different documents. To reflect the importance of each keyword to each document, a document score is introduced; it is the basis for ranking and returning search results. The invention computes the document score from term frequency and inverse document frequency (tf·idf). The term frequency is the number of times a keyword occurs in a document: the more often it occurs, the more important the keyword is to that document. The inverse document frequency reflects the number of documents containing the keyword: the more documents contain it, the less it distinguishes between documents. The normalized tf·idf of equation (1) is used to compute the score of document $H_i$ for a submitted query $Q$.

Here $f_{i,b}$ denotes the number of occurrences of keyword $w_b$ in document $H_i$, $|H_i|$ denotes the size of the set of keywords contained in document $H_i$, $f_b$ denotes the number of documents containing the keyword $w_b$, and $m$ denotes the total number of documents. When the index is created, each dimension of the vector is set to the product of the normalized term frequency and the normalized inverse document frequency of the corresponding keyword.
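Equation (1) is not reproduced in this text, so the sketch below is only an assumption: it uses one widely used normalized tf·idf form from the ranked searchable encryption literature, in which each query keyword contributes (1 + ln f_{i,b}) · ln(1 + m / f_b) / |H_i|. The function name, argument names, and sample values are hypothetical.

```python
import math

def tfidf_score(query_keywords, doc_counts, doc_length, doc_freq, m):
    """Score one document for a query.
    doc_counts: keyword -> occurrences in this document (f_{i,b})
    doc_length: normalization factor for this document (|H_i|, assumed)
    doc_freq:   keyword -> number of documents containing it (f_b)
    m:          total number of documents
    """
    total = 0.0
    for w in query_keywords:
        f_ib = doc_counts.get(w, 0)
        f_b = doc_freq.get(w, 0)
        if f_ib == 0 or f_b == 0:
            continue  # keyword absent from this document or from the collection
        total += (1 + math.log(f_ib)) * math.log(1 + m / f_b) / doc_length
    return total

# Hypothetical single-keyword example: one occurrence, one of one documents.
example_score = tfidf_score(["cloud"], {"cloud": 1}, 4, {"cloud": 1}, 1)
missing_score = tfidf_score(["key"], {"cloud": 1}, 4, {"cloud": 1}, 1)
```

In the scheme described here the per-keyword factors would be stored in the index vector, so that the inner product with a 0/1 query vector sums them over the queried keywords.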

The above description is only a preferred embodiment of the present invention and is not intended to limit its scope. All equivalent structures or equivalent process modifications made using the contents of this specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, fall within the scope of the present invention.

Claims

1. A ciphertext sorting search method based on PCA is characterized by comprising the following steps:

(3) the data user sends a request to the server; after receiving the request, the cloud server returns the ranked results to the user through computation and search operations;

in the step (1), the step of extracting the keywords from the document is specifically as follows:

when a data owner processes data, firstly extracting keywords of each document to generate a document keyword set, then summarizing the keywords of all documents to generate a non-repeated keyword dictionary, wherein the number of the keywords of the dictionary is n;

in the step (1), the normalized word frequency is calculated, the keyword index is established, then the dimensionality of the index matrix is reduced through a PCA algorithm, the index after dimensionality reduction is encrypted and uploaded to a server, and the method specifically comprises the following steps:

(1.1) creating an initial index

(1.2) introducing an identity matrix

An n × n identity matrix E is appended after the initial index matrix, and the two are combined into a matrix F of dimension (m + n) × n, where F = (D₁, D₂, …, D_m, E)^T;

(1.3) random dimensionality reduction of initial index

A threshold is set randomly within a certain range, and the principal components of the matrix F are analyzed according to the PCA algorithm; similar keywords are removed to obtain a principal component matrix R of dimension n × k. The matrix F is then reduced in dimension by F′ = FR, giving a matrix F′ of dimension (m + n) × k. The last n rows are then deleted, giving an m × k matrix D′, denoted D′ = (D′₁, D′₂, …, D′_i, …, D′_m)^T, where D′₁, D′₂, …, D′_i, …, D′_m are column vectors of dimension k;

(1.4) generating a Key

(1.5) index dimension extension

is denoted as

the expanded matrix

is denoted as

(1.6) index random partitioning

According to the value of the vector S, each index column vector

is divided into

and

The segmentation rule is as follows: if S[n] is equal to 0, then

if S[n] is equal to 1, then

and

are set to two unequal, non-zero random numbers whose sum equals

(1.7) index encryption

Using the secret keys M₁ and M₂, the divided indexes

and

are encrypted to obtain the final encrypted index of the i-th document
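
For illustration only, steps (1.2) and (1.3) of claim 1 can be sketched in Python with NumPy as follows. The claim does not state how the random threshold selects k, so the sketch assumes it is a cumulative explained-variance ratio drawn uniformly from a hypothetical range of 0.90 to 0.99. The function name `reduce_index` and computing the principal components by eigendecomposition of the covariance matrix are likewise assumptions, not the claimed implementation.

```python
import numpy as np

def reduce_index(D, low=0.90, high=0.99, rng=None):
    """D: m x n initial index matrix (n >= 2 assumed).

    Returns (D_reduced, R, threshold), where R is the n x k principal
    component matrix and D_reduced = D R is the m x k reduced index."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = D.shape
    # Step (1.2): append the n x n identity matrix, F is (m + n) x n.
    F = np.vstack([D, np.eye(n)])
    # Step (1.3): random threshold (assumed: cumulative variance ratio).
    threshold = rng.uniform(low, high)
    cov = np.cov(F, rowvar=False)              # n x n covariance of keyword columns
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # re-sort in descending order
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    # Smallest k whose cumulative ratio reaches the threshold.
    k = min(int(np.searchsorted(ratio, threshold)) + 1, n)
    R = eigvecs[:, :k]                         # n x k principal component matrix
    F_reduced = F @ R                          # F' = F R, (m + n) x k
    D_reduced = F_reduced[:m]                  # delete the last n rows
    return D_reduced, R, threshold

gen = np.random.default_rng(0)
D = gen.random((6, 5))
D_reduced, R, threshold = reduce_index(D, rng=gen)
q = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
q_reduced = q @ R                              # query reduced with the same R
approx_scores = D_reduced @ q_reduced          # approximates D @ q
```

Because the columns of R are orthonormal, the product of a reduced index vector and a reduced query vector approximates the original inner product, and the two coincide when k = n.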

2. The ciphertext ordering search method based on PCA according to claim 1, wherein the step (2) is specifically as follows:

(2.1) creating a query vector

(2.2) vector dimensionality reduction

(2.3) query vector dimension expansion

(2.4) query vector random partitioning

According to the value of the indicator vector S, the expanded query vector

is randomly divided into two vectors

and

which are respectively denoted as

The segmentation rule is as follows: if S[n] is equal to 0, then

and

are set to two unequal, non-zero random numbers whose sum equals

if S[n] is equal to 1, then

(2.5) generating trapdoors

Using the inverses of the secret keys

and

the query vectors

and

are encrypted to generate the trapdoor T,

3. The ciphertext ordering search method based on PCA according to claim 2, wherein the step (3) is specifically as follows:

4. A ciphertext sorted search system based on PCA, comprising:

a query module, used by a data user to send a request to the server; after receiving the request, the cloud server returns the ranked results to the user through computation and search operations;

the steps of extracting the keywords from the document are as follows:

when a data owner processes data, firstly extracting keywords of each document to generate a document keyword set, then summarizing the keywords of all documents to generate a non-repeated keyword dictionary, wherein the number of the dictionary keywords is n;

the index establishing module comprises:

an identity matrix introduction unit for appending an n × n identity matrix E after the initial index matrix and combining the two into a matrix F of dimension (m + n) × n, where F = (D₁, D₂, …, D_m, E)^T;

an index dimension extension unit for performing dimension expansion on each vector D′_i in D′, from k dimensions to k + u + 1 dimensions, to obtain an m × (k + u + 1) matrix

is denoted as

the expanded matrix

is denoted as

is divided into

and

The segmentation rule is as follows: if S[n] is equal to 0, then

if S[n] is equal to 1, then

and

are set to two unequal, non-zero random numbers whose sum equals

an index encryption unit for using the keys M₁ and M₂ to encrypt the divided indexes

and

to obtain the final encrypted index of the i-th document

5. The ciphertext sorted search system of claim 4, wherein the create trapdoor module comprises:

a query vector creation unit, used as follows: when a data user queries, keywords are first input, and a program then expands them with synonyms and near-synonyms to generate a query vector q; the elements of the query vector correspond one-to-one to the n dictionary keywords, denoted q = (q₁, q₂, …, q_i, …, q_n); if an input keyword matches the i-th keyword in the keyword dictionary, q_i is 1; otherwise it is 0;

a vector dimension reduction unit for reducing the dimension of the query vector q by q′ = qR, so that q is reduced from n dimensions to k dimensions, denoted q′ = (q′₁, q′₂, …, q′_i, …, q′_k);

a query vector dimension expansion unit for performing dimension expansion on q′, from k dimensions to k + u + 1 dimensions, where the expansion rule is as follows: v dimensions are randomly selected from the (k + 1)-th to the (k + u)-th dimensions of q′ and set to 1, and the others are set to 0; the first k + u dimensions are then multiplied by a non-zero random number r, and the (k + u + 1)-th dimension is set to a random number t; the expanded query vector is denoted as