CN115757435A

CN115757435A - Screening factor determination method supporting semantic perception ciphertext retrieval acceleration

Info

Publication number: CN115757435A
Application number: CN202211579597.0A
Authority: CN
Inventors: 戴华; 刘源龙; 周倩; 邓寅甫; 陈燕俐; 杨庚
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2022-12-09
Filing date: 2022-12-09
Publication date: 2023-03-07

Abstract

The invention belongs to the technical field of information retrieval and discloses a screening factor determination method supporting semantic perception ciphertext retrieval acceleration. In the first stage, a semantic relevance partition sequence between each keyword and the documents is established: a semantic perception model is used to compute the semantic vector of each document; keywords are extracted and the semantic vector of each keyword is computed; the semantic relevance between each keyword and each document is computed to form a semantic relevance sequence, which is sorted in descending order; the sorted sequence is then partitioned, generating for each keyword its semantic relevance partition sequence with the documents. In the second stage, the screening factor is calculated and determined from the semantic relevance partition sequences according to the search keywords. The method is suitable for semantic perception ciphertext retrieval scenarios based on tree-structured indexes, can significantly improve retrieval speed, and does not affect the accuracy of the search results.

Description

Screening factor determination method supporting semantic perception ciphertext retrieval acceleration

Technical Field

The invention belongs to the technical field of information retrieval, and particularly relates to a screening factor determination method supporting semantic perception ciphertext retrieval acceleration.

Background

With the continuous development of internet technology and the growing number of software users, data volumes have become increasingly large, and local data storage can no longer meet growing business requirements. To address this, people have turned to outsourcing data to cloud servers, where users can consume computing resources on demand. In short, cloud computing uses the transmission capability of the internet to move data from local servers to the cloud and process it there. Although cloud computing has many advantages, it also raises problems such as data privacy. The most common and direct way to protect the privacy of outsourced data is to encrypt the data before sending it to the cloud server. However, encryption reduces the usability of the data and makes basic operations such as data retrieval difficult. Encryption also obscures the semantics of the data, making it hard to discover the semantic relationship between the data and a query. Many searchable encryption methods have therefore been proposed that retrieve data on the cloud server efficiently and accurately while preserving the privacy of the outsourced data.

In recent years, the searchable encryption methods proposed by researchers have mainly adopted tree-structured indexes to perform ranked retrieval over encrypted documents. These methods construct a tree index that is structurally simple and secure in itself, and find the top-k most relevant encrypted documents through depth-first search. For example, "Xia Z, Wang X, Sun X, et al. A Secure and Dynamic Multi-Keyword Ranked Search Scheme over Encrypted Cloud Data. IEEE Transactions on Parallel and Distributed Systems, 2015" uses a balanced binary tree index; "Dai H, Dai X, Yi X, et al. Semantic-Aware Multi-Keyword Ranked Search Scheme over Encrypted Cloud Data. Journal of Network and Computer Applications, 2019" uses a full binary tree index containing semantic feature information; and "Hu Z, Dai H, Yang G, Yi X, Sheng W. Semantic-Based Multi-Keyword Ranked Search Schemes over Encrypted Cloud Data. Security and Communication Networks, 2022" uses a tree index containing semantic information for retrieval.

The general approach of such searchable encryption schemes is to convert documents and keywords into vector representations, organize the documents with a tree index, encrypt the documents and the index, and send both to the cloud server. After the user submits a search request, the cloud server searches the encrypted tree index and returns the required ciphertexts, which the user then decrypts. Existing tree-index-based retrieval methods usually use depth-first search: during the search, the retrieval screening factor starts at 0 and is updated according to the leaf nodes traversed, and subtrees that cannot satisfy the requirement are pruned using the screening factor, which accelerates retrieval. However, in existing tree-index-based methods, including the three papers mentioned above, the initial retrieval screening factor is always set to 0. If a suitable screening factor could be determined before the search starts, more unqualified subtrees could be filtered out early in the search, further accelerating the depth-first search.

Disclosure of Invention

In order to solve the technical problem, the invention provides a screening factor determination method supporting semantic perception ciphertext retrieval acceleration, which can improve the retrieval efficiency under the condition of not influencing the retrieval result precision.

The invention relates to a screening factor determination method supporting semantic perception ciphertext retrieval acceleration, which comprises the following steps:

Step 1: for each keyword, establish the semantic relevance partition sequence of that keyword with the documents;

Step 2: according to the search keywords, calculate the screening factor using the semantic relevance partition sequences.

Further, step 1 specifically comprises:

Step 1a: use a semantic perception model to compute the semantic vector of each document $d_j$ in the document set $D = \{d_1, d_2, \ldots, d_j, \ldots, d_n\}$, where $j$ ranges from 1 to $n$; extract keywords from each document $d_j$ to generate the keyword set $W = \{w_1, w_2, \ldots, w_i, \ldots, w_m\}$, where $i$ ranges from 1 to $m$, and compute the semantic vector of each keyword $w_i$;

Step 1b: for each keyword $w_i \in W$, compute its semantic relevance $\mathrm{relevance}(w_i, d_j)$ with each document $d_j \in D$, establish the semantic relevance sequence $L_i$ of $w_i$ with the documents in $D$, and sort the sequence in descending order;

Step 1c: according to the semantic relevance sequence $L_i$ of $w_i$ and a given segmentation parameter $\tau$, divide the sequence into partitions of equal size to generate the semantic relevance partition sequence $SPT_i$ of $w_i$ with the documents. Each partition is represented as a two-tuple whose two elements are the upper boundary and the lower boundary of that partition.

Further, step 1c specifically includes:

Step 1c1: for each $w_i$, arrange its relevance scores with the documents in $D$ in descending order to generate the semantic relevance sequence $L_i$; for each keyword $w_i$ in $W$, divide $L_i$ into equal-sized partitions according to the segmentation parameter $\tau$, constructing the semantic relevance partition sequence $SPT_i$ of $w_i$, which contains $\lceil n/\tau \rceil$ partitions;

Step 1c2: in the partition sequence $SPT_i$, each of the first $\lceil n/\tau \rceil - 1$ partitions contains $\tau$ relevance scores and the last partition contains at most $\tau$; for any two adjacent partitions, any relevance score in the earlier partition is greater than any relevance score in the later partition;

Step 1c3: for each partition in $SPT_i$, construct its two-tuple by computing the upper boundary and the lower boundary of the partition.

Further, step 1c3 specifically comprises:

for each partition of the $SPT_i$ corresponding to $w_i$, the upper and lower boundaries of its two-tuple are computed as follows [formulas not reproduced in the source text], where $rand(x, y)$ denotes a random value between $x$ and $y$, $\min(X)$ denotes the minimum element of the set $X$, and $\max(X)$ denotes the maximum element of the set $X$.

Further, step 2 specifically comprises:

Step 2a: let $Q$ be the retrieval keyword set submitted by the user and $k$ the number of documents the user requires; for each retrieval keyword $w_n$ in $Q$, where $n$ ranges from 1 to $|Q|$, compute the union $U_x$ of the document identifier sets in the first $x$ semantic relevance partitions of $w_n$. If $U_x$ satisfies the specified condition [formula not reproduced in the source text], the quantity determined by that condition [symbol not reproduced in the source text] is the local retrieval screening factor corresponding to $w_n$;

Step 2b: from the local retrieval screening factors of all keywords $w_n$ in the retrieval set $Q$, calculate the final screening factor $t$ [formula not reproduced in the source text].

Further, step 2a specifically comprises:

for each keyword $w_n$, $U_x$ is calculated as follows: [formula not reproduced in the source text].

The beneficial effects of the invention are as follows:

1. With the retrieval screening factor determination method, more subtrees that do not satisfy the requirement can be screened out, significantly accelerating the encrypted search process;

2. The invention determines the retrieval screening factor from the semantic relevance partition sequences. This screening factor does not reveal the relevance score between any document and keyword, and because it is smaller than the relevance score of the last document in the candidate result set, no result is missed; the invention therefore accelerates retrieval while guaranteeing that the retrieval result is unchanged;

3. The invention supports semantic perception ciphertext retrieval scenarios based on tree-structured indexes, does not depend on a specific method for quantifying keyword-document relevance, and can be used with any semantic perception relevance measure (for example, the LDA model or the BERT model), so it is broadly applicable.
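The second benefit can be made concrete with a short argument. The following is only a sketch under assumptions that the text does not state explicitly: $s_{(k)}$ is a symbol introduced here for the $k$-th largest relevance score between the query and the documents, $t_0$ is introduced here for the predetermined initial screening factor, a node is pruned only when its score is strictly below the current screening factor (as in the worked example of FIG. 3 and FIG. 4), and the score of an internal node is assumed to be no smaller than the score of any leaf beneath it.

```latex
% Sketch: an initial screening factor t_0 <= s_(k) cannot change the top-k result.
\begin{align*}
&\text{Let } s_{(1)} \ge s_{(2)} \ge \dots \ge s_{(n)} \text{ be the relevance scores of the } n \text{ documents for the query } Q.\\
&\text{During the search, } t \text{ is either } t_0 \text{ or the } k\text{-th best score among the leaves visited so far, hence } t \le s_{(k)}.\\
&\text{For every top-}k \text{ document } d:\quad \mathrm{relevance}(Q, d) \ge s_{(k)} \ge t,\\
&\text{so neither } d \text{ nor any of its ancestors satisfies } \mathrm{score} < t, \text{ and } d \text{ is never pruned.}
\end{align*}
```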

Drawings

FIG. 1 is a flow chart of a search screening factor determination method according to the present invention;

FIG. 2 is a schematic diagram of a semantic relatedness partitioning sequence generated by the present invention;

FIG. 3 is a diagram illustrating an exemplary search process with a search screening factor of 0 according to the present invention;

FIG. 4 is a diagram illustrating an exemplary search process with a search screening factor of 0.51 according to the present invention.

Detailed Description

To make the content of the invention easier to understand, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

For convenience of description, the following definitions are now made for the relevant symbols:

Document set $D = \{d_1, d_2, \ldots, d_n\}$; the words contained in the documents form the keyword set $W = \{w_1, w_2, \ldots, w_m\}$; $Q$ is the retrieval keyword set submitted by the user, and $k$ is the number of documents to be returned by the retrieval; $\mathrm{relevance}(w_i, d_j)$ denotes the single-keyword, single-document semantic relevance score between keyword $w_i$ and document $d_j$; $SPT_i$ is the semantic relevance partition sequence of $w_i$ with the documents; the upper and lower boundaries of the $x$-th partition of $w_i$ are each denoted by a dedicated symbol [symbols not reproduced in the source text].

FIG. 1 is a flow chart of the invention, depicting the screening factor calculation method supporting semantic perception ciphertext retrieval acceleration. A semantic perception model is used to compute the semantic vector of each document; keywords are extracted from the documents and the semantic vector of each keyword is computed; the semantic relevance between each keyword and each document is computed to form a semantic relevance sequence, which is sorted in descending order; the sequence is partitioned, generating for each keyword its semantic relevance partition sequence with the documents; and, according to the search keywords, the screening factor is calculated and determined from the semantic relevance partition sequences.

The screening factor determination method supporting semantic perception ciphertext retrieval acceleration comprises two stages: (1) constructing the semantic relevance partition sequences; (2) calculating and determining the screening factor.

The first stage: for each keyword, construct the semantic relevance partition sequence of that keyword with the documents.

The specific steps are as follows:

Step 1a: use a semantic perception model to compute the semantic vector of each document $d_j$; extract keywords from the documents to generate the keyword set $W$, and compute the semantic vector of each keyword $w_i$;

Step 1b: for each keyword $w_i \in W$, compute its semantic relevance $\mathrm{relevance}(w_i, d_j)$ with each document $d_j \in D$, establish the semantic relevance sequence $L_i$ of $w_i$ with the documents in $D$, and sort the sequence in descending order;

Step 1c: according to the semantic relevance sequence $L_i$ of $w_i$ and a given segmentation parameter $\tau$, divide the sequence into partitions of equal size to generate the semantic relevance partition sequence $SPT_i$ of $w_i$ with the documents. Each partition is represented as a two-tuple whose two elements are the upper boundary and the lower boundary of that partition. The generated semantic relevance partition sequence $SPT_i$ is shown in FIG. 2, and the specific generation steps are as follows:

Step 1c1: for each $w_i$, arrange its relevance scores with the documents in $D$ in descending order to generate the semantic relevance sequence $L_i$; for each keyword $w_i$ in $W$, divide $L_i$ into equal-sized partitions according to the segmentation parameter $\tau$, constructing the semantic relevance partition sequence $SPT_i$ of $w_i$, which contains $\lceil n/\tau \rceil$ partitions;

Step 1c2: in the partition sequence $SPT_i$, each of the first $\lceil n/\tau \rceil - 1$ partitions contains $\tau$ relevance scores and the last partition contains at most $\tau$; for any two adjacent partitions, any relevance score in the earlier partition is greater than any relevance score in the later partition;

Step 1c3: for each partition in $SPT_i$, construct its two-tuple; the upper and lower boundaries of each partition are calculated as follows [formulas not reproduced in the source text], where $rand(x, y)$ denotes a random value between $x$ and $y$, $\min(X)$ denotes the minimum element of the set $X$, and $\max(X)$ denotes the maximum element of the set $X$.
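The first stage can be illustrated with a minimal Python sketch. Several things here are assumptions rather than part of the source: cosine similarity stands in for the semantic relevance measure (the text leaves the measure open), the function names `cosine` and `build_spt` and the dictionary layout of a partition are hypothetical, and the boundaries are simply taken as the largest and smallest score in each partition, whereas the source derives them with $rand(x, y)$ using formulas that are not available here.

```python
import math

def cosine(u, v):
    """Cosine similarity, used here only as a stand-in relevance measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_spt(keyword_vec, doc_vecs, tau):
    """Build the partition sequence for one keyword.

    ASSUMPTION: upper/lower boundaries are the max/min score of the partition;
    the source perturbs them with rand(x, y), which is not reproduced here.
    """
    # Step 1b: relevance sequence L_i, sorted by score in descending order.
    scored = sorted(
        ((cosine(keyword_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Steps 1c1-1c2: consecutive chunks of tau scores; the last may be shorter.
    partitions = []
    for start in range(0, len(scored), tau):
        chunk = scored[start:start + tau]
        scores = [score for score, _ in chunk]
        # Step 1c3 (simplified): record the boundary two-tuple of the partition.
        partitions.append({
            "doc_ids": [doc_id for _, doc_id in chunk],
            "upper": max(scores),
            "lower": min(scores),
        })
    return partitions

# Hypothetical 2-dimensional semantic vectors for five documents.
docs = {
    "d1": [1.0, 0.0],
    "d2": [0.8, 0.6],
    "d3": [0.6, 0.8],
    "d4": [0.0, 1.0],
    "d5": [-1.0, 0.0],
}
spt = build_spt([1.0, 0.0], docs, tau=2)
```

With five documents and $\tau = 2$, the sketch yields three partitions, the last of which holds a single document, matching the $\lceil n/\tau \rceil$ count described in step 1c1.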

The second stage: according to the search keywords, calculate the screening factor using the semantic relevance partition sequences:

Step 2a: for each retrieval keyword $w_n$ in $Q$, compute the union $U_x$ of the document identifier sets in the first $x$ semantic relevance partitions of $w_n$ [formula not reproduced in the source text]. If $U_x$ satisfies the specified condition [formula not reproduced in the source text], the quantity determined by that condition [symbol not reproduced in the source text] is the local retrieval screening factor corresponding to $w_n$;

Step 2b: from the local retrieval screening factors of all keywords $w_n$ in the retrieval set $Q$, calculate the final screening factor $t$ [formula not reproduced in the source text].
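For a single retrieval keyword, the second stage might look like the sketch below. It rests on assumptions, because the source formulas are not available: the condition on $U_x$ is taken to be $|U_x| \ge k$, and the local screening factor is taken to be the lower boundary of the partition at which that condition first holds. The function name `local_screening_factor`, the partition layout and the sample scores are hypothetical. How the local factors of several keywords are combined into the final $t$ is deliberately left out.

```python
def local_screening_factor(partitions, k):
    """Accumulate U_x partition by partition until it covers k documents.

    ASSUMPTION: the condition on U_x is |U_x| >= k, and the local factor is
    the lower boundary of the partition where the condition first holds.
    Returns 0.0 (the conventional initial factor) if fewer than k documents exist.
    """
    seen = set()  # U_x: union of the document ids in the first x partitions
    for part in partitions:
        seen.update(part["doc_ids"])
        if len(seen) >= k:
            return part["lower"]
    return 0.0

# Hypothetical partition sequence for one keyword (tau = 2, scores descending).
spt = [
    {"doc_ids": ["d1", "d6"], "upper": 0.56, "lower": 0.53},
    {"doc_ids": ["d3", "d4"], "upper": 0.48, "lower": 0.40},
    {"doc_ids": ["d2"], "upper": 0.10, "lower": 0.10},
]
factor_k2 = local_screening_factor(spt, k=2)  # the first partition already covers 2 documents
factor_k3 = local_screening_factor(spt, k=3)  # the second partition is needed
```

Under these assumptions the returned value never exceeds the $k$-th largest score of the keyword, which is the property the text relies on to avoid missed results.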

The acceleration effect of the invention on the search process is illustrated below, taking as an example the method described in the paper "Hu Z, Dai H, Yang G, Yi X, Sheng W. Semantic-Based Multi-Keyword Ranked Search Schemes over Encrypted Cloud Data. Security and Communication Networks, 2022".

Assume the document set $D = \langle d_1, d_3, d_4, d_2, d_6, d_5 \rangle$, from which a tree index is constructed. Suppose the topic vector of the retrieval $Q$ is $V_Q = (0, 0.8, 0, 0.5)$, and the retrieval requires the two most relevant documents to be returned, i.e. $k = 2$.

FIG. 3 shows the search process when the screening factor is 0. Starting from the root node, the search passes through $r$, $r_2$ and $r_3$ to the first leaf node $d_1$; the semantic relevance score of $d_1$ and $V_Q$ is $\mathrm{relevance}(V_Q, d_1) = 0.56$, and $d_1$ is added to the result set $R$. The search then passes through $r_3$ to reach leaf node $d_3$, whose semantic relevance score is $\mathrm{relevance}(V_Q, d_3) = 0.48$, and $d_3$ is added to $R$; at this point the screening factor is updated to $t = 0.48$. Next, the nodes $d_4$ and $d_2$ under $r_4$ are pruned because $\mathrm{relevance}(V_Q, d_4) = 0.4 < t$ and $\mathrm{relevance}(V_Q, d_2) = 0.1 < t$. The search then proceeds through $r$ and $r_5$ to $d_6$; because $\mathrm{relevance}(V_Q, d_6) = 0.53 > t$, $d_6$ is added to $R$ and $R$ is sorted in descending order, and the screening factor is updated to $t = 0.53$. Since $\mathrm{relevance}(V_Q, d_5) = 0.4 < t$, node $d_5$ is pruned. The final search result is $R = \langle d_1, d_6 \rangle$.

FIG. 4 shows the search process when the screening factor is 0.51. Unlike the process above, when the search reaches $d_3$, $\mathrm{relevance}(V_Q, d_3) = 0.48 < t$, so node $d_3$ is pruned. When the search reaches $r_4$, $\mathrm{relevance}(V_Q, r_4) = 0.5 < t$, so this node and its subtree containing $d_4$ and $d_2$ are pruned. Comparing the two retrieval examples shows that the predetermined screening factor prunes more subtrees in advance, accelerating the retrieval process while leaving the retrieval result unchanged.
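The two runs can be replayed with a small Python simulation. This is a sketch, not the scheme of the cited paper: it works on plaintext scores, the tree shape (with $r_2$ as the parent of $r_3$ and $r_4$) is inferred from the traversal order in the text, and the scores of the internal nodes other than $r_4$ are assumed values chosen so that those nodes are never pruned. The names `Node`, `top_k_search` and `example_tree` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    score: float                          # relevance(V_Q, node)
    children: list = field(default_factory=list)

def top_k_search(root, k, t0):
    """Depth-first top-k search that prunes every node scoring below t."""
    state = {"t": t0, "results": [], "evaluated": 0}

    def visit(node):
        state["evaluated"] += 1
        if node.score < state["t"]:
            return                        # prune this node and its whole subtree
        if node.children:
            for child in node.children:
                visit(child)
            return
        # Leaf: keep only the k best documents seen so far.
        state["results"].append((node.score, node.name))
        state["results"].sort(reverse=True)
        del state["results"][k:]
        if len(state["results"]) == k:
            # Raise t to the k-th best score, never below the initial factor.
            state["t"] = max(state["t"], state["results"][-1][0])

    visit(root)
    return [name for _, name in state["results"]], state["evaluated"]

def example_tree():
    r3 = Node("r3", 0.60, [Node("d1", 0.56), Node("d3", 0.48)])  # 0.60 assumed
    r4 = Node("r4", 0.50, [Node("d4", 0.40), Node("d2", 0.10)])  # 0.50 from the text
    r2 = Node("r2", 0.60, [r3, r4])                              # 0.60 assumed
    r5 = Node("r5", 0.55, [Node("d6", 0.53), Node("d5", 0.40)])  # 0.55 assumed
    return Node("r", 0.60, [r2, r5])                             # 0.60 assumed

baseline = top_k_search(example_tree(), k=2, t0=0.0)   # FIG. 3: initial factor 0
seeded = top_k_search(example_tree(), k=2, t0=0.51)    # FIG. 4: initial factor 0.51
```

Both runs return the same two documents, but the seeded run evaluates fewer nodes because $r_4$ is cut off before its leaves are reached.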

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention further, and all equivalent variations made by using the contents of the present specification and the drawings are within the scope of the present invention.

Claims

1. A screening factor determination method supporting semantic perception ciphertext retrieval acceleration is characterized by comprising the following steps:

2. The method for determining the screening factor supporting the acceleration of the semantic perception ciphertext retrieval according to claim 1, wherein the step 1 specifically comprises:

Step 1a: use a semantic perception model to compute the semantic vector of each document $d_j$ in the document set $D = \{d_1, d_2, \ldots, d_j, \ldots, d_n\}$, where $j$ ranges from 1 to $n$; extract keywords from each document $d_j$ to generate the keyword set $W = \{w_1, w_2, \ldots, w_i, \ldots, w_m\}$, where $i$ ranges from 1 to $m$, and compute the semantic vector of each keyword $w_i$;

Step 1b: for each keyword $w_i \in W$, compute its semantic relevance $\mathrm{relevance}(w_i, d_j)$ with each document $d_j \in D$, establish the semantic relevance sequence $L_i$ of $w_i$ with the documents in $D$, and sort the sequence in descending order;

Each partition is represented as a two-tuple whose two elements are the upper boundary and the lower boundary of that partition.

3. The method for determining the screening factor supporting the semantic perception ciphertext retrieval acceleration according to claim 2, wherein the step 1c specifically comprises:

Step 1c1: for each $w_i$, arrange its relevance scores with the documents in $D$ in descending order to generate the semantic relevance sequence $L_i$; for each keyword $w_i$ in $W$, divide $L_i$ into equal-sized partitions according to the segmentation parameter $\tau$, constructing the semantic relevance partition sequence $SPT_i$ of $w_i$, which contains $\lceil n/\tau \rceil$ partitions;

Step 1c2: in the partition sequence $SPT_i$, each of the first $\lceil n/\tau \rceil - 1$ partitions contains $\tau$ relevance scores and the last partition contains at most $\tau$; for any two adjacent partitions, any relevance score in the earlier partition is greater than any relevance score in the later partition;

Step 1c3: for each partition in $SPT_i$, construct its two-tuple by computing the upper boundary and the lower boundary of the partition.

4. The method for determining the screening factor supporting semantic perception ciphertext retrieval acceleration according to claim 3, wherein the step 1c3 specifically comprises:

for each partition of the $SPT_i$ corresponding to $w_i$, the upper and lower boundaries of its two-tuple are computed as follows [formulas not reproduced in the source text].

5. The method for determining the screening factor supporting semantic perception ciphertext retrieval acceleration according to claim 1, wherein the step 2 specifically comprises:

if $U_x$ satisfies the specified condition [formula not reproduced in the source text], the quantity determined by that condition [symbol not reproduced in the source text] is the local retrieval screening factor corresponding to $w_n$;

Step 2b: from the local retrieval screening factors of all keywords $w_n$ in the retrieval set $Q$, calculate the final screening factor $t$ [formula not reproduced in the source text];

6. The method for determining the screening factor supporting semantic perception ciphertext retrieval acceleration according to claim 5, wherein the step 2a specifically comprises:

for each keyword $w_n$, $U_x$ is calculated as follows: [formula not reproduced in the source text].