CN115757435A - Screening factor determination method supporting semantic perception ciphertext retrieval acceleration - Google Patents

Publication number
CN115757435A
Authority
CN
China
Prior art keywords
semantic
keyword
sequence
document
relevance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211579597.0A
Other languages
Chinese (zh)
Inventor
戴华
刘源龙
周倩
邓寅甫
陈燕俐
杨庚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202211579597.0A
Publication of CN115757435A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of information retrieval and discloses a screening factor determination method supporting semantic perception ciphertext retrieval acceleration. In the first stage, a semantic relevance partition sequence between each keyword and the documents is established: a semantic perception model is used to compute the semantic vector of each document; keywords are extracted and the semantic vector of each keyword is computed; the semantic relevance between each keyword and each document is computed to form a semantic relevance sequence, which is sorted in descending order; the sequence is then partitioned, generating for each keyword a semantic relevance partition sequence over the documents. In the second stage, the screening factor is calculated and determined from the search keywords using the semantic relevance partition sequences. The method is suited to tree-index-based application scenarios in ciphertext retrieval supporting semantic perception; it can markedly improve retrieval speed without affecting the accuracy of the search results.

Description

Screening factor determination method supporting semantic perception ciphertext retrieval acceleration
Technical Field
The invention belongs to the technical field of information retrieval, and particularly relates to a screening factor determination method supporting semantic perception ciphertext retrieval acceleration.
Background
With the continuous development of internet technology and the growing number of software users, data volumes have become enormous, and local data storage can no longer meet growing business requirements. To address this, people have turned to outsourcing data to cloud servers, where users can consume computing resources on demand. In short, cloud computing uses the transmission capability of the internet to move data from local servers to the internet and process it there. Although cloud computing has many advantages, it also raises problems such as data privacy. The most common and direct way to protect the privacy of outsourced data is to encrypt the data before sending it to the cloud server. However, encryption reduces the usability of the data and makes basic operations such as retrieval difficult. Encryption also weakens the semantics of the data, making it hard to discover the semantic relation between the data and a query. Many searchable encryption methods have therefore been proposed that retrieve data on the cloud server efficiently and accurately while preserving the privacy of the outsourced data.
In recent years, the searchable encryption methods proposed by researchers have mainly adopted tree-structured indexes to perform ranked retrieval over encrypted documents. These methods construct a structurally simple and inherently secure tree index and find the top-k most relevant encrypted documents through depth-first search. For example, the paper "Xia Z, Wang X, Sun X, et al. A Secure and Dynamic Multi-Keyword Ranked Search Scheme over Encrypted Cloud Data. IEEE Transactions on Parallel and Distributed Systems, 2015" uses a balanced binary tree index; "Dai H, Dai X, Yi X, et al. Semantic-Aware Multi-Keyword Ranked Search Scheme over Encrypted Cloud Data. Journal of Network and Computer Applications, 2019" uses a full binary tree index containing semantic feature information; and "Hu Z, Dai H, Yang G, Yi X, Sheng W. Semantic-Based Multi-Keyword Ranked Search Schemes over Encrypted Cloud Data. Security and Communication Networks, 2022" uses a tree index carrying latent semantic information.
The general approach of searchable encryption is to convert documents and keywords into vector representations, store the documents in a tree index, encrypt the documents and the index, and send both to the cloud server. After the user submits a search to the cloud server, the server searches the encrypted tree index and returns the ciphertexts the user requires, which the user then decrypts. Existing tree-index-based retrieval methods usually use depth-first search: during the search, the retrieval screening factor starts at 0 and is updated according to the leaf nodes traversed, and subtrees that cannot satisfy the requirement are pruned using the screening factor, which accelerates retrieval. However, in existing tree-index-based methods such as the three papers above, the initial screening factor is always set to 0. If an appropriate screening factor could be determined before the search starts, more unqualified subtrees could be filtered out early in the search, accelerating the depth-first search.
Disclosure of Invention
In order to solve the technical problem, the invention provides a screening factor determination method supporting semantic perception ciphertext retrieval acceleration, which can improve the retrieval efficiency under the condition of not influencing the retrieval result precision.
The invention relates to a screening factor determination method supporting semantic perception ciphertext retrieval acceleration, which comprises the following steps:
Step 1: for each keyword, establish a semantic relevance partition sequence between that keyword and the documents;
Step 2: according to the search keywords, use the semantic relevance partition sequences to calculate the screening factor.
Further, step 1 specifically comprises:
Step 1a: use a semantic perception model to compute the semantic vector of each document $d_j$ in the document set $D = \{d_1, d_2, \ldots, d_j, \ldots, d_n\}$, where $j$ ranges from 1 to $n$; extract keywords from each document $d_j$ to generate the keyword set $W = \{w_1, w_2, \ldots, w_i, \ldots, w_m\}$, where $i$ ranges from 1 to $m$, and compute the semantic vector of each keyword $w_i$;
Step 1b: for each keyword $w_i \in W$, compute its semantic relevance $relevance(w_i, d_j)$ to each document $d_j \in D$, build the semantic relevance sequence $L_i$ of $w_i$ over the documents in $D$, and sort the sequence in descending order;
Step 1c: according to the semantic relevance sequence $L_i$ of $w_i$ and a given partition parameter $\tau$, divide the sequence into equal-sized partitions to generate the semantic relevance partition sequence of $w_i$ and the documents
Figure BDA0003990144150000021
Each partition is represented as a doublet
Figure BDA0003990144150000022
where
Figure BDA0003990144150000023
and
Figure BDA0003990144150000024
represent the upper and lower boundaries of this partition.
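Steps 1b and 1c can be sketched in Python as follows. This is only an illustrative sketch: the text does not fix a relevance measure, so cosine similarity between semantic vectors is an assumption here, and the names `doc_vecs`, `relevance_sequence`, and `partition_sequence` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity, used here as an assumed stand-in for relevance(w_i, d_j)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def relevance_sequence(keyword_vec, doc_vecs):
    """Step 1b: (doc_id, score) pairs for one keyword, sorted by descending score."""
    scored = [(doc_id, cosine(keyword_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def partition_sequence(sequence, tau):
    """Step 1c: consecutive partitions of tau entries; the last one may be shorter."""
    return [sequence[i:i + tau] for i in range(0, len(sequence), tau)]

# Hypothetical two-dimensional semantic vectors for five documents.
doc_vecs = {
    "d1": [0.9, 0.1],
    "d2": [0.1, 0.9],
    "d3": [0.7, 0.3],
    "d4": [0.5, 0.5],
    "d5": [0.2, 0.8],
}
L_i = relevance_sequence([1.0, 0.0], doc_vecs)   # sequence for one keyword w_i
SPT_i = partition_sequence(L_i, tau=2)           # partition sequence for w_i
```

With five documents and tau = 2, the sketch yields three partitions, the last holding a single entry, which matches the description that only the final partition may contain fewer than tau scores.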
Further, step 1c specifically includes:
Step 1c1: for each $w_i$, arrange its relevance scores with the documents in $D$ in descending order to generate the semantic relevance sequence $L_i$; for each keyword $w_i$ in $W$, divide $L_i$ into equal-sized partitions according to the partition parameter $\tau$, constructing for $w_i$ a semantic relevance partition sequence containing
Figure BDA0003990144150000031
partitions;
Step 1c2: in the partition sequence
Figure BDA0003990144150000032
the first
Figure BDA0003990144150000033
partitions each contain $\tau$ relevance scores, the last partition contains at most $\tau$, and for any two adjacent partitions
Figure BDA0003990144150000034
and
Figure BDA0003990144150000035
each relevance score in
Figure BDA0003990144150000036
is greater than every relevance score in
Figure BDA0003990144150000037
Step 1c3: for each partition
Figure BDA0003990144150000038
in $SPT_i$, construct the doublet
Figure BDA0003990144150000039
by computing, for each partition,
Figure BDA00039901441500000310
and
Figure BDA00039901441500000311
Further, step 1c3 specifically includes:
for each partition of the $SPT_i$ corresponding to $w_i$, the components of its doublet,
Figure BDA00039901441500000312
and
Figure BDA00039901441500000313
are computed as follows, where rand(x, y) denotes a random value between x and y, min(X) denotes the minimum element of the set X, and max(X) denotes the maximum element of the set X:
Figure BDA00039901441500000314
Figure BDA00039901441500000315
Figure BDA00039901441500000316
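The exact boundary formulas are contained in the figures and are not reproduced in this text, so the following Python sketch is an assumption built only from the stated ingredients (rand, min, max): each boundary is drawn at random from the gap between a partition and its neighbour. The caps 1.0 and 0.0 for the first and last partitions, and the name `partition_bounds`, are hypothetical.

```python
import random

def partition_bounds(partitions, rng=random):
    """Return one (upper, lower) doublet per partition.

    partitions: list of lists of relevance scores, globally in descending order.
    Assumption: scores lie in [0, 1]; the first partition's upper boundary is
    capped at 1.0 and the last partition's lower boundary is floored at 0.0.
    """
    bounds = []
    for x, part in enumerate(partitions):
        # Upper boundary: random value between this partition's maximum and
        # the previous partition's minimum (or the assumed cap 1.0).
        upper_cap = min(partitions[x - 1]) if x > 0 else 1.0
        # Lower boundary: random value between the next partition's maximum
        # (or the assumed floor 0.0) and this partition's minimum.
        lower_floor = max(partitions[x + 1]) if x + 1 < len(partitions) else 0.0
        upper = rng.uniform(max(part), upper_cap)
        lower = rng.uniform(lower_floor, min(part))
        bounds.append((upper, lower))
    return bounds

example = [[0.9, 0.8], [0.6, 0.5], [0.2]]
doublets = partition_bounds(example, rng=random.Random(0))
```

Under this assumption a lower boundary never exceeds any score in its own partition and does not equal a real score in general, which is consistent with the stated aims of not missing results and not revealing exact relevance scores.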
Further, step 2 specifically comprises:
Step 2a: let $Q$ be the set of search keywords submitted by the user and $k$ the number of documents the user requires; for each search keyword $w_n$ in $Q$, where $n$ ranges from 1 to $|Q|$, compute the union $U_x$ of the document identifier sets in the first $x$ semantic relevance partitions
Figure BDA00039901441500000317
If $U_x$ satisfies the condition in the following formula, then
Figure BDA00039901441500000318
is the local retrieval screening factor corresponding to $w_n$;
Figure BDA00039901441500000319
Step 2b: over all search keywords $w_n$ in the set $Q$, compute the final screening factor $t$ according to the following formula;
Figure BDA00039901441500000320
Further, step 2a specifically includes:
for each keyword $w_n$, $U_x$ is computed as follows:
Figure BDA0003990144150000041
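Step 2a can be sketched in Python for a single keyword. The condition on $U_x$ is given only in a figure, so this sketch assumes it means "the first x partitions together cover at least k documents"; the function name, the sample partitions, and the boundary values are hypothetical. The combination rule of step 2b is likewise only in a figure and is not sketched.

```python
def local_screening_factor(partition_doc_ids, lower_bounds, k):
    """Return the lower boundary of the first partition x whose union U_x of
    document identifiers reaches size k (assumed reading of the condition).

    partition_doc_ids: list of lists of document ids, best partition first.
    lower_bounds: lower boundary of each partition, in the same order.
    """
    union = set()
    for doc_ids, lower in zip(partition_doc_ids, lower_bounds):
        union.update(doc_ids)          # U_x = U_(x-1) plus partition x
        if len(union) >= k:
            return lower
    return 0.0  # fewer than k documents overall: fall back to no filtering

# Hypothetical partitions (tau = 2) and their lower boundaries.
parts = [["d1", "d6"], ["d3", "d4"], ["d5", "d2"]]
lowers = [0.51, 0.40, 0.05]
factor_k2 = local_screening_factor(parts, lowers, k=2)
factor_k3 = local_screening_factor(parts, lowers, k=3)
```

Because the returned value is a partition's lower boundary, at least k documents score no lower than it for that keyword, which is the property the description relies on to avoid missed results.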
The beneficial effects of the invention are as follows: 1. with the retrieval screening factor determination method, more subtrees that do not meet the requirements can be filtered out, markedly accelerating the encrypted search process;
2. the invention determines the retrieval screening factor from the semantic relevance partition sequences; the screening factor does not reveal the relevance score between any document and any keyword, and because it is smaller than the relevance score of the last document in the candidate result set, no result is missed; the invention therefore accelerates retrieval while guaranteeing that the retrieval result is unchanged;
3. the invention supports tree-index-based ciphertext retrieval scenarios with semantic perception and does not depend on any specific method of quantifying keyword-document relevance; it can be used with any semantic-perception-based relevance measure (e.g., the LDA model or the BERT model) and is therefore highly general.
Drawings
FIG. 1 is a flow chart of a search screening factor determination method according to the present invention;
FIG. 2 is a schematic diagram of a semantic relatedness partitioning sequence generated by the present invention;
FIG. 3 is a diagram illustrating an exemplary search process with a search screening factor of 0 according to the present invention;
FIG. 4 is a diagram illustrating an exemplary search process with a search screening factor of 0.51 according to the present invention.
Detailed Description
To make the technical means of the invention easier to understand, the invention is described in more detail below with reference to the embodiments illustrated in the accompanying drawings.
For convenience of description, the following definitions are now made for the relevant symbols:
Document set $D = \{d_1, d_2, \ldots, d_n\}$; the words contained in the documents form the keyword set $W = \{w_1, w_2, \ldots, w_m\}$; $Q$ is the set of search keywords submitted by the user, and $k$ is the number of documents to be returned by the search; $relevance(w_i, d_j)$ denotes the single-keyword, single-document semantic relevance score between keyword $w_i$ and document $d_j$; $SPT_i$ is the semantic relevance partition sequence of $w_i$ and the documents;
Figure BDA0003990144150000042
and
Figure BDA0003990144150000043
represent the upper and lower boundaries of the $x$-th partition of $w_i$.
FIG. 1 is a flow chart of the invention, depicting the screening factor calculation method supporting semantic perception ciphertext retrieval acceleration: compute the semantic vector of each document using a semantic perception model; extract keywords from the documents and compute the semantic vector of each keyword; compute the semantic relevance between each keyword and each document to form a semantic relevance sequence, and sort the sequence in descending order; partition the sequence, generating for each keyword a semantic relevance partition sequence over the documents; and, according to the search keywords, calculate and determine the screening factor using the semantic relevance partition sequences.
The screening factor determination method supporting semantic perception ciphertext retrieval acceleration comprises two stages: (1) constructing the semantic relevance partition sequences; (2) calculating and determining the screening factor.
Stage one: for each keyword, construct a semantic relevance partition sequence between that keyword and the documents.
The specific steps are as follows:
Step 1a: use a semantic perception model to compute the semantic vector of each document $d_j$; extract keywords from the documents to generate the keyword set $W$, and compute the semantic vector of each keyword $w_i$;
Step 1b: for each keyword $w_i \in W$, compute its semantic relevance $relevance(w_i, d_j)$ to each document $d_j \in D$, build the semantic relevance sequence $L_i$ of $w_i$ over the documents in $D$, and sort the sequence in descending order;
Step 1c: according to the semantic relevance sequence $L_i$ of $w_i$ and a given partition parameter $\tau$, divide the sequence into equal-sized partitions to generate the semantic relevance partition sequence of $w_i$ and the documents
Figure BDA0003990144150000051
Each partition is represented as a doublet
Figure BDA0003990144150000052
where
Figure BDA0003990144150000053
and
Figure BDA0003990144150000054
represent the upper and lower boundaries of this partition. The generated semantic relevance partition sequence $SPT_i$ is shown in FIG. 2; the specific generation steps are as follows:
Step 1c1: for each $w_i$, arrange its relevance scores with the documents in $D$ in descending order to generate the semantic relevance sequence $L_i$; for each keyword $w_i$ in $W$, divide $L_i$ into equal-sized partitions according to the partition parameter $\tau$, constructing for $w_i$ a semantic relevance partition sequence containing
Figure BDA0003990144150000055
partitions;
Step 1c2: in the partition sequence
Figure BDA0003990144150000056
the first
Figure BDA0003990144150000057
partitions each contain $\tau$ relevance scores, the last partition contains at most $\tau$ documents, and for any two adjacent partitions
Figure BDA0003990144150000058
and
Figure BDA0003990144150000059
each relevance score in
Figure BDA00039901441500000510
is greater than every relevance score in
Figure BDA00039901441500000511
Step 1c3: for each partition
Figure BDA00039901441500000512
in $SPT_i$, construct the doublet
Figure BDA00039901441500000513
by computing, for each partition,
Figure BDA00039901441500000514
and
Figure BDA00039901441500000515
The calculation is as follows, where rand(x, y) denotes a random value between x and y, min(X) denotes the minimum element of the set X, and max(X) denotes the maximum element of the set X:
Figure BDA0003990144150000061
Figure BDA0003990144150000062
Figure BDA0003990144150000063
Stage two: according to the search keywords, use the semantic relevance partition sequences to calculate the screening factor:
Step 2a: let $Q$ be the set of search keywords submitted by the user and $k$ the number of documents the user requires; for each search keyword $w_n$ in $Q$, where $n$ ranges from 1 to $|Q|$, compute the union $U_x$ of the document identifier sets in the first $x$ semantic relevance partitions
Figure BDA0003990144150000064
For each keyword $w_n$, $U_x$ is computed as follows:
Figure BDA0003990144150000065
If $U_x$ satisfies the condition in the following formula, then
Figure BDA0003990144150000066
is the local retrieval screening factor corresponding to $w_n$;
Figure BDA0003990144150000067
Step 2b: over all search keywords $w_n$ in the set $Q$, compute the final screening factor $t$ according to the following formula.
Figure BDA0003990144150000068
The acceleration effect of the invention on the search process is illustrated below, taking as an example the method described in the paper "Hu Z, Dai H, Yang G, Yi X, Sheng W. Semantic-Based Multi-Keyword Ranked Search Schemes over Encrypted Cloud Data. Security and Communication Networks, 2022."
Assume the document set $D = \langle d_1, d_3, d_4, d_2, d_6, d_5 \rangle$, from which a tree index is constructed; assume the topic vector of the query $Q$ is $V_Q = (0, 0.8, 0, 0.5)$, and the search must return the two most relevant documents, i.e. $k = 2$.
FIG. 3 shows the search process with a screening factor of 0. Starting from the root node, the search passes through $r$, $r_2$, $r_3$ to the first leaf node $d_1$; the semantic relevance score between $d_1$ and $V_Q$ is $relevance(V_Q, d_1) = 0.56$, and $d_1$ is added to the result set $R$. The search then passes through $r_3$ to reach leaf node $d_3$, whose semantic relevance score is $relevance(V_Q, d_3) = 0.48$, and $d_3$ is added to $R$; at this point the screening factor is updated to $t = 0.48$. Next, the nodes $d_4$ and $d_2$ under $r_4$ are pruned because $relevance(V_Q, d_4) = 0.4 < t$ and $relevance(V_Q, d_2) = 0.1 < t$. The search then passes through $r$, $r_5$ to $d_6$; because $relevance(V_Q, d_6) = 0.53 > t$, $d_6$ is added to $R$ and $R$ is sorted in descending order. At this point the screening factor is updated to $t = 0.53$. Because $relevance(V_Q, d_5) = 0.4 < t$, node $d_5$ is pruned. Finally, the search result is $R = \langle d_1, d_6 \rangle$.
FIG. 4 shows the search process when the screening factor is 0.51. Unlike the process above, when the search reaches $d_3$, $relevance(V_Q, d_3) = 0.48 < t$, so node $d_3$ is pruned. When the search reaches $r_4$, $relevance(V_Q, r_4) = 0.5 < t$, so that node and its subtree ($d_4$ and $d_2$) are pruned. Comparing the two examples shows that the screening factor prunes more subtrees early, which accelerates retrieval while leaving the retrieval result unchanged.
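The comparison between FIG. 3 and FIG. 4 can be replayed with a small Python sketch of depth-first top-k search with pruning. The leaf scores and the score 0.5 for $r_4$ come from the example; the scores assigned to the other internal nodes ($r$, $r_2$, $r_3$, $r_5$) are assumptions taken as the best score beneath each node, and the `Node` class and `top_k_search` function are hypothetical names.

```python
class Node:
    def __init__(self, name, score, children=()):
        self.name = name
        self.score = score            # relevance of this node to the query vector
        self.children = list(children)

def top_k_search(root, k, initial_factor=0.0):
    """Depth-first search that prunes any node scoring below the screening factor."""
    results = []    # (score, name) pairs, best first, at most k entries
    examined = []   # every node whose score was compared with the factor
    factor = initial_factor

    def visit(node):
        nonlocal factor
        examined.append(node.name)
        if node.score < factor:
            return                    # prune this node and its whole subtree
        if node.children:
            for child in node.children:
                visit(child)
            return
        results.append((node.score, node.name))
        results.sort(reverse=True)
        del results[k:]               # keep only the k best documents
        if len(results) == k:
            factor = max(factor, results[-1][0])

    visit(root)
    return [name for _, name in results], examined

def build_tree():
    r3 = Node("r3", 0.56, [Node("d1", 0.56), Node("d3", 0.48)])   # score assumed
    r4 = Node("r4", 0.50, [Node("d4", 0.40), Node("d2", 0.10)])   # score from the text
    r5 = Node("r5", 0.53, [Node("d6", 0.53), Node("d5", 0.40)])   # score assumed
    r2 = Node("r2", 0.56, [r3, r4])                               # score assumed
    return Node("r", 0.56, [r2, r5])                              # score assumed

result_zero, examined_zero = top_k_search(build_tree(), k=2, initial_factor=0.0)
result_preset, examined_preset = top_k_search(build_tree(), k=2, initial_factor=0.51)
```

Under these assumptions both runs return $d_1$ and $d_6$, while the run that starts from 0.51 never examines $d_4$ or $d_2$ because $r_4$ is pruned as a whole.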
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention further, and all equivalent variations made by using the contents of the present specification and the drawings are within the scope of the present invention.

Claims (6)

1. A screening factor determination method supporting semantic perception ciphertext retrieval acceleration is characterized by comprising the following steps:
Step 1: for each keyword, establish a semantic relevance partition sequence between that keyword and the documents;
Step 2: according to the search keywords, use the semantic relevance partition sequences to calculate the screening factor.
2. The method for determining the screening factor supporting the acceleration of the semantic perception ciphertext retrieval according to claim 1, wherein the step 1 specifically comprises:
Step 1a: use a semantic perception model to compute the semantic vector of each document $d_j$ in the document set $D = \{d_1, d_2, \ldots, d_j, \ldots, d_n\}$, where $j$ ranges from 1 to $n$; extract keywords from each document $d_j$ to generate the keyword set $W = \{w_1, w_2, \ldots, w_i, \ldots, w_m\}$, where $i$ ranges from 1 to $m$, and compute the semantic vector of each keyword $w_i$;
Step 1b: for each keyword $w_i \in W$, compute its semantic relevance $relevance(w_i, d_j)$ to each document $d_j \in D$, build the semantic relevance sequence $L_i$ of $w_i$ over the documents in $D$, and sort the sequence in descending order;
Step 1c: according to the semantic relevance sequence $L_i$ of $w_i$ and a given partition parameter $\tau$, divide the sequence into equal-sized partitions to generate the semantic relevance partition sequence of $w_i$ and the documents
Figure FDA0003990144140000011
Each partition is represented as a doublet
Figure FDA0003990144140000012
where
Figure FDA0003990144140000013
and
Figure FDA0003990144140000014
represent the upper and lower boundaries of this partition.
3. The method for determining the screening factor supporting the semantic perception ciphertext retrieval acceleration according to claim 2, wherein the step 1c specifically comprises:
Step 1c1: for each $w_i$, arrange its relevance scores with the documents in $D$ in descending order to generate the semantic relevance sequence $L_i$; for each keyword $w_i$ in $W$, divide $L_i$ into equal-sized partitions according to the partition parameter $\tau$, constructing for $w_i$ a semantic relevance partition sequence containing
Figure FDA0003990144140000015
partitions;
Step 1c2: in the partition sequence
Figure FDA0003990144140000016
the first
Figure FDA0003990144140000017
partitions each contain $\tau$ relevance scores, the last partition contains at most $\tau$ documents, and for any two adjacent partitions
Figure FDA0003990144140000018
and
Figure FDA0003990144140000019
each relevance score in
Figure FDA00039901441400000110
is greater than every relevance score in
Figure FDA00039901441400000111
Step 1c3: for each partition
Figure FDA00039901441400000112
in $SPT_i$, construct the doublet
Figure FDA0003990144140000021
by computing, for each partition,
Figure FDA0003990144140000022
and
Figure FDA0003990144140000023
4. The method for determining the screening factor supporting semantic perception ciphertext retrieval acceleration according to claim 3, wherein the step 1c3 specifically comprises:
for each partition of the $SPT_i$ corresponding to $w_i$, the components of its doublet,
Figure FDA0003990144140000024
and
Figure FDA0003990144140000025
are computed as follows, where rand(x, y) denotes a random value between x and y, min(X) denotes the minimum element of the set X, and max(X) denotes the maximum element of the set X:
Figure FDA0003990144140000026
Figure FDA0003990144140000027
Figure FDA0003990144140000028
5. The method for determining the screening factor supporting the semantic perception ciphertext retrieval acceleration according to claim 1, wherein the step 2 specifically comprises:
Step 2a: let $Q$ be the set of search keywords submitted by the user and $k$ the number of documents the user requires; for each search keyword $w_n$ in $Q$, where $n$ ranges from 1 to $|Q|$, compute the union $U_x$ of the document identifier sets in the first $x$ semantic relevance partitions
Figure FDA0003990144140000029
If $U_x$ satisfies the condition in the following formula, then
Figure FDA00039901441400000210
is the local retrieval screening factor corresponding to $w_n$;
Figure FDA00039901441400000211
Step 2b: over all search keywords $w_n$ in the set $Q$, compute the final screening factor $t$ according to the following formula;
Figure FDA00039901441400000212
6. The method for determining the screening factor supporting the semantic perception ciphertext retrieval acceleration according to claim 5, wherein the step 2a specifically comprises:
for each keyword $w_n$, $U_x$ is computed as follows:
Figure FDA00039901441400000213
CN202211579597.0A 2022-12-09 2022-12-09 Screening factor determination method supporting semantic perception ciphertext retrieval acceleration Pending CN115757435A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211579597.0A CN115757435A (en) 2022-12-09 2022-12-09 Screening factor determination method supporting semantic perception ciphertext retrieval acceleration

Publications (1)

Publication Number Publication Date
CN115757435A true CN115757435A (en) 2023-03-07

Family

ID=85346670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211579597.0A Pending CN115757435A (en) 2022-12-09 2022-12-09 Screening factor determination method supporting semantic perception ciphertext retrieval acceleration

Country Status (1)

Country Link
CN (1) CN115757435A (en)

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination