CN109710607A

CN109710607A - A weight-solving-based hash query method for high-dimensional big data

Info

Publication number: CN109710607A
Application number: CN201811317132.1A
Authority: CN
Inventors: 孙瑶; 钱江波; 胡伟; 任艳多
Original assignee: Ningbo University
Current assignee: Dragon Totem Technology Hefei Co ltd; Shanghai Haonashi Network Technology Co.,Ltd.
Priority date: 2018-11-07
Filing date: 2018-11-07
Publication date: 2019-05-03
Anticipated expiration: 2038-11-07
Also published as: CN109710607B

Abstract

The invention discloses a weight-solving-based hash query method for high-dimensional big data. The method first uses principal component analysis to reduce the dimensionality of both the original high-dimensional data set and the given query data. It then constructs a loss function and, by minimizing that function iteratively, obtains a final binary code matrix and a final weight matrix. Each element of the final binary code matrix is weight-quantized according to the final weight matrix, yielding a weighted binary code matrix, and the binary code corresponding to the given query data is obtained. Finally, the original high-dimensional data corresponding to the row vector of the weighted binary code matrix that has the smallest weighted Hamming distance to the query's binary code is taken as the final nearest-neighbor query result. The advantage is that replacing the Hamming distance with the weighted Hamming distance greatly improves the accuracy and efficiency of querying the given query data.

Description

A weight-solving-based hash query method for high-dimensional big data

Technical field

The present invention relates to hash query methods, and in particular to a weight-solving-based hash query method for high-dimensional big data.

Background art

Approximate nearest-neighbor search is a fundamental problem in computer science. In general, hashing is an effective way to handle queries over large-scale high-dimensional data. In the related art, when a data set is hash-encoded, the hashing technique does not consider the weight of each dimension; that is, every dimension is assumed to carry equal weight. However, different dimensions of a hash code influence the similarity between data items to different degrees. Because the related art does not fully account for the weights of the hash code, it leaves room for improvement.
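The difference between treating all code dimensions equally and weighting them can be illustrated with a small sketch. It assumes one common definition of the weighted Hamming distance (the sum of the weights at the positions where two codes differ); the function names and the weight values are hypothetical and are not taken from the patent.

```python
from typing import Sequence


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions where two {-1, +1} codes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def weighted_hamming_distance(
    a: Sequence[int], b: Sequence[int], weights: Sequence[float]
) -> float:
    """Sum the weights of the positions where two codes differ (assumed definition)."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


query = [1, -1, 1, 1]
code_a = [-1, -1, 1, 1]  # differs from the query in dimension 0
code_b = [1, -1, 1, -1]  # differs from the query in dimension 3
weights = [0.9, 0.5, 0.3, 0.1]  # hypothetical per-dimension weights

# Plain Hamming distance cannot separate the two candidates.
plain_a = hamming_distance(query, code_a)
plain_b = hamming_distance(query, code_b)

# The weighted distance ranks code_b closer, because it differs
# only in a low-weight dimension.
weighted_a = weighted_hamming_distance(query, code_a, weights)
weighted_b = weighted_hamming_distance(query, code_b, weights)
```

With equal weights both candidates tie at distance 1; with the hypothetical weights above, the candidate that disagrees only in a low-weight dimension is ranked closer.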

Summary of the invention

The technical problem to be solved by the invention is to provide a weight-solving-based hash query method for high-dimensional big data that weight-quantizes the codes according to the characteristics of the data set to obtain the corresponding hash codes, thereby reducing data storage space and computation cost and improving query accuracy.

The technical solution adopted by the invention to solve this problem is a weight-solving-based hash query method for high-dimensional big data, comprising the following steps:

1. Obtain the original high-dimensional data set X, consisting of n original high-dimensional data items, and the given query data q, where X is an n × d matrix. Reduce the dimensionality of X using principal component analysis to obtain the low-dimensional vector set V corresponding to X, V = (v_ij), i = 1, …, n, j = 1, …, c, where V is an n × c matrix and v_ij denotes the element of V corresponding to the j-th dimension of the i-th original high-dimensional data item. Then reduce the dimensionality of q using principal component analysis to obtain the low-dimensional vector q' corresponding to q;

2. Obtain the final binary code matrix B'' and the final weight matrix W'' by iteration, as follows:

2-1. Set the maximum number of iterations. Randomly initialize the binary code matrix B, B ∈ {-1, 1}^(n×c), and randomly initialize the weight matrix W, W = diag(w₁, w₂, …, w_j, …, w_c), where w_j denotes the dimension weight of the j-th dimension;

2-2. Start the iterative process. In the current iteration, first hold W fixed and update B by minimizing the loss function [formula not reproduced]; the B that minimizes it is denoted B'. Here ||·||_F denotes the Frobenius norm of a matrix, the superscript 2 in the loss function denotes squaring, and b_ij denotes the binary code value corresponding to the j-th dimension of the i-th original high-dimensional data item;

2-3. Update the dimension weights of all dimensions in the current iteration. The weight w_j is updated as follows: denote the j-th column vector of B' by β_j and the j-th column vector of V by γ_j, so that B' = {β₁, β₂, …, β_j, …, β_c} and V = {γ₁, γ₂, …, γ_j, …, γ_c}. Holding B' fixed, update w_j by minimizing the per-dimension objective [formula not reproduced], where ||·||₂ denotes the 2-norm; the w_j that minimizes it is denoted w_j'. The weight matrix obtained by arranging, in order, the dimension weights of all dimensions updated in the current iteration is denoted W';

2-4. Check whether the number of iterations has reached the set maximum. If not, set W = W' and B = B', return to step 2-2 to begin the next iteration, and increment the iteration count by 1, where "=" in W = W' and B = B' denotes assignment. If the maximum has been reached, take the W' obtained in the current iteration as the final weight matrix W'' and the B' obtained in the current iteration as the final binary code matrix B'';

3. Weight-quantize each element of B'' according to W'' to obtain the weighted binary code matrix Z;

4. According to W'' and B'', obtain the binary code q'' corresponding to q' as the code that minimizes the query objective [formula not reproduced]. Search Z for the row vector with the smallest weighted Hamming distance to q'', and take the original high-dimensional data corresponding to that row vector as the final nearest-neighbor query result, completing the hash query for q.

The maximum number of iterations set in step 2-1 is 50.
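The alternating updates of steps 2-1 to 2-4 can be sketched as follows. Because the loss function itself is not reproduced in this text, the sketch assumes the loss ||V − BW||_F² with diagonal W, which is only one reading consistent with the surrounding definitions; under that assumption both sub-problems have closed-form solutions. The function name `learn_codes_and_weights` and the toy matrix are hypothetical.

```python
import random
from typing import List, Tuple


def sign(x: float) -> int:
    """Map a real number to {-1, +1}, sending zero to +1."""
    return 1 if x >= 0 else -1


def learn_codes_and_weights(
    V: List[List[float]], max_iter: int = 50, seed: int = 0
) -> Tuple[List[List[int]], List[float]]:
    """Alternate between updating B (W fixed) and W (B fixed).

    ASSUMPTION: the loss is ||V - B W||_F^2 with W = diag(w_1, ..., w_c).
    """
    rng = random.Random(seed)
    n, c = len(V), len(V[0])
    # Step 2-1: random initial binary codes and random positive weights.
    B = [[rng.choice((-1, 1)) for _ in range(c)] for _ in range(n)]
    w = [rng.uniform(0.1, 1.0) for _ in range(c)]
    for _ in range(max_iter):
        # Step 2-2: with W fixed, each b_ij minimizes (v_ij - b_ij * w_j)^2
        # over {-1, +1}, which gives b_ij = sign(v_ij * w_j).
        B = [[sign(V[i][j] * w[j]) for j in range(c)] for i in range(n)]
        # Step 2-3: with B fixed, w_j minimizes ||gamma_j - w_j * beta_j||_2^2,
        # so w_j = <beta_j, gamma_j> / <beta_j, beta_j>, and <beta_j, beta_j> = n.
        w = [sum(B[i][j] * V[i][j] for i in range(n)) / n for j in range(c)]
        # Step 2-4: the loop bound plays the role of the iteration counter.
    return B, w


# Toy low-dimensional vectors (n = 3, c = 2), purely illustrative.
V_toy = [[0.5, -2.0], [-1.5, 1.0], [1.0, 3.0]]
B_final, w_final = learn_codes_and_weights(V_toy, max_iter=50)
```

Under the assumed loss the iteration settles after the first pass: the codes become the signs of V and each weight becomes the mean absolute value of the corresponding column of V. A different loss would lead to different update rules.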

Compared with the prior art, the invention has the following advantages. Principal component analysis is first used to reduce the dimensionality of both the original high-dimensional data set and the given query data. A loss function is then constructed from the low-dimensional vectors according to the pairwise similarity-preserving principle, and the final binary code matrix and final weight matrix are obtained by minimizing this function iteratively. Each element of the final binary code matrix is then weight-quantized according to the final weight matrix to obtain the weighted binary code matrix. From the final binary code matrix and the final weight matrix, the binary code corresponding to the given query data is obtained. Finally, the weighted binary code matrix is searched for the row vector with the smallest weighted Hamming distance to that binary code, and the original high-dimensional data corresponding to that row vector is taken as the final nearest-neighbor query result, completing the hash query for the given query data. Replacing the Hamming distance with the weighted Hamming distance better exploits the information in the data set and preserves the similarity information between data items, so that the accuracy and efficiency of querying the given query data are greatly improved.
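The query stage described above can be sketched in the same spirit. The sketch assumes that the query code is chosen to minimize ||q' − q''W''||₂² (which gives the sign of each weighted coordinate) and that the weighted Hamming distance sums the absolute weights of the differing positions; neither formula is reproduced in this text, so both are assumptions, and all names and values are hypothetical.

```python
from typing import List, Sequence


def encode_query(q_low: Sequence[float], w: Sequence[float]) -> List[int]:
    """Binary code for a reduced query under the assumed objective."""
    return [1 if x * wj >= 0 else -1 for x, wj in zip(q_low, w)]


def weighted_hamming(a: Sequence[int], b: Sequence[int], w: Sequence[float]) -> float:
    """Sum of |w_j| over the positions where the two codes differ (assumed definition)."""
    return sum(abs(wj) for x, y, wj in zip(a, b, w) if x != y)


def nearest_index(q_code: Sequence[int], B: List[List[int]], w: Sequence[float]) -> int:
    """Index of the stored code closest to q_code in weighted Hamming distance."""
    return min(range(len(B)), key=lambda i: weighted_hamming(q_code, B[i], w))


# Hypothetical learned codes and weights for three stored items.
B_final = [[1, -1], [-1, 1], [1, 1]]
w_final = [1.0, 2.0]

code_1 = encode_query([0.4, 2.5], w_final)
hit_1 = nearest_index(code_1, B_final, w_final)

code_2 = encode_query([-0.2, 0.7], w_final)
hit_2 = nearest_index(code_2, B_final, w_final)
```

The returned index identifies the stored row, and the original high-dimensional data item at that index would be reported as the nearest neighbor. A linear scan is used here for clarity only.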

Brief description of the drawings

Fig. 1 is a flow diagram of the steps of the invention.

Specific embodiment

A weight-solving-based hash query method for high-dimensional big data, comprising the following steps:

1. Obtain the original high-dimensional data set X, consisting of n original high-dimensional data items, and the given query data q, where X is an n × d matrix. Reduce the dimensionality of X using principal component analysis to obtain the low-dimensional vector set V corresponding to X, V = (v_ij), i = 1, …, n, j = 1, …, c, where V is an n × c matrix and v_ij denotes the element of V corresponding to the j-th dimension of the i-th original high-dimensional data item. Then reduce the dimensionality of q using principal component analysis to obtain the low-dimensional vector q' corresponding to q.

2. Obtain the final binary code matrix B'' and the final weight matrix W'' by iteration, as follows:

2-1. Set the maximum number of iterations. Randomly initialize the binary code matrix B, B ∈ {-1, 1}^(n×c), and randomly initialize the weight matrix W, W = diag(w₁, w₂, …, w_j, …, w_c), where w_j denotes the dimension weight of the j-th dimension. The maximum number of iterations may be set to 50;

3. Weight-quantize each element of B'' according to W'' to obtain the weighted binary code matrix Z.
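Steps 1 and 3 of the embodiment can be sketched as follows. The text does not say how principal component analysis is computed, how q is projected, or what the weighted quantization formula is, so the sketch makes three assumptions: the principal directions are found by power iteration with deflation on the sample covariance matrix, q is projected with the mean and directions fitted on X, and weighted quantization scales each code column by its weight (z_ij = w_j · b_ij). All function names and data are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def pca_fit(X: Matrix, c: int, iters: int = 200, seed: int = 0) -> Tuple[List[float], Matrix]:
    """Return the column means of X and its top-c principal directions."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    mean = [sum(row[k] for row in X) / n for k in range(d)]
    Xc = [[row[k] - mean[k] for k in range(d)] for row in X]
    # Sample covariance matrix (d x d); requires n >= 2.
    C = [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(d)] for a in range(d)]
    components: Matrix = []
    for _ in range(c):
        # Random unit start vector for the power iteration.
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(t * t for t in u))
        u = [t / norm for t in u]
        for _ in range(iters):
            y = [sum(C[a][b] * u[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(t * t for t in y))
            if norm < 1e-12:  # no variance left in the deflated matrix
                break
            u = [t / norm for t in y]
        lam = sum(u[a] * C[a][b] * u[b] for a in range(d) for b in range(d))
        components.append(u)
        # Deflation: remove the direction just found before the next pass.
        C = [[C[a][b] - lam * u[a] * u[b] for b in range(d)] for a in range(d)]
    return mean, components


def pca_project(x: Sequence[float], mean: Sequence[float], components: Matrix) -> List[float]:
    """Project one d-dimensional item onto the fitted directions."""
    return [sum((x[k] - mean[k]) * u[k] for k in range(len(x))) for u in components]


def weighted_quantize(B: List[List[int]], w: Sequence[float]) -> Matrix:
    """ASSUMED weighted quantization: scale column j of B by w_j."""
    return [[w[j] * row[j] for j in range(len(w))] for row in B]


# Toy data lying on a line in 3-D, reduced to c = 1 dimension.
X_toy = [[0.0, 0.0, 0.0], [1.0, 1.0, 0.0], [2.0, 2.0, 0.0], [3.0, 3.0, 0.0]]
mean, comps = pca_fit(X_toy, c=1)
V_toy = [pca_project(row, mean, comps) for row in X_toy]
Z_toy = weighted_quantize([[1, -1], [-1, 1]], [1.0, 2.0])
```

The sign of each principal direction is arbitrary, so only the magnitudes and relative signs of the projected values are meaningful. In practice a library routine for principal component analysis would normally replace the hand-written power iteration.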

Claims

1. A weight-solving-based hash query method for high-dimensional big data, characterized by comprising the following steps:

1. Obtain the original high-dimensional data set X, consisting of n original high-dimensional data items, and the given query data q, where X is an n × d matrix. Reduce the dimensionality of X using principal component analysis to obtain the low-dimensional vector set V corresponding to X, V = (v_ij), i = 1, …, n, j = 1, …, c, where V is an n × c matrix and v_ij denotes the element of V corresponding to the j-th dimension of the i-th original high-dimensional data item. Then reduce the dimensionality of q using principal component analysis to obtain the low-dimensional vector q' corresponding to q;

2-2. Start the iterative process. In the current iteration, first hold W fixed and update B by minimizing the loss function [formula not reproduced]; the B that minimizes it is denoted B'. Here ||·||_F denotes the Frobenius norm of a matrix, the superscript 2 in the loss function denotes squaring, and b_ij denotes the binary code value corresponding to the j-th dimension of the i-th original high-dimensional data item;

2-3. Update the dimension weights of all dimensions in the current iteration. The weight w_j is updated as follows: denote the j-th column vector of B' by β_j and the j-th column vector of V by γ_j, so that B' = {β₁, β₂, …, β_j, …, β_c} and V = {γ₁, γ₂, …, γ_j, …, γ_c}. Holding B' fixed, update w_j by minimizing the per-dimension objective [formula not reproduced], where ||·||₂ denotes the 2-norm; the w_j that minimizes it is denoted w_j'. The weight matrix obtained by arranging, in order, the dimension weights of all dimensions updated in the current iteration is denoted W';

2-4. Check whether the number of iterations has reached the set maximum. If not, set W = W' and B = B', return to step 2-2 to begin the next iteration, and increment the iteration count by 1, where "=" in W = W' and B = B' denotes assignment. If the maximum has been reached, take the W' obtained in the current iteration as the final weight matrix W'' and the B' obtained in the current iteration as the final binary code matrix B'';

4. According to W'' and B'', obtain the binary code q'' corresponding to q' as the code that minimizes the query objective [formula not reproduced]. Search Z for the row vector with the smallest weighted Hamming distance to q'', and take the original high-dimensional data corresponding to that row vector as the final nearest-neighbor query result, completing the hash query for q.

2. The weight-solving-based hash query method for high-dimensional big data according to claim 1, characterized in that the maximum number of iterations set in step 2-1 is 50.