A weight-based hash query method for high-dimensional big data
Technical field
The present invention relates to a hash query method, and in particular to a weight-based hash query method for high-dimensional big data.
Background art
The approximate nearest neighbor problem is a fundamental problem in computer science. Hashing is generally an effective way to handle queries over large-scale high-dimensional data. In the related art, when a data set is hash-encoded, the weight of each dimension is not considered; that is, every dimension is treated as having equal weight. However, different dimensions of a hash code influence the similarity between data items to different degrees. Because the related art does not fully account for the weights of the hash code, there is room for improvement.
Summary of the invention
The technical problem to be solved by the present invention is to provide a weight-based hash query method for high-dimensional big data that applies weighted quantization to the codes according to the characteristics of the data set to obtain the corresponding hash codes, reduces data storage space and computation cost, and improves query accuracy.
The technical solution adopted by the present invention to solve the above technical problem is a weight-based hash query method for high-dimensional big data, comprising the following steps:
1. Obtain an original high-dimensional data set $X$ consisting of $n$ original high-dimensional data items, together with given query data $q$, where $X$ is an $n \times d$ matrix. Reduce the dimensionality of $X$ by principal component analysis to obtain the low-dimensional vector set $V = (v_{ij})_{n \times c}$ corresponding to $X$, where $V$ is an $n \times c$ matrix and $v_{ij}$ denotes the element of $V$ corresponding to the $j$-th dimension of the $i$-th original high-dimensional data item. Then reduce the dimensionality of $q$ by principal component analysis to obtain the low-dimensional vector $q'$ corresponding to $q$;
2. Obtain the final binary code matrix $B''$ and the final weight matrix $W''$ by iteration, as follows:
2-1. Set the maximum number of iterations; randomly initialize the binary code matrix $B$, $B \in \{-1, 1\}^{n \times c}$; randomly initialize the weight matrix $W$, $W = \mathrm{diag}(w_1, w_2, \ldots, w_j, \ldots, w_c)$, where $w_j$ denotes the dimension weight of the $j$-th dimension;
2-2. Start the iterative process. In the current iteration, first keep $W$ fixed and update $B$ by minimizing the loss function, which is expressed as a squared Frobenius norm; the $B$ that attains the minimum is denoted $B'$. Here $\|\cdot\|_F$ denotes the Frobenius norm of a matrix, the superscript 2 denotes squaring, and $b_{ij}$ denotes the binary code value corresponding to the $j$-th dimension of the $i$-th original high-dimensional data item;
2-3. Update the dimension weights of all dimensions in the current iteration. The weight $w_j$ is updated as follows: denote the $j$-th column vector of $B'$ as $\beta_j$ and the $j$-th column vector of $V$ as $\gamma_j$, so that $B' = \{\beta_1, \beta_2, \ldots, \beta_j, \ldots, \beta_c\}$ and $V = \{\gamma_1, \gamma_2, \ldots, \gamma_j, \ldots, \gamma_c\}$. Keeping $B'$ fixed, update $w_j$ by minimizing the corresponding objective, which is expressed with the 2-norm $\|\cdot\|_2$; the $w_j$ that attains the minimum is denoted $w_j'$. The weight matrix obtained by arranging in order the dimension weights of all dimensions updated in the current iteration is denoted $W'$;
2-4. Determine whether the number of iterations has reached the set maximum. If it has not, let $W = W'$ and $B = B'$, increment the iteration count by 1, and return to step 2-2 to begin the next iteration, where "=" in $W = W'$ and $B = B'$ denotes assignment. If it has, take the $W'$ obtained in the current iteration as the final weight matrix $W''$ and the $B'$ obtained in the current iteration as the final binary code matrix $B''$;
3. Perform weighted quantization on each element of $B''$ according to $W''$ to obtain the weighted binary code matrix $Z$;
4. According to $W''$ and $B''$, obtain the binary code $q''$ corresponding to $q'$ as the code at which the objective for $q'$ is minimal. Search $Z$ for the row vector with the smallest weighted Hamming distance to $q''$, and take the original high-dimensional data item corresponding to that row vector as the final nearest-neighbor query result, completing the hash query for $q$.
The maximum number of iterations set in step 2-1 is 50.
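As an illustration of step 1, the following is a minimal sketch of the dimensionality reduction using NumPy. The function names `pca_fit` and `pca_project`, the random stand-in data, and the sizes `n`, `d`, `c` are assumptions made for this example. The sketch also assumes that the query is projected with the mean and principal directions learned from $X$, which the text does not state explicitly.

```python
import numpy as np

def pca_fit(X, c):
    """Return the column means of X and its top-c principal directions (d x c)."""
    mean = X.mean(axis=0)
    # SVD of the centered data; the rows of Vt are the principal directions,
    # ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:c].T

def pca_project(data, mean, components):
    """Project the rows of data (or a single vector) onto the principal directions."""
    return (data - mean) @ components

rng = np.random.default_rng(0)
n, d, c = 200, 32, 8             # illustrative sizes, not taken from the source
X = rng.normal(size=(n, d))      # stand-in for the original data set X
q = rng.normal(size=d)           # stand-in for the query q

mean, components = pca_fit(X, c)
V = pca_project(X, mean, components)       # n x c matrix, plays the role of V
q_low = pca_project(q, mean, components)   # length-c vector, plays the role of q'
```

Reusing the same mean and directions for the query keeps $q'$ in the same coordinate system as the rows of $V$, which is what makes the later code comparison meaningful.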
Compared with the prior art, the present invention has the following advantages. Principal component analysis is first used to reduce the dimensionality of the original high-dimensional data set and of the given query data. A loss function is then constructed from the low-dimensional vectors according to the pairwise similarity-preserving principle, and the final binary code matrix and final weight matrix are obtained by minimizing this function iteratively. Each element of the final binary code matrix is then weight-quantized according to the final weight matrix, giving the weighted binary code matrix. The binary code corresponding to the given query data is obtained from the final binary code matrix and the final weight matrix. Finally, the weighted binary code matrix is searched for the row vector with the smallest weighted Hamming distance to the binary code of the query, and the original high-dimensional data item corresponding to that row vector is taken as the final nearest-neighbor query result, completing the hash query for the given query data. By replacing the Hamming distance with the weighted Hamming distance, the method makes better use of the information in the data set and preserves the similarity information between data items, improving both the accuracy and the efficiency of queries on the given query data.
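The difference between the two distances can be seen in a small example. The text does not define the weighted Hamming distance, so the sketch below assumes a common definition: the sum of the weights of the dimensions in which two codes differ. The codes and weights are made up for illustration.

```python
def hamming(a, b):
    """Number of positions in which two codes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def weighted_hamming(a, b, w):
    """Sum of the weights of the positions in which two codes differ
    (assumed definition, not given in the source)."""
    return sum(wj for x, y, wj in zip(a, b, w) if x != y)

query = [1, -1, 1, 1]
code_a = [-1, -1, 1, 1]          # differs from the query in dimension 0
code_b = [1, -1, 1, -1]          # differs from the query in dimension 3
weights = [0.9, 0.5, 0.3, 0.1]   # hypothetical dimension weights

d_a = hamming(query, code_a)
d_b = hamming(query, code_b)
wd_a = weighted_hamming(query, code_a, weights)
wd_b = weighted_hamming(query, code_b, weights)
```

Both candidates are at Hamming distance 1 from the query, so the unweighted distance cannot rank them. Under the weighted distance, `code_b` is closer because it differs only in a low-weight dimension.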
Brief description of the drawings
Fig. 1 is a flowchart of the steps of the present invention.
Specific embodiment
A weight-based hash query method for high-dimensional big data, comprising the following steps:
1. Obtain an original high-dimensional data set $X$ consisting of $n$ original high-dimensional data items, together with given query data $q$, where $X$ is an $n \times d$ matrix. Reduce the dimensionality of $X$ by principal component analysis to obtain the low-dimensional vector set $V = (v_{ij})_{n \times c}$ corresponding to $X$, where $V$ is an $n \times c$ matrix and $v_{ij}$ denotes the element of $V$ corresponding to the $j$-th dimension of the $i$-th original high-dimensional data item. Then reduce the dimensionality of $q$ by principal component analysis to obtain the low-dimensional vector $q'$ corresponding to $q$.
2. Obtain the final binary code matrix $B''$ and the final weight matrix $W''$ by iteration, as follows:
2-1. Set the maximum number of iterations; randomly initialize the binary code matrix $B$, $B \in \{-1, 1\}^{n \times c}$; randomly initialize the weight matrix $W$, $W = \mathrm{diag}(w_1, w_2, \ldots, w_j, \ldots, w_c)$, where $w_j$ denotes the dimension weight of the $j$-th dimension. The maximum number of iterations may be set to 50;
2-2. Start the iterative process. In the current iteration, first keep $W$ fixed and update $B$ by minimizing the loss function, which is expressed as a squared Frobenius norm; the $B$ that attains the minimum is denoted $B'$. Here $\|\cdot\|_F$ denotes the Frobenius norm of a matrix, the superscript 2 denotes squaring, and $b_{ij}$ denotes the binary code value corresponding to the $j$-th dimension of the $i$-th original high-dimensional data item;
2-3. Update the dimension weights of all dimensions in the current iteration. The weight $w_j$ is updated as follows: denote the $j$-th column vector of $B'$ as $\beta_j$ and the $j$-th column vector of $V$ as $\gamma_j$, so that $B' = \{\beta_1, \beta_2, \ldots, \beta_j, \ldots, \beta_c\}$ and $V = \{\gamma_1, \gamma_2, \ldots, \gamma_j, \ldots, \gamma_c\}$. Keeping $B'$ fixed, update $w_j$ by minimizing the corresponding objective, which is expressed with the 2-norm $\|\cdot\|_2$; the $w_j$ that attains the minimum is denoted $w_j'$. The weight matrix obtained by arranging in order the dimension weights of all dimensions updated in the current iteration is denoted $W'$;
2-4. Determine whether the number of iterations has reached the set maximum. If it has not, let $W = W'$ and $B = B'$, increment the iteration count by 1, and return to step 2-2 to begin the next iteration, where "=" in $W = W'$ and $B = B'$ denotes assignment. If it has, take the $W'$ obtained in the current iteration as the final weight matrix $W''$ and the $B'$ obtained in the current iteration as the final binary code matrix $B''$;
3. Perform weighted quantization on each element of $B''$ according to $W''$ to obtain the weighted binary code matrix $Z$.
4. According to $W''$ and $B''$, obtain the binary code $q''$ corresponding to $q'$ as the code at which the objective for $q'$ is minimal. Search $Z$ for the row vector with the smallest weighted Hamming distance to $q''$, and take the original high-dimensional data item corresponding to that row vector as the final nearest-neighbor query result, completing the hash query for $q$.
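The control flow of steps 2 to 4 can be sketched as follows. The loss functions of the method are not reproduced in this text, so the sketch substitutes a stand-in quantization loss $\|V - BW\|_F^2$ purely to make the alternating updates concrete; it is not the loss of the invention. The sketch also assumes that $Z$ is formed by scaling column $j$ of $B''$ by $w_j$. All function names are hypothetical.

```python
import numpy as np

def sign_pm1(a):
    """Map values to {-1, +1}, sending zeros to +1."""
    return np.where(a >= 0, 1.0, -1.0)

def fit_codes(V, max_iter=50, seed=0):
    """Alternate between B and W for the stand-in loss ||V - B W||_F^2."""
    rng = np.random.default_rng(seed)
    n, c = V.shape
    B = sign_pm1(rng.normal(size=(n, c)))   # step 2-1: random initial codes
    w = rng.uniform(0.5, 1.5, size=c)       # step 2-1: random diagonal of W
    for _ in range(max_iter):               # step 2-4: fixed iteration budget
        B = sign_pm1(V * w)                 # step 2-2: W fixed, update B
        w = (B * V).sum(axis=0) / n         # step 2-3: B fixed, least-squares w_j
    return B, w

def encode_query(q_low, w):
    """Binary code minimizing the stand-in loss for a single reduced query."""
    return sign_pm1(q_low * w)

def nearest_row(Z, z_query):
    """Index of the row of Z closest to z_query. For codes scaled by positive
    weights, half the L1 distance equals the weighted Hamming distance."""
    dist = np.abs(Z - z_query).sum(axis=1) / 2.0
    return int(np.argmin(dist))

rng = np.random.default_rng(1)
V = rng.normal(size=(100, 6))    # stand-in for the PCA output of step 1
B, w = fit_codes(V)
Z = B * w                        # step 3 (assumed form): scale column j by w_j
q_low = V[17]                    # reuse a stored item as the reduced query
q_code = encode_query(q_low, w)  # plays the role of q''
idx = nearest_row(Z, q_code * w) # step 4: weighted Hamming search in Z
```

With this stand-in loss both updates have closed forms, so the loop settles after the first pass; a pairwise similarity-preserving loss would generally need a more involved update for $B$. Because several stored items can share one code, `idx` identifies a row whose code matches the query code, which is not necessarily row 17.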