CN103605653A - Big data searching method based on sparse hash - Google Patents
- Publication number
- CN103605653A CN103605653A CN201310457033.4A CN201310457033A CN103605653A CN 103605653 A CN103605653 A CN 103605653A CN 201310457033 A CN201310457033 A CN 201310457033A CN 103605653 A CN103605653 A CN 103605653A
- Authority
- CN
- China
- Prior art keywords
- hash function
- dimension
- big data
- training set
- hash
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2255—Hash tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
- G06F16/24558—Binary matching operations
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a big-data similarity search method, in particular to a big-data retrieval method based on sparse hashing, intended mainly for application development built on big-data storage and retrieval. The method first determines the size of a training set by sampling, according to computer memory capacity and statistical theory; it then learns from the training set a hash function for encoding the big data, together with the binary codes of the training set, and encodes the big data in binary with the learned hash function. Online search can then be performed: for a test case, its binary code is first obtained from the learned hash function, and real-time search is then carried out over the binary codes of the big data. With this method, the time complexity of big-data retrieval is linear; the problem that manifold learning lacks an explicit mapping function is solved; big-data storage is reduced by a factor of several thousand; and the method is easy to implement, involving only simple mathematical models in the coding.
Description
Technical field
The present invention relates to the fields of computer science and information technology, specifically to big data, and in particular to a big-data retrieval method using sparse hashing for data such as pictures, text, and music.
Background art
Big data refers to data sets whose content cannot be retrieved and managed with conventional tools under acceptable conditions. Big data has four highly significant features: large volume, varied data types, low value density, and high processing speed requirements. Current research on knowledge discovery in big data concentrates mainly on four aspects: classification, clustering, retrieval, and incremental (batch, online, or parallel) learning.
At present there is relatively little research on the big-data retrieval problem. When retrieving, users usually want to obtain what they need quickly from all the data, which raises a trade-off between speed and accuracy. Twenty, or even ten, years ago researchers pursued accuracy. Various tree structures, such as the KD-tree and the M-tree, were therefore designed for exact database retrieval and found wide application. Over the last decade, with the growing ubiquity of networks and the emergence of big data, exact retrieval can no longer meet users' needs. A large body of literature shows that when the data dimension is below 10, exact retrieval meets users' actual needs well; once the dimension exceeds this threshold, however, the complexity of exact retrieval becomes very high, in the worst case traversing the entire database, which is clearly infeasible in practice.
In recent years, approximate retrieval has developed significantly, especially network retrieval, where users pursue fast, approximate multimedia search. Among the many approximate retrieval methods, hashing is the most outstanding. The principle of hashing is to reduce high-dimensional real-valued data to low-dimensional binary data while preserving the similarity between data points, and then to keep the big data set in computer memory or on external disk as far as possible, thereby achieving fast retrieval.
Summary of the invention
The present invention studies the approximate retrieval problem for big data.
The object of the present invention is to provide a simple and effective approximate retrieval algorithm for big data. The method addresses the high complexity, low accuracy, and related problems of big-data retrieval. It improves hashing performance by preserving the manifold structure of the data, which guarantees that the binary codes retain as much of the local structure of the original high-dimensional data as possible, and it reduces the algorithm complexity to linear through an effective optimization method. The present invention comprises two key processes: hash-function learning and real-time big-data retrieval. Hash-function learning itself comprises two sub-processes: converting high-dimensional real-valued data into low-dimensional real values, and converting the low-dimensional real values into binary codes of equal dimension. Real-time retrieval first converts an example into binary according to the learned hash functions and then searches in computer memory.
The concrete steps of this method are as follows:
(1) Sample data from the big data set as the training set for learning the hash functions. Because the volume of big data is enormous, statistical theory shows that it is unnecessary to use all the data as the training set; the present invention therefore first samples part of the data. The size n of the extracted training set is determined from t_{α/2}, the critical value of the t-distribution at the chosen confidence level, and ε, the maximum permissible error. The various parameter settings are given in the table below.
This yields the training set X.
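The sampling step above can be sketched numerically. The patent's exact sample-size formula appears only as an image, so the classical t-critical-value form n = ceil((t_{α/2}·σ/ε)²) is assumed here, with σ a spread estimate; the function name and parameters are illustrative, not the patent's.

```python
import math

def sample_size(critical_value: float, sigma: float, max_error: float) -> int:
    """Classical sample-size rule n = ceil((t * sigma / eps)^2).

    Assumed form: the patent defines n via the t-distribution critical value
    t_{alpha/2} and the maximum permissible error eps, but gives the exact
    formula only as an image; sigma is a spread estimate we introduce here.
    """
    return math.ceil((critical_value * sigma / max_error) ** 2)

# e.g. 95% confidence (normal approximation, t ~ 1.96), sigma = 10,
# maximum permissible error eps = 0.5:
n = sample_size(1.96, 10.0, 0.5)  # 1537 examples
```

With looser error tolerances the required training set shrinks quadratically, which is what makes training on a small sample of the big data set feasible.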
(2) Train the hash functions with X. First, an objective function is designed to convert the high-dimensional real-valued data into low-dimensional data. The objective function is defined as:

min_{B,S} ||X - BS||^2 + λ1 Σ_{i,j} w_{i,j} ||s_i - s_j||^2 + λ2 ||S||_1, subject to S ≥ 0,

where X is the training set; B is the basis space, each vector of B being a basis vector trained from the training set X; S is the low-dimensional real-valued projection of X onto the basis space B; λ1 and λ2 are tunable parameters obtained by ten-fold cross-validation; w_{i,j} is the projection, under a Gaussian kernel, of the Euclidean distance between two examples x_i and x_j of X; s_i and s_j are two vectors of the matrix S; B_{i,j} is the element in row i and column j of the matrix B; i = 1, ..., n indexes the examples and j = 1, ..., k indexes the basis vectors, with n the number of examples and k the number of basis vectors; and S ≥ 0 means that every element of S is non-negative.
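The objective combines reconstruction, manifold preservation, sparsity, and non-negativity. A numpy sketch of evaluating it follows; the Frobenius reconstruction term and the ℓ1 sparsity penalty are assumed from the patent's description of the terms (its formula appears only as an image), and all names are illustrative.

```python
import numpy as np

def sparse_hash_objective(X, B, S, W, lam1, lam2):
    """Value of the training objective, with term forms assumed from the
    patent's description of its four parts:
      ||X - B S||_F^2              reconstruction of X in the basis space B
      sum_ij W_ij ||s_i - s_j||^2  local-manifold preservation
      lam2 * ||S||_1               sparsity of S (an l1 penalty is assumed)
    The non-negativity constraint S >= 0 is enforced by the optimizer, not here.
    """
    recon = np.linalg.norm(X - B @ S, "fro") ** 2
    n = S.shape[1]  # columns of S are the low-dimensional codes s_i
    manifold = sum(
        W[i, j] * np.linalg.norm(S[:, i] - S[:, j]) ** 2
        for i in range(n)
        for j in range(n)
    )
    return recon + lam1 * manifold + lam2 * np.abs(S).sum()
```

An alternating-minimization scheme over B and S would drive this value down; the sketch only evaluates it, which is also what a ten-fold cross-validation loop over λ1 and λ2 needs.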
The first term, ||X - BS||^2, reconstructs the training set X in the basis space B to obtain S with minimal reconstruction error. The second term, Σ_{i,j} w_{i,j} ||s_i - s_j||^2, preserves the local manifold structure of the original training set X; this guarantees that the binary data retains the similarity of the original high-dimensional data, and therefore guarantees hashing performance. The third term guarantees that the obtained S is sparse, and the fourth constraint guarantees that the obtained S is non-negative. Under this objective function, the resulting S is the low-dimensional representation of X.
The second step of training the hash functions converts S into binary codes: each non-zero element of S is converted to 1, and each zero element to 0.
The third step of training obtains the hash functions. Suppose the dimension of S is d and the dimension of X is D, with D >> d; the binary code length is then d. Each of the d dimensions is treated as a binary vector (i.e., a two-class classification problem), and the present invention builds one hash function per dimension, d hash functions in total. Building them is simple: the examples of the training set X whose hash value is 1 form class A_{m1}, m = 1, ..., d, and the remaining examples, whose hash value is 0, form class A_{m0}, m = 1, ..., d, giving 2d classes in total; the hash function is then defined accordingly. In the formula, X_i is the i-th vector of the matrix X, S_i is the i-th vector of the matrix S, and i = 1, ..., n.
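As an illustration of the binarization rule and the 2d-class construction just described (function and variable names are ours, not the patent's), a numpy sketch:

```python
import numpy as np

def binarize(S):
    """Patent binarization: non-zero entries of S map to bit 1, zeros to 0
    (matching the worked example (0.4, 0, 0.1, 0.7) -> (1, 0, 1, 1))."""
    return (S != 0).astype(np.uint8)

def split_classes(X, codes):
    """For each of the d code dimensions m, split the training examples
    (columns of X) into class A_m1 (bit m is 1) and A_m0 (bit m is 0),
    yielding the 2d classes from which the d hash functions are built."""
    d = codes.shape[0]
    return [(X[:, codes[m] == 1], X[:, codes[m] == 0]) for m in range(d)]
```

Each (A_m1, A_m0) pair is a two-class problem, so any binary classifier trained on it yields an explicit hash function for dimension m.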
(3) For the examples in the big data set that have not yet obtained binary codes, the binary coding process is: for each example x, compute s = (B'B + 2I)^(-1) B'x to obtain the low-dimensional real value of x, then obtain its low-dimensional binary code through the hash functions; here B is the basis space defined in the preceding step and I is the identity matrix of the same dimension as B. In this way the whole big data set is encoded, so that the big data can be stored in computer memory or on external disk.
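The closed-form projection s = (B'B + 2I)^(-1) B'x is a small linear solve. A sketch follows; the float tolerance standing in for "exactly zero" in the binarization is our implementation choice, not specified by the patent.

```python
import numpy as np

def encode_example(B, x):
    """Out-of-sample coding from step (3): s = (B'B + 2I)^(-1) B' x,
    followed by the non-zero -> 1 binarization. A small tolerance stands
    in for exact zero, since the solve returns floats (our choice)."""
    k = B.shape[1]  # number of basis vectors, i.e. code length d
    s = np.linalg.solve(B.T @ B + 2.0 * np.eye(k), B.T @ x)
    bits = (np.abs(s) > 1e-12).astype(np.uint8)
    return s, bits
```

Using `solve` rather than forming the explicit inverse is numerically preferable; the +2I term keeps the system well conditioned even when B'B is singular.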
(4) For a new test case x_t, compute s_t = (B'B + 2I)^(-1) B'x_t to obtain the low-dimensional real value of x_t, then obtain its low-dimensional binary code through the hash functions; here B is the basis space defined above and I is the identity matrix of the same dimension as B. Finally, perform similarity search between the binary code of the test case and the binary codes of the big data to obtain its similar examples.
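The final online step is a similarity search in Hamming space. A minimal linear-scan sketch follows; packing each code into a Python int so that XOR plus popcount gives the Hamming distance is our illustrative layout, as the patent does not specify one.

```python
def hamming_search(query_code, db_codes, top=5):
    """Linear-scan nearest-neighbor search in Hamming space.
    Each code is a Python int holding the packed bits; XOR then popcount
    gives the Hamming distance. Returns indices of the `top` nearest codes."""
    ranked = sorted(
        range(len(db_codes)),
        key=lambda i: bin(query_code ^ db_codes[i]).count("1"),
    )
    return ranked[:top]
```

Because the codes fit in memory and each comparison is a couple of machine-word operations, the scan is linear in the database size, consistent with the linear retrieval complexity claimed above.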
Step (2) of the present invention is crucial: it guarantees the efficiency and the effectiveness of the algorithm. Its complexity is roughly cubic in the dimension D. In big-data applications, the dimension D is far smaller than the number of examples, so the complexity of the algorithm is linear in the number of examples. Moreover, because step (2) preserves the manifold structure of the data, the effectiveness of the algorithm is guaranteed; and because the generated low-dimensional real values are non-negative, the results are easy to interpret.
The sparse-hashing big-data retrieval model of the present invention is characterized in that: sparse coding and sampling are used to reduce the algorithm complexity; manifold-learning theory is used to generate the hash functions and improve hashing performance; explicit hash functions are generated, avoiding the implicit hash functions of manifold learning; the binarization principle yields interpretable hash results; and the storage requirement of big data is significantly reduced.
Sampling the big data: performing data-mining learning on an entire big data set is usually very difficult, and even when feasible the complexity is very high. The sampling method makes learning on big data feasible and reduces the complexity to linear, which is exactly the result expected of big-data mining.
Manifold-embedded hash learning model: manifold theory has proven to be a very effective method for preserving local structure, which is particularly important for building hash models. The present invention adds a manifold regularization term to the hash-learning process. The primary purpose is to preserve the manifold structure of the data set and thereby guarantee high hashing performance; secondly, a novel optimization method obtains an explicit expression of the hash function, solving the long-standing difficulty that manifold learning has no explicit mapping function.
Interpretability of the binarization: when converting the low-dimensional real values to binary, the non-negative representation and the novel binarization make the resulting binary representation interpretable and similarity-preserving. This differs from the binarization methods of existing hashing approaches.
Low complexity: thanks to the efficient optimization method and the sampling method, the complexity of learning the hash function is independent of the number of big-data examples, and the worst-case complexity is linear.
Low storage: because the innovative binary codes replace the storage of the original data, big-data storage saves space by a factor of up to tens of thousands.
Brief description of the drawings
Fig. 1 shows the dimensionality-reduction result for a test case;
Fig. 2 shows the binary code of the picture in Fig. 1.
Embodiment
70,000 animal pictures were randomly collected from the network. Suppose each picture needs 1 MB of storage (note that these are not high-fidelity pictures); the whole data set then needs 70 GB of storage. The present invention replaces each picture with a 4-bit binary code, so the whole set needs only about 3.5 KB of storage, a saving of nearly 20,000 times compared with the original.
(1) Because an ordinary computer with 4 GB of memory can run the present algorithm on 100,000 examples, no sampling is needed for this data set; the hash functions are trained directly on all 70,000 examples, and each example is finally represented by 4 binary bits.
(2) For each test case, the present invention first obtains its low-dimensional real-valued representation, e.g. (0.4, 0, 0.1, 0.7) (see Fig. 1).
This representation: 1) reduces the original 784-dimensional picture description to 4 dimensions; 2) preserves its local structure, so that neighbors in the original space remain neighbors in the low-dimensional space; and 3) is non-negative, which gives the method clear semantics, i.e., interpretability. According to the figure, the monkey picture can be regarded as reconstructed from four bases, the weight of each base being its coordinate in the representation, i.e., (0.4, 0, 0.1, 0.7). Since the weight of the second dimension is 0, the picture is not composed from the second base. According to the binarization principle of the present invention, the binary code of this picture is (1, 0, 1, 1) (see Fig. 2).
(3) From this binary code the present invention likewise concludes that the picture is not composed from the second base; the coding process is therefore explainable. It is also easy to show that the method preserves similarity. For example, two four-dimensional pictures (0.51, 0.51, 0.51, 0.51) and (0.49, 0.49, 0.49, 0.49) are both encoded by the present invention as (1, 1, 1, 1). Their Euclidean distance in the real-valued space shows they are similar, and the codes obtained by the present invention are likewise similar. A common hashing scheme, however, would encode these two pictures as (1, 1, 1, 1) and (0, 0, 0, 0): the similarity of the original space is not preserved in the binary (Hamming) space. This shows that the similarity preservation of the present invention is effective.
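The contrast drawn in this paragraph can be checked directly. A small sketch comparing the patent's non-zero binarization with a hypothetical 0.5-threshold baseline (the threshold value is our assumption for the "common hashing scheme"):

```python
def code_nonzero(v):
    """Patent's rule: bit 1 iff the component is non-zero."""
    return tuple(1 if x != 0 else 0 for x in v)

def code_threshold(v, t=0.5):
    """A common baseline: bit 1 iff the component exceeds threshold t."""
    return tuple(1 if x > t else 0 for x in v)

a = (0.51, 0.51, 0.51, 0.51)
b = (0.49, 0.49, 0.49, 0.49)
# Non-zero coding keeps the two very similar vectors identical in Hamming
# space, while thresholding at 0.5 maps them to opposite hypercube corners.
```

The two similar vectors straddle the 0.5 threshold in every coordinate, which is exactly the failure mode the patent's binarization avoids.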
Claims (7)
1. A big-data retrieval method based on sparse hashing, comprising the steps of:
(1) sampling data from the big data as the training set X;
(2) training the hash functions with X;
(3) binary-coding the examples of the big data set that have not yet obtained binary codes, and storing the coded big data in computer memory or on external disk;
(4) for a new test case, first obtaining its low-dimensional real value and then its low-dimensional binary code, and finally performing similarity search between the binary code of the test case and the binary codes of the big data to obtain its similar examples.
3. The method according to claim 1, wherein step (2) comprises the following processes:
A) establishing the objective function
min_{B,S} ||X - BS||^2 + λ1 Σ_{i,j} w_{i,j} ||s_i - s_j||^2 + λ2 ||S||_1, subject to S ≥ 0,
where X is the training set; B is the basis space, each vector of B being a basis vector trained from the training set X; S is the low-dimensional real-valued projection of X onto the basis space B; λ1 and λ2 are tunable parameters obtained by ten-fold cross-validation; w_{i,j} is the projection, under a Gaussian kernel, of the Euclidean distance between two examples x_i and x_j of X; s_i and s_j are two vectors of the matrix S; B_{i,j} is the element in row i and column j of the matrix B; i = 1, ..., n indexes the examples and j = 1, ..., k indexes the basis vectors, with n the number of examples and k the number of basis vectors; and S ≥ 0 means that every element of S is non-negative;
B) converting S into binary codes;
C) building the hash functions.
4. The method according to claim 3, wherein in process B) each non-zero element of S is converted to 1 and each zero element to 0.
5. The method according to claim 3, wherein process C) builds the hash functions as follows: the examples of the training set X whose hash value is 1 form class A_{m1}, m = 1, ..., d, and the remaining examples, whose hash value is 0, form class A_{m0}, m = 1, ..., d, giving 2d classes, and the hash function is defined accordingly: if the dimension of S is d and the dimension of X is D with D >> d, each of the d dimensions is a binary vector, and one hash function is built for each of the d dimensions, d hash functions in total; in the formula, X_i is the i-th vector of the matrix X, S_i is the i-th vector of the matrix S, and i = 1, ..., n.
6. The method according to claim 1, wherein in step (3), for each example x of the big data, s = (B'B + 2I)^(-1) B'x gives the low-dimensional real value of x, and its low-dimensional binary code is then obtained through the hash functions; B is the basis space defined in the preceding step and I is the identity matrix of the same dimension as B.
7. The method according to claim 1, wherein in step (4), for each example x_t of the test data set, s_t = (B'B + 2I)^(-1) B'x_t gives the low-dimensional real value of x_t, and its low-dimensional binary code is then obtained through the hash functions; B is the basis space defined above and I is the identity matrix of the same dimension as B.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310457033.4A CN103605653B (en) | 2013-09-29 | 2013-09-29 | Big data retrieval method based on sparse hash |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310457033.4A CN103605653B (en) | 2013-09-29 | 2013-09-29 | Big data retrieval method based on sparse hash |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103605653A true CN103605653A (en) | 2014-02-26 |
CN103605653B CN103605653B (en) | 2017-01-04 |
Family
ID=50123878
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310457033.4A Expired - Fee Related CN103605653B (en) | 2013-09-29 | 2013-09-29 | Big data retrieval method based on sparse hash |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103605653B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102402617A (en) * | 2011-12-23 | 2012-04-04 | 天津神舟通用数据技术有限公司 | Easily compressed database index storage system using fragments and sparse bitmap, and corresponding construction, scheduling and query processing methods |
- 2013-09-29: application CN201310457033.4A filed in China; granted patent CN103605653B now not active (expired due to fee non-payment)
Non-Patent Citations (3)
Title |
---|
XIAOFENG ZHU,ETC.: "Sparse Hashing for Fast Multimedia Search", 《ACM TRANSACTIONS ON INFORMATION SYSTEMS》, vol. 31, no. 2, 31 May 2013 (2013-05-31), XP058018274, DOI: http://dx.doi.org/10.1145/2457465.2457469 * |
张啸: "基于稀疏谱哈希的图像索引", 《中国优秀硕士学位论文全文数据库》, no. 7, 15 July 2011 (2011-07-15), pages 138 - 506 * |
欧阳遄飞: "基于结构化稀疏谱哈希的图像索引算法", 《中国优秀硕士学位论文全文数据库》, no. 7, 15 July 2012 (2012-07-15), pages 138 - 2166 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104462458A (en) * | 2014-12-16 | 2015-03-25 | 芜湖乐锐思信息咨询有限公司 | Data mining method of big data system |
CN104462459A (en) * | 2014-12-16 | 2015-03-25 | 芜湖乐锐思信息咨询有限公司 | Neural network based big data analysis and processing system and method |
CN104484566A (en) * | 2014-12-16 | 2015-04-01 | 芜湖乐锐思信息咨询有限公司 | Big data analysis system and big data analysis method |
CN113377294A (en) * | 2021-08-11 | 2021-09-10 | 武汉泰乐奇信息科技有限公司 | Big data storage method and device based on binary data conversion |
CN113377294B (en) * | 2021-08-11 | 2021-10-22 | 武汉泰乐奇信息科技有限公司 | Big data storage method and device based on binary data conversion |
Also Published As
Publication number | Publication date |
---|---|
CN103605653B (en) | 2017-01-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Xie et al. | Generative pointnet: Deep energy-based learning on unordered point sets for 3d generation, reconstruction and classification | |
Zafar et al. | A novel discriminating and relative global spatial image representation with applications in CBIR | |
Ding et al. | Cross-modal hashing via rank-order preserving | |
Xie et al. | Contextual query expansion for image retrieval | |
Zhang et al. | Pointwise geometric and semantic learning network on 3D point clouds | |
EP3166020A1 (en) | Method and apparatus for image classification based on dictionary learning | |
Angrish et al. | MVCNN++: computer-aided design model shape classification and retrieval using multi-view convolutional neural networks | |
Ali et al. | Modeling global geometric spatial information for rotation invariant classification of satellite images | |
Sadeghi-Tehran et al. | Scalable database indexing and fast image retrieval based on deep learning and hierarchically nested structure applied to remote sensing and plant biology | |
CN104199842A (en) | Similar image retrieval method based on local feature neighborhood information | |
Bu et al. | Local deep feature learning framework for 3D shape | |
Luo et al. | Asymmetric discrete cross-modal hashing | |
CN103473307A (en) | Cross-media sparse Hash indexing method | |
Zhang et al. | Scalable discrete matrix factorization and semantic autoencoder for cross-media retrieval | |
Hou et al. | Hitpr: Hierarchical transformer for place recognition in point cloud | |
CN103605653A (en) | Big data searching method based on sparse hash | |
Pont et al. | Principal geodesic analysis of merge trees (and persistence diagrams) | |
CN105760875A (en) | Binary image feature similarity discrimination method based on random forest algorithm | |
Wang et al. | Improving deep learning on point cloud by maximizing mutual information across layers | |
Li et al. | Sketch4Image: a novel framework for sketch-based image retrieval based on product quantization with coding residuals | |
Lv et al. | Retrieval oriented deep feature learning with complementary supervision mining | |
Zhang et al. | Modeling spatial and semantic cues for large-scale near-duplicated image retrieval | |
CN103324942A (en) | Method, device and system for image classification | |
Li et al. | ViT2CMH: Vision Transformer Cross-Modal Hashing for Fine-Grained Vision-Text Retrieval. | |
Zhao et al. | MapReduce-based clustering for near-duplicate image identification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20170104 |