CN103605653A - Big data searching method based on sparse hash - Google Patents

Big data searching method based on sparse hash

Info

Publication number
CN103605653A
CN103605653A (application CN201310457033.4A)
Authority
CN
China
Prior art keywords
hash function
dimension
big data
training set
hash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310457033.4A
Other languages
Chinese (zh)
Other versions
CN103605653B (en)
Inventor
朱晓峰 (Xiaofeng Zhu)
张师超 (Shichao Zhang)
刘星毅 (Xingyi Liu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University filed Critical Guangxi Normal University
Priority to CN201310457033.4A priority Critical patent/CN103605653B/en
Publication of CN103605653A publication Critical patent/CN103605653A/en
Application granted granted Critical
Publication of CN103605653B publication Critical patent/CN103605653B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G06F16/24558 Binary matching operations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a big data similarity search method, in particular to a big data retrieval method based on sparse hashing. The method is mainly intended for application development based on the storage and retrieval of big data. The method first determines the size of a training set by sampling, in accordance with the computer's memory capacity and statistical theory; it then learns from the training set a hash function for binary-coding the big data, together with the binary codes of the training set, and binary-codes the big data with the learnt hash function. At this point online search applications can be run: for a test example, its binary code is first obtained with the learnt hash function, and a real-time search is then performed over the binary codes of the big data. With this method the time complexity of big data retrieval is linear, the problem that manifold learning has no explicit function is solved, the storage requirement of the big data is reduced by a factor of thousands, and the method is easy to implement, involving only simple mathematical models in the coding.

Description

Big data retrieval method based on sparse hashing
Technical field
The present invention relates to the field of computer science and technology and to information technology, and specifically to big data; in particular, it uses sparse hashing to retrieve big data such as pictures, text and music.
Background technology
Big data refers to data sets whose content cannot be retrieved and managed with conventional tools within an acceptable time. Large volume, diverse data types, low value density and high processing speed are the four most significant features of big data. Current research on knowledge discovery in big data concentrates on four aspects: partitioning, clustering, retrieval, and incremental (batch, online or parallel) learning.
So far there has been relatively little research on the problem of big data retrieval. When retrieving, users usually wish to obtain what they need quickly from all the data, which raises a trade-off between speed and accuracy. Twenty, or even ten, years ago researchers pursued accuracy; various tree structures such as KD-trees and M-trees were therefore designed for exact database retrieval and found wide application. In the last decade, with the growing ubiquity of networks and the emergence of big data, exact retrieval can no longer meet users' needs. A large body of literature shows that when the dimensionality of the data is below 10, exact retrieval meets users' practical needs well. Once the dimensionality exceeds this threshold, however, the complexity of exact retrieval becomes very high, in the worst case reaching the complexity of traversing the whole database, which is clearly infeasible in practical applications.
In recent years approximate retrieval has made significant progress, particularly in network retrieval, where users pursue fast, approximate multimedia retrieval. Among the many approximate retrieval methods, hashing is the most prominent. The principle of hashing is to reduce high-dimensional real-valued data to low-dimensional binary data while preserving the similarity between data, and then to keep the large data set in computer memory or on external disk as far as possible, thereby achieving fast retrieval.
Summary of the invention
The present invention studies the problem of approximate retrieval over big data.
The object of the present invention is to provide a simple and effective approximate retrieval algorithm for big data. The method addresses the high complexity and low accuracy of big data retrieval: it improves hashing performance by preserving the local manifold structure of the data, so that the binary codes retain as much as possible of the structure of the original high-dimensional data, and it reduces the algorithmic complexity to linear through an effective optimization method. The present invention comprises two key processes, namely hash-function learning and real-time retrieval over big data. Hash-function learning itself comprises two processes: converting the high-dimensional real-valued data into low-dimensional real values, and converting the low-dimensional real values into binary codes of equal dimensionality. Real-time retrieval first converts an example into binary code according to the learnt hash function, and then searches in computer memory.
The concrete steps of this method are as follows:
(1) Sample data from the big data as a training set for training the hash function. The volume of big data is too large and, by statistical theory, not all of the data need serve as the training set, so the present invention first samples part of the data as the training set. The size n of the extracted training set is determined by a sample-size formula from t_{α/2}, the critical value of the t distribution at the chosen confidence level, and ε, the maximum permissible error. The parameter settings are listed in the table below.
This yields the training set X.
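A minimal sketch of this sampling step in Python follows. The concrete formula n = (t_{α/2}/(2ε))², the conservative worst-case-variance bound, is an assumption: the text above fixes only the ingredients t_{α/2} and ε. The degrees of freedom df and the uniform sampler are likewise illustrative choices.

    import numpy as np
    from scipy import stats

    def training_set_size(alpha=0.05, eps=0.01, df=10**6):
        """Training-set size n from the t critical value t_{alpha/2} and max error eps.
        ASSUMPTION: conservative bound n = (t_{alpha/2} / (2*eps))**2."""
        t_crit = stats.t.ppf(1.0 - alpha / 2.0, df)  # critical value of the t distribution
        return int(np.ceil((t_crit / (2.0 * eps)) ** 2))

    def sample_training_set(X_big, n, seed=0):
        """Uniformly sample n columns (examples) of the D x N big-data matrix X_big."""
        rng = np.random.default_rng(seed)
        idx = rng.choice(X_big.shape[1], size=min(n, X_big.shape[1]), replace=False)
        return X_big[:, idx]

For example, a 95% confidence level and a 1% maximum error give n = ceil((1.96/0.02)²) = 9604 examples.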
(2) Train the hash functions with X. The first step is to design an objective function that converts the high-dimensional real-valued data into low-dimensional data. The objective function is defined as:

min_{B,S} ||X − BS||² + λ₁ Σ_{i,j} w_{i,j} ||s_i − s_j||² + λ₂ ||S||₁, subject to S ≥ 0,

wherein X is the training set; B is the base space, each vector of B being a base vector trained from the training set X; S is the projection of X onto the base space B as low-dimensional real values; λ₁ and λ₂ are tunable parameters obtained by ten-fold cross-validation; w_{i,j} is the Euclidean distance between two examples x_i and x_j of X projected through a Gaussian kernel; s_i and s_j are two vectors of the matrix S; B_{i,j} is the element of row i and column j of the matrix B; i = 1, …, n indexes the examples and j = 1, …, k indexes the base vectors, n being the number of examples and k the number of base vectors; and the constraint S ≥ 0 means that every element of S is non-negative.
The first term ||X − BS||² reconstructs the training set X in the base space B to obtain S with the smallest possible reconstruction error. The second term Σ_{i,j} w_{i,j} ||s_i − s_j||² preserves the local manifold structure of the original training set X; this guarantees that the binary data maintain the similarity of the original high-dimensional data and thereby guarantees the hashing performance. The third term guarantees that the obtained S is sparse. The fourth component, the constraint, guarantees that the obtained S is non-negative. Under this objective function, the S obtained is the low-dimensional representation of X.
The second step of training the hash functions converts S into binary code: every non-zero element of S is converted to 1, and every zero element to 0.
The third step of training the hash functions obtains the hash functions themselves. Suppose the dimensionality of S is d and the dimensionality of X is D, with D >> d; the binary code length is then d. Each of the d dimensions is treated as a vector, and this vector is binary (i.e. a two-class classification problem); the present invention establishes one hash function per dimension, d hash functions in all. Establishing a hash function is very simple: for dimension m, the examples of the training set X whose hash value is 1 form the class A_{m1}, m = 1, …, d, and the remaining examples, whose hash value is 0, form the class A_{m0}, m = 1, …, d, giving 2d classes in all. The hash function is defined as:

h_m(x_i) = arg min_{j ∈ {0,1}} ||x_i − A_{mj} s_i||²,

wherein x_i is the i-th vector of the matrix X and s_i the i-th vector of the matrix S, i = 1, …, n.
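A compact sketch of step (2) follows, under stated assumptions: the objective is the one given above, but the optimizer (alternating ridge least squares for B and projected gradient for S), the kernel bandwidth sigma, the step size eta and the iteration count are illustrative choices not fixed by the text.

    import numpy as np

    def gaussian_weights(X, sigma=1.0):
        """w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for columns x_i of X (D x n)."""
        sq = np.sum(X ** 2, axis=0)
        d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def train_sparse_hash(X, d=4, lam1=0.1, lam2=0.1, sigma=1.0,
                          iters=200, eta=1e-3, seed=0):
        """Minimize ||X - B S||^2 + lam1 * sum_ij w_ij ||s_i - s_j||^2 + lam2 * ||S||_1
        subject to S >= 0, by alternating updates (an assumed optimizer)."""
        rng = np.random.default_rng(seed)
        D, n = X.shape
        B = rng.standard_normal((D, d))
        S = np.abs(rng.standard_normal((d, n)))
        W = gaussian_weights(X, sigma)
        L = np.diag(W.sum(axis=1)) - W                 # graph Laplacian of W
        for _ in range(iters):
            # gradient of the smooth part wrt S; the l1 term contributes +lam2 on S >= 0
            G = 2.0 * B.T @ (B @ S - X) + 4.0 * lam1 * (S @ L) + lam2
            S = np.maximum(S - eta * G, 0.0)           # projected gradient step
            B = X @ S.T @ np.linalg.inv(S @ S.T + 1e-6 * np.eye(d))  # ridge LS for B
        return B, S

    def binarize(S):
        """Binarization rule of the method: non-zero -> 1, zero -> 0 (S is non-negative)."""
        return (S > 0).astype(np.uint8)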
(3) For the examples of the big data set that have not yet obtained binary codes, the binary coding process is: for each example x, obtain the low-dimensional real values of x by s = (B'B + 2I)⁻¹B'x, then obtain its low-dimensional binary code through the hash functions; wherein B is the base space defined in the preceding step and I is the identity matrix of matching dimension. In this way the whole of the big data is coded, so that the big data can be stored in computer memory or on external disk.
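A sketch of this coding step, applying the closed-form projection above column-wise. Binarizing with the non-zero rule rather than the class-distance hash functions h_m is a simplifying assumption, as is the tolerance tol standing in for "exactly zero".

    def encode(B, X_new, tol=1e-8):
        """s = (B'B + 2I)^{-1} B' x for each column x of X_new (D x m),
        then the binarization rule: non-zero -> 1, zero -> 0."""
        d = B.shape[1]
        P = np.linalg.solve(B.T @ B + 2.0 * np.eye(d), B.T)  # (B'B + 2I)^{-1} B'
        S_new = P @ X_new                                    # low-dimensional real values
        return (np.abs(S_new) > tol).astype(np.uint8)        # d x m binary codes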
(4) For a new test example x_t, obtain the low-dimensional real values of x_t by s_t = (B'B + 2I)⁻¹B'x_t, then obtain its low-dimensional binary code through the hash functions; wherein B is the base space defined in the preceding step and I is the identity matrix of matching dimension. Finally, perform a similarity search of the binary code of the test example against the binary codes of the big data to obtain its similar examples.
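The online search itself is a linear scan by Hamming distance over the stored codes; the top-k interface below is an illustrative choice.

    def hamming_search(codes, query, top=5):
        """codes: d x N binary codes of the big data; query: binary d-vector.
        Returns the indices of the `top` codes nearest in Hamming distance."""
        dists = np.count_nonzero(codes != query[:, None], axis=0)  # one pass over N codes
        return np.argsort(dists)[:top]

For a test example x_t, e.g. q = encode(B, x_t.reshape(-1, 1))[:, 0] followed by hamming_search(codes, q) returns its similar examples.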
Step (2) of the present invention is the key step: it guarantees the efficiency and the effect of the algorithm. Its complexity is approximately cubic in the dimensionality D. In big data applications the dimensionality D is far smaller than the number of examples, so the complexity of the present invention is linear in the number of examples. Moreover, because step (2) takes care to preserve the manifold structure of the data, the effect of the algorithm is guaranteed; and because the low-dimensional real values generated are non-negative, the results obtained are easy to interpret.
The sparse-hashing big data retrieval model of the present invention is characterized in that: sparse coding and sampling are used to reduce the algorithmic complexity; manifold-learning theory is used to generate the hash functions and improve the hashing performance; explicit hash functions are generated, avoiding the implicit hash functions of manifold learning; the binarization principle yields interpretable hash results; and the storage requirement of big data is significantly reduced.
Sampling the big data: it is usually very difficult to carry out data-mining studies over the whole of a big data set; even when feasible, the complexity is very high. The sampling method makes such operations on big data feasible and reduces the complexity to linear, which is exactly the result expected of big data mining.
Embedding manifold theory in the hash learning model: manifold theory has been shown to be a very effective way of preserving local structure, which is particularly important for building hash models. The present invention adds a manifold regularization factor to the hash-learning process. Its primary purpose is to preserve the manifold structure of the data set so as to guarantee high hashing performance; secondly, a novel optimization method yields an explicit expression for the hash function, resolving the past difficulty that manifold learning has no explicit expression.
Interpretability of the binarization: when the low-dimensional real values are converted to binary codes of equal dimensionality, the non-negative representation and the novel binary transform make the resulting binary representation interpretable while similarity continues to be preserved. This differs from the binarization of existing hashing methods.
Low complexity: owing to the efficient optimization method and the sampling method, the complexity of learning the hash functions is independent of the number of big data examples, and the worst-case complexity is linear.
Low storage: because the innovative binary codes replace the storage of the original data, the storage of the big data saves up to ten-thousand-fold space.
Accompanying drawing explanation
Fig. 1 is the dimensionality reduction result of a test case;
Fig. 2 is the binary code of the picture of Fig. 1.
Embodiment
70,000 animal pictures were captured at random from the network. Suppose each picture needs 1 MB of storage space (note that such a picture is no longer of very high fidelity); the whole data set then needs 70 GB of storage space. The present invention replaces each picture with a 4-bit binary code, so that in total only about 3.5K of storage space is needed, a saving of nearly 20,000-fold over the original storage.
(1) An ordinary computer with 4 GB of memory can run the algorithm of the present invention on 100,000 examples, so for this data set the present invention need not sample: it trains directly on the 70,000 examples to obtain the hash functions, and finally each example is represented by 4 binary bits.
(2) For each test example, the present invention first obtains its low-dimensional real-valued representation, e.g. (0.4, 0, 0.1, 0.7) (see Fig. 1).
This representation: 1) reduces the 784 dimensions originally describing the picture to 4 dimensions; 2) preserves the local structure, so that neighbours in the original space remain neighbours in the low-dimensional space; 3) is non-negative, which gives the method of the invention a clear semantics, i.e. interpretability. According to Fig. 1, the invention holds that the picture of the monkey can be reconstructed from four bases, the weight of each base being its new coordinate, i.e. (0.4, 0, 0.1, 0.7); the weight of the second dimension is evidently 0, so one can say that this picture is not formed from the second base. According to the binarization principle of the present invention, the binary code of this picture is (1, 0, 1, 1) (see Fig. 2).
(3) From this binary code the present invention likewise concludes that the picture is not formed from the second base, so the coding process of the invention is interpretable. It is also easy to verify that the invention preserves similarity. For example, consider two four-dimensional pictures (0.51, 0.51, 0.51, 0.51) and (0.49, 0.49, 0.49, 0.49); the present invention encodes them as (1, 1, 1, 1) and (1, 1, 1, 1). Their Euclidean distance in the real-valued space shows that they are similar, and the codes obtained by the present invention are likewise similar. With a common hash-coding rule, however, these two pictures are encoded as (1, 1, 1, 1) and (0, 0, 0, 0): the similarity of the original space is not preserved in the binary (i.e. Hamming) space. This shows that the similarity preservation of the present invention is effective.
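The contrast above can be checked directly; the 0.5 threshold for the "common" coding rule is an assumption for illustration.

    import numpy as np

    a = np.array([0.51, 0.51, 0.51, 0.51])
    b = np.array([0.49, 0.49, 0.49, 0.49])
    print((a > 0).astype(int), (b > 0).astype(int))      # [1 1 1 1] [1 1 1 1] -> preserved
    print((a > 0.5).astype(int), (b > 0.5).astype(int))  # [1 1 1 1] [0 0 0 0] -> lost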

Claims (7)

1. A big data retrieval method based on sparse hashing, comprising the steps of:
(1) sampling data from the big data as a training set X;
(2) training hash functions with X;
(3) binary-coding the examples of the big data set that have not yet obtained binary codes, and storing the coded big data in computer memory or on external disk;
(4) for a new test example, first obtaining its low-dimensional real values, then obtaining its low-dimensional binary code, and finally performing a similarity search of the binary code of the test example against the binary codes of the big data to obtain its similar examples.
2. The method according to claim 1, wherein the size n of the training set X of step (1) is determined by a sample-size formula from t_{α/2}, the critical value of the t distribution at the given confidence level, and ε, the set maximum permissible error.
3. The method according to claim 1, wherein step (2) comprises the following processes:
A). establishing the objective function

min_{B,S} ||X − BS||² + λ₁ Σ_{i,j} w_{i,j} ||s_i − s_j||² + λ₂ ||S||₁, subject to S ≥ 0,

wherein X is the training set; B is the base space, each vector of B being a base vector trained from the training set X; S is the projection of X onto the base space B as low-dimensional real values; λ₁ and λ₂ are tunable parameters obtained by ten-fold cross-validation; w_{i,j} is the Euclidean distance between two examples x_i and x_j of X projected through a Gaussian kernel; s_i and s_j are two vectors of the matrix S; B_{i,j} is the element of row i and column j of the matrix B; i = 1, …, n indexes the examples and j = 1, …, k indexes the base vectors, n being the number of examples and k the number of base vectors; and the constraint S ≥ 0 means that every element of S is non-negative;
B). converting S into binary code;
C). establishing the hash functions.
4. The method according to claim 3, wherein in process B) every non-zero element of S is converted to 1, and every zero element to 0.
5. The method according to claim 3, wherein the process of establishing the hash functions in process C) is: for each dimension m, the examples of the training set X whose hash value is 1 form the class A_{m1}, m = 1, …, d, and the remaining examples, whose hash value is 0, form the class A_{m0}, m = 1, …, d, giving 2d classes; the hash function is defined as

h_m(x_i) = arg min_{j ∈ {0,1}} ||x_i − A_{mj} s_i||²,

wherein, if the dimensionality of S is d and the dimensionality of X is D, with D >> d, each of the d dimensions is a binary vector and one hash function is established for each of the d dimensions, d hash functions in all; and x_i is the i-th vector of the matrix X, s_i the i-th vector of the matrix S, i = 1, …, n.
6. The method according to claim 1, wherein in step (3), for each example x of the big data, the low-dimensional real values of x are obtained by s = (B'B + 2I)⁻¹B'x, and its low-dimensional binary code is then obtained through the hash functions; wherein B is the base space defined in the preceding steps and I is the identity matrix of matching dimension.
7. The method according to claim 1, wherein in step (4), for each example x_t of the test data set, the low-dimensional real values of x_t are obtained by s_t = (B'B + 2I)⁻¹B'x_t, and its low-dimensional binary code is then obtained through the hash functions; wherein B is the base space defined in the preceding steps and I is the identity matrix of matching dimension.
CN201310457033.4A 2013-09-29 2013-09-29 Big data retrieval method based on sparse hash Expired - Fee Related CN103605653B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310457033.4A CN103605653B (en) 2013-09-29 2013-09-29 Big data retrieval method based on sparse hash

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310457033.4A CN103605653B (en) 2013-09-29 2013-09-29 Big data retrieval method based on sparse hash

Publications (2)

Publication Number Publication Date
CN103605653A true CN103605653A (en) 2014-02-26
CN103605653B CN103605653B (en) 2017-01-04

Family

ID=50123878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310457033.4A Expired - Fee Related CN103605653B (en) 2013-09-29 2013-09-29 Big data retrieval method based on sparse hash

Country Status (1)

Country Link
CN (1) CN103605653B (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402617A (en) * 2011-12-23 2012-04-04 天津神舟通用数据技术有限公司 Easily compressed database index storage system using fragments and sparse bitmap, and corresponding construction, scheduling and query processing methods

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
XIAOFENG ZHU ET AL.: "Sparse Hashing for Fast Multimedia Search", ACM TRANSACTIONS ON INFORMATION SYSTEMS, vol. 31, no. 2, 31 May 2013 (2013-05-31), XP058018274, DOI: 10.1145/2457465.2457469 *
ZHANG XIAO: "Image Indexing Based on Sparse Spectral Hashing" (基于稀疏谱哈希的图像索引), CHINA MASTER'S THESES FULL-TEXT DATABASE (中国优秀硕士学位论文全文数据库), no. 7, 15 July 2011 (2011-07-15), pages 138-506 *
OUYANG CHUANFEI: "Image Indexing Algorithm Based on Structured Sparse Spectral Hashing" (基于结构化稀疏谱哈希的图像索引算法), CHINA MASTER'S THESES FULL-TEXT DATABASE (中国优秀硕士学位论文全文数据库), no. 7, 15 July 2012 (2012-07-15), pages 138-2166 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462458A (en) * 2014-12-16 2015-03-25 芜湖乐锐思信息咨询有限公司 Data mining method of big data system
CN104462459A (en) * 2014-12-16 2015-03-25 芜湖乐锐思信息咨询有限公司 Neural network based big data analysis and processing system and method
CN104484566A (en) * 2014-12-16 2015-04-01 芜湖乐锐思信息咨询有限公司 Big data analysis system and big data analysis method
CN113377294A (en) * 2021-08-11 2021-09-10 武汉泰乐奇信息科技有限公司 Big data storage method and device based on binary data conversion
CN113377294B (en) * 2021-08-11 2021-10-22 武汉泰乐奇信息科技有限公司 Big data storage method and device based on binary data conversion

Also Published As

Publication number Publication date
CN103605653B (en) 2017-01-04

Similar Documents

Publication Publication Date Title
Xie et al. Generative pointnet: Deep energy-based learning on unordered point sets for 3d generation, reconstruction and classification
Zafar et al. A novel discriminating and relative global spatial image representation with applications in CBIR
Ding et al. Cross-modal hashing via rank-order preserving
Xie et al. Contextual query expansion for image retrieval
Zhang et al. Pointwise geometric and semantic learning network on 3D point clouds
EP3166020A1 (en) Method and apparatus for image classification based on dictionary learning
Angrish et al. MVCNN++: computer-aided design model shape classification and retrieval using multi-view convolutional neural networks
Ali et al. Modeling global geometric spatial information for rotation invariant classification of satellite images
Sadeghi-Tehran et al. Scalable database indexing and fast image retrieval based on deep learning and hierarchically nested structure applied to remote sensing and plant biology
CN104199842A (en) Similar image retrieval method based on local feature neighborhood information
Bu et al. Local deep feature learning framework for 3D shape
Luo et al. Asymmetric discrete cross-modal hashing
CN103473307A (en) Cross-media sparse Hash indexing method
Zhang et al. Scalable discrete matrix factorization and semantic autoencoder for cross-media retrieval
Hou et al. Hitpr: Hierarchical transformer for place recognition in point cloud
CN103605653A (en) Big data searching method based on sparse hash
Pont et al. Principal geodesic analysis of merge trees (and persistence diagrams)
CN105760875A (en) Binary image feature similarity discrimination method based on random forest algorithm
Wang et al. Improving deep learning on point cloud by maximizing mutual information across layers
Li et al. Sketch4Image: a novel framework for sketch-based image retrieval based on product quantization with coding residuals
Lv et al. Retrieval oriented deep feature learning with complementary supervision mining
Zhang et al. Modeling spatial and semantic cues for large-scale near-duplicated image retrieval
CN103324942A (en) Method, device and system for image classification
Li et al. ViT2CMH: Vision Transformer Cross-Modal Hashing for Fine-Grained Vision-Text Retrieval.
Zhao et al. MapReduce-based clustering for near-duplicate image identification

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170104