CN103605653A - Big data searching method based on sparse hash - Google Patents
- Publication number
- CN103605653A CN103605653A CN201310457033.4A CN201310457033A CN103605653A CN 103605653 A CN103605653 A CN 103605653A CN 201310457033 A CN201310457033 A CN 201310457033A CN 103605653 A CN103605653 A CN 103605653A
- Authority
- CN
- China
- Prior art keywords
- hash function
- dimension
- big data
- training set
- hash
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2255—Hash tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
- G06F16/24558—Binary matching operations
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a big-data similarity search method, in particular to a big-data retrieval method based on sparse hashing, intended mainly for application development built on big-data storage and retrieval. The method first determines the size of a training set by sampling, according to computer memory capacity and statistical theory; it then learns from the training set a hash function for encoding the big data, together with the binary codes of the training set, and encodes the big data in binary with the learned hash function. Online search can then be performed: for a test case, its binary code is first obtained from the learned hash function, and real-time search is then carried out over the binary codes of the big data. With this method, the time complexity of big-data retrieval is linear; the problem that manifold learning lacks an explicit mapping function is solved; big-data storage is reduced by a factor of several thousand; and the method is easy to implement, involving only simple mathematical models in the coding.
Description
Technical field
The present invention relates to the fields of computer science and information technology, specifically to big data, and in particular to a big-data retrieval method using sparse hashing for data such as pictures, text, and music.
Background art
Big data refers to data sets whose content cannot be retrieved and managed with conventional tools under acceptable conditions. Big data has four highly significant features: large volume, varied data types, low value density, and high processing speed requirements. Current research on knowledge discovery in big data concentrates mainly on four aspects: classification, clustering, retrieval, and incremental (batch, online, or parallel) learning.
At present there is relatively little research on the big-data retrieval problem. When retrieving, users usually want to obtain what they need quickly from all the data, which raises a trade-off between speed and accuracy. Twenty, or even ten, years ago researchers pursued accuracy. Various tree structures, such as the KD-tree and the M-tree, were therefore designed for exact database retrieval and found wide application. Over the last decade, with the growing ubiquity of networks and the emergence of big data, exact retrieval can no longer meet users' needs. A large body of literature shows that when the data dimension is below 10, exact retrieval meets users' actual needs well; once the dimension exceeds this threshold, however, the complexity of exact retrieval becomes very high, in the worst case traversing the entire database, which is clearly infeasible in practice.
In recent years, approximate retrieval has developed significantly, especially network retrieval, where users pursue fast, approximate multimedia search. Among the many approximate retrieval methods, hashing is the most outstanding. The principle of hashing is to reduce high-dimensional real-valued data to low-dimensional binary data while preserving the similarity between data points, and then to keep the big data set in computer memory or on external disk as far as possible, thereby achieving fast retrieval.
Summary of the invention
The present invention studies the approximate retrieval problem for big data.
The object of the present invention is to provide a simple and effective approximate retrieval algorithm for big data. The method addresses the high complexity, low accuracy, and related problems of big-data retrieval. It improves hashing performance by preserving the manifold structure of the data, which guarantees that the binary codes retain as much of the local structure of the original high-dimensional data as possible, and it reduces the algorithm complexity to linear through an effective optimization method. The present invention comprises two key processes: hash-function learning and real-time big-data retrieval. Hash-function learning itself comprises two sub-processes: converting high-dimensional real-valued data into low-dimensional real values, and converting the low-dimensional real values into binary codes of equal dimension. Real-time retrieval first converts an example into binary according to the learned hash functions and then searches in computer memory.
The concrete steps of this method are as follows:
(1) Sample data from the big data set as the training set for learning the hash functions. Because the volume of big data is enormous, statistical theory shows that it is unnecessary to use all the data as the training set; the present invention therefore first samples part of the data. The size n of the extracted training set is determined from t_{α/2}, the critical value of the t-distribution at the chosen confidence level, and ε, the maximum permissible error. The various parameter settings are given in the table below.
This yields the training set X.
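The sampling step above can be sketched numerically. The patent's exact sample-size formula appears only as an image, so the classical t-critical-value form n = ceil((t_{α/2}·σ/ε)²) is assumed here, with σ a spread estimate; the function name and parameters are illustrative, not the patent's.

```python
import math

def sample_size(critical_value: float, sigma: float, max_error: float) -> int:
    """Classical sample-size rule n = ceil((t * sigma / eps)^2).

    Assumed form: the patent defines n via the t-distribution critical value
    t_{alpha/2} and the maximum permissible error eps, but gives the exact
    formula only as an image; sigma is a spread estimate we introduce here.
    """
    return math.ceil((critical_value * sigma / max_error) ** 2)

# e.g. 95% confidence (normal approximation, t ~ 1.96), sigma = 10,
# maximum permissible error eps = 0.5:
n = sample_size(1.96, 10.0, 0.5)  # 1537 examples
```

With looser error tolerances the required training set shrinks quadratically, which is what makes training on a small sample of the big data set feasible.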
(2) Train the hash functions with X. First, an objective function is designed to convert the high-dimensional real-valued data into low-dimensional data. The objective function is defined as:

min_{B,S} ||X - BS||^2 + λ1 Σ_{i,j} w_{i,j} ||s_i - s_j||^2 + λ2 ||S||_1, subject to S ≥ 0,

where X is the training set; B is the basis space, each vector of B being a basis vector trained from the training set X; S is the low-dimensional real-valued projection of X onto the basis space B; λ1 and λ2 are tunable parameters obtained by ten-fold cross-validation; w_{i,j} is the projection, under a Gaussian kernel, of the Euclidean distance between two examples x_i and x_j of X; s_i and s_j are two vectors of the matrix S; B_{i,j} is the element in row i and column j of the matrix B; i = 1, ..., n indexes the examples and j = 1, ..., k indexes the basis vectors, with n the number of examples and k the number of basis vectors; and S ≥ 0 means that every element of S is non-negative.
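The objective combines reconstruction, manifold preservation, sparsity, and non-negativity. A numpy sketch of evaluating it follows; the Frobenius reconstruction term and the ℓ1 sparsity penalty are assumed from the patent's description of the terms (its formula appears only as an image), and all names are illustrative.

```python
import numpy as np

def sparse_hash_objective(X, B, S, W, lam1, lam2):
    """Value of the training objective, with term forms assumed from the
    patent's description of its four parts:
      ||X - B S||_F^2              reconstruction of X in the basis space B
      sum_ij W_ij ||s_i - s_j||^2  local-manifold preservation
      lam2 * ||S||_1               sparsity of S (an l1 penalty is assumed)
    The non-negativity constraint S >= 0 is enforced by the optimizer, not here.
    """
    recon = np.linalg.norm(X - B @ S, "fro") ** 2
    n = S.shape[1]  # columns of S are the low-dimensional codes s_i
    manifold = sum(
        W[i, j] * np.linalg.norm(S[:, i] - S[:, j]) ** 2
        for i in range(n)
        for j in range(n)
    )
    return recon + lam1 * manifold + lam2 * np.abs(S).sum()
```

An alternating-minimization scheme over B and S would drive this value down; the sketch only evaluates it, which is also what a ten-fold cross-validation loop over λ1 and λ2 needs.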
The first term, ||X - BS||^2, reconstructs the training set X in the basis space B to obtain S with minimal reconstruction error. The second term, Σ_{i,j} w_{i,j} ||s_i - s_j||^2, preserves the local manifold structure of the original training set X; this guarantees that the binary data retains the similarity of the original high-dimensional data, and therefore guarantees hashing performance. The third term guarantees that the obtained S is sparse, and the fourth constraint guarantees that the obtained S is non-negative. Under this objective function, the resulting S is the low-dimensional representation of X.
The second step of training the hash functions converts S into binary codes: each non-zero element of S is converted to 1, and each zero element to 0.
The third step of training obtains the hash functions. Suppose the dimension of S is d and the dimension of X is D, with D >> d; the binary code length is then d. Each of the d dimensions is treated as a binary vector (i.e., a two-class classification problem), and the present invention builds one hash function per dimension, d hash functions in total. Building them is simple: the examples of the training set X whose hash value is 1 form class A_{m1}, m = 1, ..., d, and the remaining examples, whose hash value is 0, form class A_{m0}, m = 1, ..., d, giving 2d classes in total; the hash function is then defined accordingly. In the formula, X_i is the i-th vector of the matrix X, S_i is the i-th vector of the matrix S, and i = 1, ..., n.
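As an illustration of the binarization rule and the 2d-class construction just described (function and variable names are ours, not the patent's), a numpy sketch:

```python
import numpy as np

def binarize(S):
    """Patent binarization: non-zero entries of S map to bit 1, zeros to 0
    (matching the worked example (0.4, 0, 0.1, 0.7) -> (1, 0, 1, 1))."""
    return (S != 0).astype(np.uint8)

def split_classes(X, codes):
    """For each of the d code dimensions m, split the training examples
    (columns of X) into class A_m1 (bit m is 1) and A_m0 (bit m is 0),
    yielding the 2d classes from which the d hash functions are built."""
    d = codes.shape[0]
    return [(X[:, codes[m] == 1], X[:, codes[m] == 0]) for m in range(d)]
```

Each (A_m1, A_m0) pair is a two-class problem, so any binary classifier trained on it yields an explicit hash function for dimension m.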
(3) For the examples in the big data set that have not yet obtained binary codes, the binary coding process is: for each example x, compute s = (B'B + 2I)^(-1) B'x to obtain the low-dimensional real value of x, then obtain its low-dimensional binary code through the hash functions; here B is the basis space defined in the preceding step and I is the identity matrix of the same dimension as B. In this way the whole big data set is encoded, so that the big data can be stored in computer memory or on external disk.
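The closed-form projection s = (B'B + 2I)^(-1) B'x is a small linear solve. A sketch follows; the float tolerance standing in for "exactly zero" in the binarization is our implementation choice, not specified by the patent.

```python
import numpy as np

def encode_example(B, x):
    """Out-of-sample coding from step (3): s = (B'B + 2I)^(-1) B' x,
    followed by the non-zero -> 1 binarization. A small tolerance stands
    in for exact zero, since the solve returns floats (our choice)."""
    k = B.shape[1]  # number of basis vectors, i.e. code length d
    s = np.linalg.solve(B.T @ B + 2.0 * np.eye(k), B.T @ x)
    bits = (np.abs(s) > 1e-12).astype(np.uint8)
    return s, bits
```

Using `solve` rather than forming the explicit inverse is numerically preferable; the +2I term keeps the system well conditioned even when B'B is singular.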
(4) For a new test case x_t, compute s_t = (B'B + 2I)^(-1) B'x_t to obtain the low-dimensional real value of x_t, then obtain its low-dimensional binary code through the hash functions; here B is the basis space defined above and I is the identity matrix of the same dimension as B. Finally, perform similarity search between the binary code of the test case and the binary codes of the big data to obtain its similar examples.
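The final online step is a similarity search in Hamming space. A minimal linear-scan sketch follows; packing each code into a Python int so that XOR plus popcount gives the Hamming distance is our illustrative layout, as the patent does not specify one.

```python
def hamming_search(query_code, db_codes, top=5):
    """Linear-scan nearest-neighbor search in Hamming space.
    Each code is a Python int holding the packed bits; XOR then popcount
    gives the Hamming distance. Returns indices of the `top` nearest codes."""
    ranked = sorted(
        range(len(db_codes)),
        key=lambda i: bin(query_code ^ db_codes[i]).count("1"),
    )
    return ranked[:top]
```

Because the codes fit in memory and each comparison is a couple of machine-word operations, the scan is linear in the database size, consistent with the linear retrieval complexity claimed above.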
Step (2) of the present invention is crucial: it guarantees the efficiency and the effectiveness of the algorithm. Its complexity is roughly cubic in the dimension D. In big-data applications, the dimension D is far smaller than the number of examples, so the complexity of the algorithm is linear in the number of examples. Moreover, because step (2) preserves the manifold structure of the data, the effectiveness of the algorithm is guaranteed; and because the generated low-dimensional real values are non-negative, the results are easy to interpret.
The sparse-hashing big-data retrieval model of the present invention is characterized in that: sparse coding and sampling are used to reduce the algorithm complexity; manifold-learning theory is used to generate the hash functions and improve hashing performance; explicit hash functions are generated, avoiding the implicit hash functions of manifold learning; the binarization principle yields interpretable hash results; and the storage requirement of big data is significantly reduced.
Sampling the big data: performing data-mining learning on an entire big data set is usually very difficult, and even when feasible the complexity is very high. The sampling method makes learning on big data feasible and reduces the complexity to linear, which is exactly the result expected of big-data mining.
Manifold-embedded hash learning model: manifold theory has proven to be a very effective method for preserving local structure, which is particularly important for building hash models. The present invention adds a manifold regularization term to the hash-learning process. The primary purpose is to preserve the manifold structure of the data set and thereby guarantee high hashing performance; secondly, a novel optimization method obtains an explicit expression of the hash function, solving the long-standing difficulty that manifold learning has no explicit mapping function.
Interpretability of the binarization: when converting the low-dimensional real values to binary, the non-negative representation and the novel binarization make the resulting binary representation interpretable and similarity-preserving. This differs from the binarization methods of existing hashing approaches.
Low complexity: thanks to the efficient optimization method and the sampling method, the complexity of learning the hash function is independent of the number of big-data examples, and the worst-case complexity is linear.
Low storage: because the innovative binary codes replace the storage of the original data, big-data storage saves space by a factor of up to tens of thousands.
Brief description of the drawings
Fig. 1 shows the dimensionality-reduction result for a test case;
Fig. 2 shows the binary code of the picture in Fig. 1.
Embodiment
70,000 animal pictures were randomly collected from the network. Suppose each picture needs 1 MB of storage (note that these are not high-fidelity pictures); the whole data set then needs 70 GB of storage. The present invention replaces each picture with a 4-bit binary code, so the whole set needs only about 3.5 KB of storage, a saving of nearly 20,000 times compared with the original.
(1) Because an ordinary computer with 4 GB of memory can run the present algorithm on 100,000 examples, no sampling is needed for this data set; the hash functions are trained directly on all 70,000 examples, and each example is finally represented by 4 binary bits.
(2) For each test case, the present invention first obtains its low-dimensional real-valued representation, e.g. (0.4, 0, 0.1, 0.7) (see Fig. 1).
This representation: 1) reduces the original 784-dimensional picture description to 4 dimensions; 2) preserves its local structure, so that neighbors in the original space remain neighbors in the low-dimensional space; and 3) is non-negative, which gives the method clear semantics, i.e., interpretability. According to the figure, the monkey picture can be regarded as reconstructed from four bases, the weight of each base being its coordinate in the representation, i.e., (0.4, 0, 0.1, 0.7). Since the weight of the second dimension is 0, the picture is not composed from the second base. According to the binarization principle of the present invention, the binary code of this picture is (1, 0, 1, 1) (see Fig. 2).
(3) From this binary code the present invention likewise concludes that the picture is not composed from the second base; the coding process is therefore explainable. It is also easy to show that the method preserves similarity. For example, two four-dimensional pictures (0.51, 0.51, 0.51, 0.51) and (0.49, 0.49, 0.49, 0.49) are both encoded by the present invention as (1, 1, 1, 1). Their Euclidean distance in the real-valued space shows they are similar, and the codes obtained by the present invention are likewise similar. A common hashing scheme, however, would encode these two pictures as (1, 1, 1, 1) and (0, 0, 0, 0): the similarity of the original space is not preserved in the binary (Hamming) space. This shows that the similarity preservation of the present invention is effective.
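The contrast drawn in this paragraph can be checked directly. A small sketch comparing the patent's non-zero binarization with a hypothetical 0.5-threshold baseline (the threshold value is our assumption for the "common hashing scheme"):

```python
def code_nonzero(v):
    """Patent's rule: bit 1 iff the component is non-zero."""
    return tuple(1 if x != 0 else 0 for x in v)

def code_threshold(v, t=0.5):
    """A common baseline: bit 1 iff the component exceeds threshold t."""
    return tuple(1 if x > t else 0 for x in v)

a = (0.51, 0.51, 0.51, 0.51)
b = (0.49, 0.49, 0.49, 0.49)
# Non-zero coding keeps the two very similar vectors identical in Hamming
# space, while thresholding at 0.5 maps them to opposite hypercube corners.
```

The two similar vectors straddle the 0.5 threshold in every coordinate, which is exactly the failure mode the patent's binarization avoids.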
Claims (7)
1. A big-data retrieval method based on sparse hashing, comprising the steps of:
(1) sampling data from the big data as the training set X;
(2) training the hash functions with X;
(3) binary-coding the examples of the big data set that have not yet obtained binary codes, and storing the coded big data in computer memory or on external disk;
(4) for a new test case, first obtaining its low-dimensional real value and then its low-dimensional binary code, and finally performing similarity search between the binary code of the test case and the binary codes of the big data to obtain its similar examples.
3. The method according to claim 1, wherein step (2) comprises the following processes:
A) establishing the objective function
min_{B,S} ||X - BS||^2 + λ1 Σ_{i,j} w_{i,j} ||s_i - s_j||^2 + λ2 ||S||_1, subject to S ≥ 0,
where X is the training set; B is the basis space, each vector of B being a basis vector trained from the training set X; S is the low-dimensional real-valued projection of X onto the basis space B; λ1 and λ2 are tunable parameters obtained by ten-fold cross-validation; w_{i,j} is the projection, under a Gaussian kernel, of the Euclidean distance between two examples x_i and x_j of X; s_i and s_j are two vectors of the matrix S; B_{i,j} is the element in row i and column j of the matrix B; i = 1, ..., n indexes the examples and j = 1, ..., k indexes the basis vectors, with n the number of examples and k the number of basis vectors; and S ≥ 0 means that every element of S is non-negative;
B) converting S into binary codes;
C) building the hash functions.
4. The method according to claim 3, wherein in process B) each non-zero element of S is converted to 1 and each zero element to 0.
5. The method according to claim 3, wherein process C) builds the hash functions as follows: the examples of the training set X whose hash value is 1 form class A_{m1}, m = 1, ..., d, and the remaining examples, whose hash value is 0, form class A_{m0}, m = 1, ..., d, giving 2d classes, and the hash function is defined accordingly: if the dimension of S is d and the dimension of X is D with D >> d, each of the d dimensions is a binary vector, and one hash function is built for each of the d dimensions, d hash functions in total; in the formula, X_i is the i-th vector of the matrix X, S_i is the i-th vector of the matrix S, and i = 1, ..., n.
6. The method according to claim 1, wherein in step (3), for each example x of the big data, s = (B'B + 2I)^(-1) B'x gives the low-dimensional real value of x, and its low-dimensional binary code is then obtained through the hash functions; B is the basis space defined in the preceding step and I is the identity matrix of the same dimension as B.
7. The method according to claim 1, wherein in step (4), for each example x_t of the test data set, s_t = (B'B + 2I)^(-1) B'x_t gives the low-dimensional real value of x_t, and its low-dimensional binary code is then obtained through the hash functions; B is the basis space defined above and I is the identity matrix of the same dimension as B.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310457033.4A CN103605653B (en) | 2013-09-29 | 2013-09-29 | Big data retrieval method based on sparse hash |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310457033.4A CN103605653B (en) | 2013-09-29 | 2013-09-29 | Big data retrieval method based on sparse hash |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103605653A true CN103605653A (en) | 2014-02-26 |
CN103605653B CN103605653B (en) | 2017-01-04 |
Family
ID=50123878
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310457033.4A Expired - Fee Related CN103605653B (en) | 2013-09-29 | 2013-09-29 | Big data retrieval method based on sparse hash |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103605653B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102402617A (en) * | 2011-12-23 | 2012-04-04 | 天津神舟通用数据技术有限公司 | Easily compressed database index storage system using fragments and sparse bitmap, and corresponding construction, scheduling and query processing methods |
- 2013-09-29: application CN201310457033.4A filed in China; granted patent CN103605653B now not active (expired due to fee non-payment)
Non-Patent Citations (3)
Title |
---|
XIAOFENG ZHU,ETC.: "Sparse Hashing for Fast Multimedia Search", 《ACM TRANSACTIONS ON INFORMATION SYSTEMS》, vol. 31, no. 2, 31 May 2013 (2013-05-31), XP058018274, DOI: http://dx.doi.org/10.1145/2457465.2457469 * |
张啸: "基于稀疏谱哈希的图像索引", 《中国优秀硕士学位论文全文数据库》, no. 7, 15 July 2011 (2011-07-15), pages 138 - 506 * |
欧阳遄飞: "基于结构化稀疏谱哈希的图像索引算法", 《中国优秀硕士学位论文全文数据库》, no. 7, 15 July 2012 (2012-07-15), pages 138 - 2166 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104462458A (en) * | 2014-12-16 | 2015-03-25 | 芜湖乐锐思信息咨询有限公司 | Data mining method of big data system |
CN104462459A (en) * | 2014-12-16 | 2015-03-25 | 芜湖乐锐思信息咨询有限公司 | Neural network based big data analysis and processing system and method |
CN104484566A (en) * | 2014-12-16 | 2015-04-01 | 芜湖乐锐思信息咨询有限公司 | Big data analysis system and big data analysis method |
CN113377294A (en) * | 2021-08-11 | 2021-09-10 | 武汉泰乐奇信息科技有限公司 | Big data storage method and device based on binary data conversion |
CN113377294B (en) * | 2021-08-11 | 2021-10-22 | 武汉泰乐奇信息科技有限公司 | Big data storage method and device based on binary data conversion |
Also Published As
Publication number | Publication date |
---|---|
CN103605653B (en) | 2017-01-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Xie et al. | Generative pointnet: Deep energy-based learning on unordered point sets for 3d generation, reconstruction and classification | |
Zafar et al. | A novel discriminating and relative global spatial image representation with applications in CBIR | |
Ding et al. | Cross-modal hashing via rank-order preserving | |
Xie et al. | Contextual query expansion for image retrieval | |
Zhang et al. | Pointwise geometric and semantic learning network on 3D point clouds | |
EP3166020A1 (en) | Method and apparatus for image classification based on dictionary learning | |
Angrish et al. | MVCNN++: computer-aided design model shape classification and retrieval using multi-view convolutional neural networks | |
Ali et al. | Modeling global geometric spatial information for rotation invariant classification of satellite images | |
Sadeghi-Tehran et al. | Scalable database indexing and fast image retrieval based on deep learning and hierarchically nested structure applied to remote sensing and plant biology | |
CN104199842A (en) | Similar image retrieval method based on local feature neighborhood information | |
Bu et al. | Local deep feature learning framework for 3D shape | |
Luo et al. | Asymmetric discrete cross-modal hashing | |
CN103473307A (en) | Cross-media sparse Hash indexing method | |
Zhang et al. | Scalable discrete matrix factorization and semantic autoencoder for cross-media retrieval | |
Hou et al. | Hitpr: Hierarchical transformer for place recognition in point cloud | |
CN103605653A (en) | Big data searching method based on sparse hash | |
Pont et al. | Principal geodesic analysis of merge trees (and persistence diagrams) | |
CN105760875A (en) | Binary image feature similarity discrimination method based on random forest algorithm | |
Wang et al. | Improving deep learning on point cloud by maximizing mutual information across layers | |
Li et al. | Sketch4Image: a novel framework for sketch-based image retrieval based on product quantization with coding residuals | |
Lv et al. | Retrieval oriented deep feature learning with complementary supervision mining | |
Zhang et al. | Modeling spatial and semantic cues for large-scale near-duplicated image retrieval | |
CN103324942A (en) | Method, device and system for image classification | |
Li et al. | ViT2CMH: Vision Transformer Cross-Modal Hashing for Fine-Grained Vision-Text Retrieval. | |
Zhao et al. | MapReduce-based clustering for near-duplicate image identification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20170104 |