CN103605653B - Big data retrieval method based on sparse hash - Google Patents
Big data retrieval method based on sparse hash
- Publication number
- CN103605653B CN103605653B CN201310457033.4A CN201310457033A CN103605653B CN 103605653 B CN103605653 B CN 103605653B CN 201310457033 A CN201310457033 A CN 201310457033A CN 103605653 B CN103605653 B CN 103605653B
- Authority
- CN
- China
- Prior art keywords
- big data
- hash function
- dimensional
- hash
- training set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Links
- 238000000034 method Methods 0.000 title claims abstract description 35
- 230000006870 function Effects 0.000 claims abstract description 36
- 238000012549 training Methods 0.000 claims abstract description 24
- 238000012360 testing method Methods 0.000 claims abstract description 8
- 239000013598 vector Substances 0.000 claims description 18
- 239000011159 matrix material Substances 0.000 claims description 12
- 238000002790 cross-validation Methods 0.000 claims description 2
- 238000005516 engineering process Methods 0.000 description 3
- 238000005457 optimization Methods 0.000 description 3
- 238000005070 sampling Methods 0.000 description 3
- 238000013459 approach Methods 0.000 description 2
- 238000007418 data mining Methods 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 238000005284 basis set Methods 0.000 description 1
- 238000013500 data storage Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2255—Hash tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
- G06F16/24558—Binary matching operations
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention is a big-data approximate retrieval method, specifically a big-data retrieval method based on sparse hashing, developed for the storage and retrieval of big data. First, the size of the training set is determined by sampling, according to statistical theory and the available computer memory. The training set is then used to learn the hash functions and the binary codes of the training examples. Next, the learned hash functions are applied to binary-code the entire data set. Retrieval can then be performed online: for a test example, its binary code is first computed with the learned hash functions and then matched against the binary codes of the big data in real time. The time complexity of retrieval is linear in the data size; the method solves the out-of-sample problem of manifold learning methods that lack an explicit function; and it reduces the storage required by the big data by up to tens of thousands of times. It is easy to implement, involving only a few simple mathematical models.
Description
Technical field
The present invention relates to the fields of computer science and technology and information technology, specifically to big data, and in particular to a method that uses sparse hashing to retrieve big data such as pictures, text, and music.
Background technology
Big data refers to data sets whose content cannot be retrieved and managed with conventional tools under present conditions. Large volume, diverse data types, low value density, and high processing speed are the four most prominent characteristics of big data. Current research on knowledge discovery in big data concentrates on four aspects: classification, clustering, retrieval, and incremental (batch, online, or parallel) learning.
Comparatively little research has addressed the problem of big-data retrieval. When retrieving, users generally hope to obtain what they need quickly from all of the data, which raises a trade-off between speed and accuracy. Ten or even twenty years ago, researchers pursued accuracy: various tree structures such as the KD-tree and the M-tree were designed for exact database retrieval and found very wide application. In the last decade, with the growing popularity of networks and the emergence of big data, exact retrieval can no longer meet users' needs. A large body of literature shows that when the dimensionality of the data is below about 10, exact retrieval satisfies users' practical needs; but once the dimensionality exceeds this threshold, the complexity of exact retrieval becomes very high, in the worst case reaching a traversal of the entire database, which is infeasible.
In recent years, approximate retrieval has developed significantly, particularly in network retrieval, where users pursue fast, approximate multimedia retrieval. Among the many approximate retrieval methods, hashing stands out. Its principle is to reduce high-dimensional real-valued data to low-dimensional binary data while preserving the similarity between data points as far as possible; the large data set is then kept in computer memory or on external disk, achieving the goal of fast retrieval.
Summary of the invention
The present invention studies the problem of approximate retrieval of big data.
The object of the invention is to provide a simple and effective approximate retrieval algorithm for big data. The method addresses the high complexity and low accuracy of big-data retrieval. By preserving the manifold structure of the data, the method ensures that the binary codes retain the local structure of the original high-dimensional data as much as possible, which improves hashing performance, and an effective optimization method reduces the algorithmic complexity to linear. The invention comprises two key processes: hash function learning and real-time retrieval of big data. Hash function learning consists of two sub-processes, converting high-dimensional real-valued data to low-dimensional real values and converting the low-dimensional real values to low-dimensional binary codes. Real-time retrieval first converts an example to a binary code with the learned hash functions and then retrieves in computer memory.
The specific steps of the method are as follows:
(1) Sample data from the big data as a training set for learning the hash functions. Since the volume of big data is enormous, statistical theory shows that not all of the data is needed for training, so the invention first samples part of the data as the training set. The sampled training set size n is determined by a t-distribution sample-size formula in terms of t_{α/2} and ε, where t_{α/2} is the critical value of the t-distribution at the chosen confidence level and ε is the maximum allowable error. The parameter settings are listed in the table.
This yields the training set X.
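As a sketch of step (1): the patent's exact sample-size formula is not reproduced in the text, so the standard t-based bound n = (t_{α/2}·σ/ε)² is assumed here; the function names, the unit standard deviation, and the fixed critical value 1.96 are all illustrative choices.

```python
import math
import random

def sample_size(t_crit, eps, sigma=1.0):
    # Assumed form of the bound: n = (t_{alpha/2} * sigma / eps)^2, where
    # t_crit is the t-distribution critical value at the chosen confidence
    # level and eps is the maximum allowable error.
    return math.ceil((t_crit * sigma / eps) ** 2)

def draw_training_set(data, n, seed=0):
    # Sample n examples without replacement to form the training set X.
    rng = random.Random(seed)
    return rng.sample(data, min(n, len(data)))

n = sample_size(t_crit=1.96, eps=0.05)  # 95% confidence, 5% allowable error
X = draw_training_set(list(range(100000)), n)
print(n, len(X))  # 1537 1537
```

With these settings the training set stays fixed at about 1,500 examples regardless of how large the full data set is, which is what makes learning on a sampled subset feasible.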
(2) Train the hash functions with X. First, an objective function is designed that maps high-dimensional real data to low-dimensional data. The objective function is defined as:
min_{B,S} ||X − BS||² + λ1 Σ_{i,j} w_{i,j} ||s_i − s_j||² + λ2 ||S||_1, subject to S ≥ 0,
where X is the training set; B is the basis space, each vector of B being a basis vector learned from the training set X; and S is the low-dimensional real-valued projection of X onto the basis space B. λ1 and λ2 are tunable parameters obtained by ten-fold cross-validation; w_{i,j} is the Gaussian-kernel projection of the Euclidean distance between two examples x_i and x_j in X; s_i and s_j are two vectors of the matrix S; B_{i,j} is the element in row i and column j of the matrix B; i = 1, …, n indexes the examples and j = 1, …, k indexes the basis vectors, where n is the number of examples and k the number of basis vectors; and S ≥ 0 means that every element of S is non-negative.
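The objective above can be evaluated numerically. The sketch below assumes the columns of S are the codes s_i and that W is the symmetric Gaussian-kernel affinity matrix, and it uses the standard graph-Laplacian identity Σ_{i,j} w_{i,j}||s_i − s_j||² = 2·tr(S L Sᵀ) with L = D − W; it only evaluates the objective, it is not the patent's optimization procedure.

```python
import numpy as np

def gaussian_affinity(X, sigma=1.0):
    # w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)); columns of X are examples
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def objective(X, B, S, W, lam1, lam2):
    recon = np.linalg.norm(X - B @ S, "fro") ** 2   # ||X - BS||^2
    L = np.diag(W.sum(axis=1)) - W                  # graph Laplacian of W
    manifold = lam1 * 2.0 * np.trace(S @ L @ S.T)   # sum_ij w_ij ||s_i - s_j||^2
    sparsity = lam2 * np.abs(S).sum()               # ||S||_1
    return recon + manifold + sparsity              # S >= 0 is enforced separately
```

The Laplacian form avoids the explicit double loop over all example pairs, which matters once n grows.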
The first term ||X − BS||² reconstructs the training set X in the basis space B to obtain S, and its reconstruction error should be as small as possible. The second term Σ_{i,j} w_{i,j} ||s_i − s_j||² preserves the local manifold structure of the original training set X, which ensures that the binary data retain the similarity of the original high-dimensional data and hence the hashing performance. The third term ensures that the obtained S is sparse, and the fourth term, the constraint S ≥ 0, ensures that S is non-negative. Under this objective, the obtained S is a low-dimensional representation of X.
The second step of training the hash functions converts S into binary codes: each non-zero element of S is converted to 1, and each zero element to 0. The third step of training obtains the hash functions themselves. Suppose the dimensionality of S is d and that of X is D, with D >> d, so the binary codes have length d. Each of the d dimensions is treated as a binary vector (a two-class problem in classification), and the invention builds one hash function for each dimension, d hash functions in total. Building a hash function is simple: the examples in the training set X whose hash value in dimension m is 1 form the class A_{m1}, m = 1, …, d, and the remaining examples, whose hash value is 0, form the class A_{m0}, m = 1, …, d, yielding 2d classes. The hash function is defined as:
With the dimensionality of S being d and that of X being D, D >> d, each of the d dimensions is a binary vector, and one hash function is established for each of the d dimensions, d hash functions in total.
In the formula, X_i is the i-th vector of the matrix X and S_i is the i-th vector of the matrix S, i = 1, …, n.
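A minimal sketch of the binarization and the per-bit class partition described above (nonzero → 1; examples whose bit m is 1 form A_{m1}, the rest A_{m0}); the tolerance value used to decide "nonzero" is an assumption.

```python
import numpy as np

def binarize(S, tol=1e-8):
    # Non-zero entries of the sparse code become 1, zero entries become 0.
    return (np.abs(S) > tol).astype(np.uint8)

def partition_classes(bits):
    # bits: (d, n) binary codes; for each dimension m, return the index sets
    # A_m1 (bit = 1) and A_m0 (bit = 0) used to fit one hash function per bit.
    d = bits.shape[0]
    return [(np.flatnonzero(bits[m] == 1), np.flatnonzero(bits[m] == 0))
            for m in range(d)]

# Two example codes as columns: (0.4, 0, 0.1, 0.7) and (0, 0.2, 0, 0.3)
S = np.array([[0.4, 0.0], [0.0, 0.2], [0.1, 0.0], [0.7, 0.3]])
print(binarize(S).T)  # [[1 0 1 1]
                      #  [0 1 0 1]]
```

The first column reproduces the worked example later in the description: the code (0.4, 0, 0.1, 0.7) binarizes to (1, 0, 1, 1).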
(3) The examples in the large data set that have not yet received binary codes are encoded as follows: for each example x, s = (B'B + 2I)^{-1} B'x gives the low-dimensional real value of x, and the hash functions then give its low-dimensional binary code; here B is the basis space defined in the previous step and I is an identity matrix of matching dimension. In this way the whole data set is encoded, so that the big data can be stored in computer memory or on external disk.
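Step (3) can be sketched as follows, applying s = (B'B + 2I)^{-1} B'x to every column of the data matrix at once; the basis B would come from the training of step (2), and the random B below is only a stand-in for illustration.

```python
import numpy as np

def encode(Xall, B, tol=1e-8):
    # s = (B'B + 2I)^{-1} B' x for each column x of Xall, then binarize:
    # nonzero -> 1, zero -> 0.
    k = B.shape[1]
    S = np.linalg.solve(B.T @ B + 2.0 * np.eye(k), B.T @ Xall)
    return (np.abs(S) > tol).astype(np.uint8)

rng = np.random.default_rng(0)
B = rng.standard_normal((784, 4))      # stand-in basis: D = 784, d = 4
Xall = rng.standard_normal((784, 10))  # ten examples to encode
codes = encode(Xall, B)
print(codes.shape)  # (4, 10)
```

Using `np.linalg.solve` rather than an explicit matrix inverse is the usual numerically stable choice; the (B'B + 2I) factor is small (k × k), so encoding is cheap even for large data sets.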
(4) For a new test example x_t, s_t = (B'B + 2I)^{-1} B'x_t gives the low-dimensional real value of x_t, and the hash functions then give its low-dimensional binary code; here B is the basis space defined in the previous step and I is an identity matrix of matching dimension. Finally, the binary code of the test example is matched against the binary codes of the big data by similarity search to obtain its similar examples.
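The final similarity search of step (4) reduces to nearest neighbours in Hamming distance over the stored codes; a minimal sketch (the top-k parameter and database contents are illustrative):

```python
import numpy as np

def hamming_search(query_bits, db_bits, topk=3):
    # query_bits: (d,) in {0,1}; db_bits: (N, d); returns the indices of the
    # topk stored codes closest to the query in Hamming distance, plus the
    # distances themselves.
    dist = np.count_nonzero(db_bits != query_bits, axis=1)
    order = np.argsort(dist, kind="stable")[:topk]
    return order, dist[order]

db = np.array([[1, 0, 1, 1],
               [0, 0, 0, 0],
               [1, 0, 1, 0],
               [1, 1, 1, 1]], dtype=np.uint8)
idx, d = hamming_search(np.array([1, 0, 1, 1], dtype=np.uint8), db)
print(idx, d)  # [0 2 3] [0 1 1]
```

Because the codes are short binary vectors, this scan is linear in the number of stored examples and fits the in-memory retrieval the patent describes.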
Step (2) of the invention is the key to the efficiency and effectiveness of the algorithm. Its complexity is roughly cubic in the dimensionality D. In big-data applications the dimensionality D is much smaller than the number of examples, so the complexity of the algorithm is linear in the number of examples. Because step (2) preserves the manifold structure of the data, the effectiveness of the algorithm is guaranteed; and because the generated low-dimensional real values are non-negative, the results are easy to interpret.
The sparse-hashing big-data retrieval model of the invention is characterized by: using a sparse algorithm and sampling to reduce the complexity of the algorithm; using manifold learning theory to generate the hash functions and improve hashing performance; generating explicit hash functions, avoiding the implicit hash functions of manifold learning; a binarization principle that makes the hash results interpretable; and a large reduction in the storage required by big data.
Sampling big data: carrying out data mining on an entire big data set is generally extremely difficult, and even when feasible its complexity is very high. Sampling makes operating on big data feasible and reduces the complexity to linear, which is exactly what big-data mining requires.
Manifold-embedded hash learning model: manifold theory has proven to be a very effective way of preserving local structure, which is particularly important for building hash models. The invention adds a manifold regularization term when learning the hash functions. Its primary purpose is to preserve the manifold structure of the data set and guarantee high hashing performance; in addition, a novel optimization method yields an explicit expression for the hash functions, solving the difficulty that conventional manifold learning has no explicit expression.
Interpretability of binarization: when converting the low-dimensional real values to low-dimensional binary codes, the use of a non-negative representation and a novel binary conversion makes the resulting binary representation interpretable and keeps the similarity preserved. This differs from the binarization used in existing hashing methods.
Low complexity: thanks to the efficient optimization method and sampling, the complexity of learning the hash functions is independent of the number of examples in the big data, and the worst-case complexity is linear.
Low storage: replacing the storage of the raw data with binary codes saves up to tens of thousands of times the storage space of the big data.
Brief description of the drawings
Fig. 1 is the dimensionality-reduction result of a test example;
Fig. 2 is the binary code of the picture of Fig. 1.
Detailed description of the invention
70,000 animal pictures are randomly collected from the network. Assuming each picture needs 1 MB of storage (note that such pictures are not of very high fidelity), the whole data set needs 70 GB of storage. The invention replaces each picture with a 4-bit binary code, so the whole set needs only about 35 KB of storage, a saving of roughly two million times over the original storage.
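A quick arithmetic check of the storage figures, assuming 1 MB per raw picture and a 4-bit code per picture:

```python
n_pictures = 70_000
raw_bytes = n_pictures * 1_000_000  # ~1 MB per picture -> 70 GB in total
coded_bytes = n_pictures * 4 / 8    # 4 bits per picture -> 35 KB in total
print(raw_bytes, coded_bytes, raw_bytes / coded_bytes)
# 70000000000 35000.0 2000000.0
```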
(1) Since an ordinary computer with 4 GB of memory can process 100,000 examples with the algorithm of the invention, no sampling is needed for this data set: the hash functions are trained directly on the 70,000 pictures, and each example is finally represented by 4 binary bits.
(2) For each test example, the invention first obtains its low-dimensional real-valued representation: 0.4, 0, 0.1, 0.7 (see Fig. 1).
This representation: 1) reduces the original 784-dimensional description of the picture to 4 dimensions; 2) preserves its local structure, i.e. its neighbours in the original space are also its neighbours in the low-dimensional space; and 3) is non-negative, which gives the method clear semantics, i.e. interpretability. According to Fig. 1, the invention regards the picture of a monkey as reconstructed from four bases, the weight of each base being the corresponding coordinate of its four-dimensional representation, i.e. (0.4, 0, 0.1, 0.7); since the weight of the second dimension is 0, the picture is evidently not composed of the second basis. According to the binarization principle of the invention, the binary code of this picture is: 1, 0, 1, 1 (see Fig. 2).
(3) From this binary code, the invention likewise shows that the picture is not composed of the second basis, so the coding process is interpretable. It is also easy to show that the invention preserves similarity. For example, for two four-dimensional pictures (0.51, 0.51, 0.51, 0.51) and (0.49, 0.49, 0.49, 0.49), the invention encodes them as (1, 1, 1, 1) and (1, 1, 1, 1). Their Euclidean distance in the real-valued space clearly shows that they are similar, and the results of the invention's coding are also similar. With a common hash coding rule, however, the two pictures would be encoded as (1, 1, 1, 1) and (0, 0, 0, 0), which fails to preserve the original-space similarity in the binary (i.e. Hamming) space. This shows that the similarity preservation of the invention is effective.
Claims (3)
1. A big data retrieval method based on sparse hashing, comprising the steps of:
(1) sampling data from the big data as a training set X;
(2) training hash functions with X;
(3) binary-coding the examples in the large data set that have not yet received binary codes, and storing the coded big data in computer memory or on external disk;
(4) for a new test example, first obtaining its low-dimensional real value, then obtaining its low-dimensional binary code, and finally matching the binary code of the test example against the binary codes of the big data by similarity search to obtain its similar examples;
wherein the size n of the training set X of step (1) is determined by a t-distribution sample-size formula in terms of t_{α/2} and ε, where t_{α/2} is the critical value at the chosen confidence level, obtained from t-distribution tables, and ε is the set maximum allowable error;
said step (2) comprising the following processes:
a). establishing the objective function:
min_{B,S} ||X − BS||² + λ1 Σ_{i,j} w_{i,j} ||s_i − s_j||² + λ2 ||S||_1, subject to S ≥ 0,
where X is the training set; B is the basis space, each vector of B being a basis vector learned from the training set X; S is the low-dimensional real-valued projection of X onto the basis space B; λ1 and λ2 are tunable parameters obtained by ten-fold cross-validation; w_{i,j} is the Gaussian-kernel projection of the Euclidean distance between two examples x_i and x_j in X; s_i and s_j are two vectors in the matrix S; B_{i,j} is the element in row i and column j of the matrix B; i = 1, …, n indexes the examples, j = 1, …, k indexes the basis vectors, n is the number of examples, k is the number of basis vectors, and S ≥ 0 means that every element of S is non-negative;
b). converting S into binary codes: each non-zero element of S is converted to 1, otherwise to 0;
c). establishing the hash functions: the examples in the training set X whose hash value is 1 form the class A_{m1}, m = 1, …, d, and the remaining examples, whose hash value is 0, form the class A_{m0}, m = 1, …, d, yielding 2d classes; the hash function is defined as:
with the dimensionality of S being d and that of X being D, D >> d, each of the d dimensions being a binary vector, and one hash function being established for each of the d dimensions, d hash functions in total; in the formula, X_i is the i-th vector of the matrix X and S_i is the i-th vector of the matrix S, i = 1, …, n.
2. The method according to claim 1, wherein in said step (3), for each example x of the big data, s = (B'B + 2I)^{-1} B'x gives the low-dimensional real value of x, and the hash functions then give its low-dimensional binary code; where B is the basis space defined in the previous step and I is an identity matrix of matching dimension.
3. The method according to claim 1, wherein in said step (4), for each example x_t of the test data set, s_t = (B'B + 2I)^{-1} B'x_t gives the low-dimensional real value of x_t, and the hash functions then give its low-dimensional binary code; where B is the basis space defined in the step above and I is an identity matrix of matching dimension.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310457033.4A CN103605653B (en) | 2013-09-29 | 2013-09-29 | Big data retrieval method based on sparse hash |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103605653A CN103605653A (en) | 2014-02-26 |
CN103605653B true CN103605653B (en) | 2017-01-04 |
Family
ID=50123878
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310457033.4A Expired - Fee Related CN103605653B (en) | 2013-09-29 | 2013-09-29 | Big data retrieval method based on sparse hash |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103605653B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104484566A (en) * | 2014-12-16 | 2015-04-01 | 芜湖乐锐思信息咨询有限公司 | Big data analysis system and big data analysis method |
CN104462458A (en) * | 2014-12-16 | 2015-03-25 | 芜湖乐锐思信息咨询有限公司 | Data mining method of big data system |
CN104462459A (en) * | 2014-12-16 | 2015-03-25 | 芜湖乐锐思信息咨询有限公司 | Neural network based big data analysis and processing system and method |
CN113377294B (en) * | 2021-08-11 | 2021-10-22 | 武汉泰乐奇信息科技有限公司 | Big data storage method and device based on binary data conversion |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102402617A (en) * | 2011-12-23 | 2012-04-04 | 天津神舟通用数据技术有限公司 | Easily compressed database index storage system using fragments and sparse bitmap, and corresponding construction, scheduling and query processing methods |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102402617A (en) * | 2011-12-23 | 2012-04-04 | 天津神舟通用数据技术有限公司 | Easily compressed database index storage system using fragments and sparse bitmap, and corresponding construction, scheduling and query processing methods |
Non-Patent Citations (3)
Title |
---|
Sparse Hashing for Fast Multimedia Search; Xiaofeng Zhu et al.; ACM Transactions on Information Systems; May 2013; vol. 31, no. 2; 9:7-9:13 *
Image indexing based on sparse spectral hashing (in Chinese); Zhang Xiao; China Masters' Theses Full-text Database; 2011-07-15; no. 7; I138-506 *
Image indexing algorithm based on structured sparse spectral hashing (in Chinese); Ouyang Chuanfei; China Masters' Theses Full-text Database; 2012-07-15; no. 7; I138-2166 *
Also Published As
Publication number | Publication date |
---|---|
CN103605653A (en) | 2014-02-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Liu et al. | Deep sketch hashing: Fast free-hand sketch-based image retrieval | |
Zafar et al. | A novel discriminating and relative global spatial image representation with applications in CBIR | |
Ali et al. | A hybrid geometric spatial image representation for scene classification | |
US8428397B1 (en) | Systems and methods for large scale, high-dimensional searches | |
US8849030B2 (en) | Image retrieval using spatial bag-of-features | |
Zafar et al. | Image classification by addition of spatial information based on histograms of orthogonal vectors | |
EP3166020A1 (en) | Method and apparatus for image classification based on dictionary learning | |
Yang et al. | An improved Bag-of-Words framework for remote sensing image retrieval in large-scale image databases | |
Serra et al. | Gold: Gaussians of local descriptors for image representation | |
Zhang et al. | Fast orthogonal projection based on kronecker product | |
Ali et al. | Modeling global geometric spatial information for rotation invariant classification of satellite images | |
Picard et al. | Efficient image signatures and similarities using tensor products of local descriptors | |
López-Sastre et al. | Evaluating 3d spatial pyramids for classifying 3d shapes | |
CN103605653B (en) | Big data retrieval method based on sparse hash | |
Bu et al. | Local deep feature learning framework for 3D shape | |
Hu et al. | Fast binary coding for the scene classification of high-resolution remote sensing imagery | |
Wu et al. | A multi-sample, multi-tree approach to bag-of-words image representation for image retrieval | |
Li et al. | Hashing with dual complementary projection learning for fast image retrieval | |
CN106250918A (en) | A kind of mixed Gauss model matching process based on the soil-shifting distance improved | |
CN105760875A (en) | Binary image feature similarity discrimination method based on random forest algorithm | |
Czech | Invariants of distance k-graphs for graph embedding | |
CN103324942A (en) | Method, device and system for image classification | |
Wang et al. | Random angular projection for fast nearest subspace search | |
CN105718950B (en) | A kind of semi-supervised multi-angle of view clustering method based on structural constraint | |
Zhao et al. | MapReduce-based clustering for near-duplicate image identification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20170104 |