CN105868355A

CN105868355A - Large-scale multimedia data spatial index method

Info

Publication number: CN105868355A
Application number: CN201610187012.9A
Authority: CN
Inventors: 李晖; 陈梅
Original assignee: Guizhou Youlian Borui Technology Co Ltd; Guizhou University
Current assignee: Guizhou Youlian Borui Technology Co Ltd; Guizhou University
Priority date: 2016-03-29
Filing date: 2016-03-29
Publication date: 2016-08-17

Abstract

The invention discloses a spatial index method for large-scale multimedia data. A splitting algorithm splits one node into three, m/M is chosen to be 30-40%, and data to be processed is placed on the node with the lower occupancy rate. Because the splitting algorithm splits a node into three and assigns the overlapping regions of nodes to a single node, multi-path search during data retrieval is reduced and query efficiency is improved. At the same time, node capacity is increased; with m/M chosen to be 30-40%, good node utilization is maintained and node space is fully used. Querying of large-scale multimedia data is thereby achieved.

Description

A spatial index method for large-scale multimedia data

Technical field

The present invention relates to a spatial index method, in particular a spatial index method for large-scale multimedia data.

Background technology

For multimedia information, it is difficult for a computer to judge, as a human does, whether multimedia objects are similar. Features of the multimedia information are therefore extracted, that is, feature vectors are used to judge whether objects are similar. The feature vectors are typically a set of high-dimensional vectors. To retrieve large-scale multimedia information effectively, a spatial index is generally built over the feature vectors of the large-scale multimedia data objects, so that multimedia information can be queried more efficiently. In multimedia information retrieval, a multimedia object is often represented as a high-dimensional feature vector, and the similarity between two multimedia objects depends on the distance (usually the Euclidean distance) between the two corresponding high-dimensional feature vectors. Multimedia retrieval thus becomes the computation of distances between the set of feature vectors in the database and the feature vector being queried.
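As an illustration of retrieval by feature-vector distance, the following minimal sketch ranks database vectors by Euclidean distance to a query vector using a plain linear scan. The function names and the tiny two-dimensional data are assumptions made for this example only; the index described in this document exists precisely to avoid such a full scan.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_linear_scan(database, query, k):
    """Return the indices of the k database vectors closest to the query."""
    ranked = sorted(range(len(database)),
                    key=lambda i: euclidean(database[i], query))
    return ranked[:k]

# Hypothetical 2-dimensional feature vectors (real features would be 128- or 256-dimensional).
features = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
nearest = knn_linear_scan(features, [1.0, 1.0], k=2)
```

A linear scan computes one distance per stored vector, which is why a spatial index is used once the data set reaches hundreds of thousands of vectors.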

A spatial index is a data structure that organizes and stores data according to the spatial distribution of the data. When spatial indexes such as SR-Tree and A-Tree are applied to large-scale multimedia data, a large number of overlapping regions exist between index nodes. The larger the overlap between nodes, the more multi-path searches occur during object retrieval, which reduces the query efficiency for large-scale multimedia data.
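The effect of overlap on multi-path search can be sketched as follows. The example assumes, for simplicity, that each index node is bounded by a sphere only (an SR-Tree node region is actually the intersection of a sphere and a rectangle); the function name and the sample regions are hypothetical.

```python
import math

def nodes_to_visit(query, radius, node_spheres):
    """Names of nodes whose bounding sphere intersects the query sphere."""
    return [name for name, (center, r) in node_spheres.items()
            if math.dist(query, center) <= radius + r]

# Two layouts of the same two nodes: heavily overlapping vs. well separated.
overlapping = {"N1": ([0.0, 0.0], 2.0), "N2": ([1.0, 0.0], 2.0)}
disjoint = {"N1": ([0.0, 0.0], 1.0), "N2": ([5.0, 0.0], 1.0)}

query = [0.5, 0.0]
paths_overlapping = nodes_to_visit(query, 0.1, overlapping)  # both nodes must be searched
paths_disjoint = nodes_to_visit(query, 0.1, disjoint)        # only one node must be searched
```

With overlapping regions the same small query has to descend into both nodes, which is the multi-path search the text refers to.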

To solve the above problem, the present invention proposes a spatial index method for large-scale multimedia data (KSR-Tree).

Summary of the invention

The object of the invention is to provide a spatial index method for large-scale multimedia data. The method uses a splitting algorithm that enables large-scale multimedia data to be queried, maintains good node utilization, effectively improves query efficiency, and also increases the capacity of a node to store data.

To solve the above technical problem, the technical solution provided by the invention is as follows. A spatial index method for large-scale multimedia data: a splitting algorithm splits one node into three, m/M is chosen to be 30%-40%, and the data to be processed is then placed on the node with the lower occupancy for processing.

In the foregoing spatial index method for large-scale multimedia data, the steps of the splitting algorithm are:

A. When the number of entries in a node reaches M+1 or M+2, first judge whether this is the node's first overflow. If so, apply the reinsertion algorithm and postpone splitting the node; otherwise, split the node;

B. If the node overflows again after reinsertion, the node must be split. From the M+1 or M+2 data objects in the node, arbitrarily select three objects as the initial cluster centers;

C. Calculate the distance from each data object to the three initial cluster centers, and assign each data object to the cluster center nearest to it;

D. Recalculate the mean of the data objects in each resulting group;

E. Iterate steps C and D until the new means equal the previous means or the change is less than a specified threshold; the result is taken as the three intermediate nodes after the split.

In the foregoing spatial index method for large-scale multimedia data, the threshold is 0.0001.
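Steps B to E amount to a three-cluster k-means partition of the overflowing node's entries. The following is a minimal sketch under stated assumptions: entries are plain coordinate lists, the convergence test compares the largest movement of any center against the threshold, and `max_iter`, `seed` and the empty-group fallback are additions for this example that the text does not specify. Step A (reinsertion on first overflow) is not shown.

```python
import math
import random

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def split_into_three(entries, threshold=0.0001, max_iter=100, seed=0):
    """Partition the entries of an overflowing node into three groups."""
    rng = random.Random(seed)
    centers = rng.sample(entries, 3)  # step B: three arbitrary objects as initial centers
    groups = [[], [], []]
    for _ in range(max_iter):
        groups = [[], [], []]
        for e in entries:  # step C: assign each object to its nearest center
            nearest = min(range(3), key=lambda i: math.dist(e, centers[i]))
            groups[nearest].append(e)
        # Step D: recompute each group's mean (an empty group keeps its old center).
        new_centers = [mean(g) if g else c for g, c in zip(groups, centers)]
        shift = max(math.dist(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift < threshold:  # step E: means unchanged or change below the threshold
            break
    return groups

# Hypothetical overflowing node holding nine 2-dimensional objects.
overflowing = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
               [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
               [20.0, 0.0], [20.0, 1.0], [21.0, 0.0]]
three_groups = split_into_three(overflowing)
```

Because the initial centers are chosen arbitrarily, the resulting groups are not guaranteed to be the tightest possible clusters; a real index would also have to check each group against the lower limit m.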

Compared with the prior art, the present invention uses a splitting algorithm to split one node into three nodes and assigns the overlapping regions of nodes to a single node, which reduces multi-path search during data retrieval, improves query efficiency, and at the same time increases node capacity. With m/M chosen to be 30%-40%, good node utilization is maintained and node space is fully used. Querying of large-scale multimedia data is thereby achieved.

The applicant carried out a series of comparative experiments between the KSR-Tree of this application and SR-Tree and A-Tree. The experimental data are analyzed as follows:

Experimental environment

1. Hardware environment: AMD 6-core processor, 16 GB memory, 2 TB hard disk

2. Operating system platform: Ubuntu 12.04, 64-bit

3. Development environment: Eclipse for C/C++

Test scheme

The effectiveness of this spatial index algorithm is verified by experiment and compared with other representative spatial index techniques, such as SR-Tree and A-Tree. The real data used in the performance evaluation experiments all come from the MSRA-MM 2.0 data set, a real large-scale multimedia image data set from Microsoft's Bing image and video search engine, obtained from the Media Computing Group of Microsoft Research Asia.

The high-dimensional feature vector data used in the experiments are of two kinds: 128-dimensional wavelet texture features and 256-dimensional RGB features. Both come from actual picture data indexed by the Bing Images search engine, and the total amount of data is 1,000,000. The feature vectors of 100 of the pictures were selected as the query set. The experiments compare and analyze the effect of parameters such as data set size, k value and page size on the performance, that is, the average response time, of kNN retrieval based on the KSR-Tree, SR-Tree and A-Tree indexes.

Effect of data set size

In this experiment, the data set size is set to 200,000, 400,000, 800,000 and 1,000,000, and the data dimensions are 128 and 256 respectively. The experimental results are shown in Fig. 2 to Fig. 7.

The experimental results show that A-Tree achieves good retrieval performance when the data volume is 200,000. This is because that spatial index introduces an approximate representation of feature vectors based on compression, and can therefore obtain larger fan-in and fan-out. However, because A-Tree uses compression, it adds a decoding step and must refer to the positions of the parent and child nodes, so as the data volume grows its retrieval efficiency falls below that of KSR-Tree. Thus, although A-Tree retrieves well at small data volumes, its response time for kNN queries is not particularly good when the data is large, as in multimedia retrieval. The time of SR-Tree is always higher than that of KSR-Tree. This is because, in high-dimensional space, an index based on space partitioning causes a large number of overlapping regions and data objects to be accessed and analyzed as the search space expands.

On the average response time of queries, kNN retrieval based on the KSR-Tree proposed here achieves the best result. The main reasons are: (1) For multimedia data, the data volume is large and the dimensionality is high, which causes more overlapping regions between nodes, and the way KSR-Tree splits nodes reduces multi-path search. (2) During kNN pruning, because the node splitting algorithm proposed here is used, repeated queries of the nodes within the search range are reduced.

Effect of k value

This experiment evaluates the effect of the k value on the performance of kNN retrieval based on KSR-Tree. The data set size is set to 1,000,000, and the data dimensions are 128 and 256 respectively. The experimental results are shown in Fig. 8 to Fig. 11.

Fig. 8 and Fig. 9 show that, although the average response time of all three index techniques increases as k increases, the increase for KSR-Tree is the smallest and its response time is also the best. KSR-Tree still outperforms the other two index techniques on response time. Whether at 128 or 256 dimensions, the response time of A-Tree is always the highest. SR-Tree also performs poorly, which is caused by the larger initial search radius during kNN retrieval and the higher overlap between nodes.

Fig. 10 and Fig. 11 show that response time increases steadily as the data set grows. Under the same data set, however, the variation of KSR-Tree is smaller. As k grows, its response time also increases slightly. This is because expanding the search space causes a large number of overlapping regions and data objects to be accessed and analyzed, which increases response time.

Effect of page size

This experiment evaluates the effect of changing the page size on the retrieval efficiency of KSR-Tree. The k value is set to 100, and the page size is set to 32 KB, 64 KB and 128 KB. The experimental results are shown in Fig. 12 and Fig. 13.

For KSR-Tree, increasing the page size means a larger fanout: each node can hold more child entries, so that overlapping regions are effectively contained within a single node.

The experimental results show that at 128 dimensions retrieval efficiency is best when the page size is set to 64 KB. At 256 dimensions, retrieval efficiency is most stable when the page size is set to 128 KB, and response time is always lower than with the other two page sizes. kNN retrieval based on KSR-Tree has the lowest response time.

The applicant tested the effect on response time of choosing different numbers of nodes after a split. The experimental results for 3, 4, 5 and 6 nodes after splitting are shown in Fig. 14. The data are 128-dimensional, the page size of an index node is 64 KB, and the data set ranges from 200,000 to 1,000,000 records. The vertical axis is the query response time, in seconds. The response time curves correspond in turn to 3, 4, 5 and 6 nodes after splitting. As the figure shows, response time is shortest when the number of nodes after splitting is 3, and performance is better than with 4, 5 or 6.

The applicant tested the effect on response time of choosing different numbers of nodes after a split. The experimental results for 3, 4, 5 and 6 nodes after splitting are shown in Fig. 15. The data are 256-dimensional, the page size of an index node is 64 KB, and the data set ranges from 200,000 to 1,000,000 records. The vertical axis is the query response time, in seconds. The response time curves correspond in turn to 3, 4, 5 and 6 nodes after splitting. As the figure shows, response time is shortest when the number of nodes after splitting is 3, and performance is better than with 4, 5 or 6.

Fig. 14 and Fig. 15 show that response time is shortest when the number of nodes after splitting is 3.

Brief description of the drawings

Fig. 1 is a diagram of the split structure of the present invention;

Fig. 2 shows the effect of data set size on KSR-Tree, SR-Tree and A-Tree at 128 dimensions with k = 20; the abscissa is data set size and the ordinate is response time;

Fig. 3 shows the effect of data set size on KSR-Tree, SR-Tree and A-Tree at 256 dimensions with k = 20; the abscissa is data set size and the ordinate is response time;

Fig. 4 shows the effect of data set size on KSR-Tree, SR-Tree and A-Tree at 128 dimensions with k = 50; the abscissa is data set size and the ordinate is response time;

Fig. 5 shows the effect of data set size on KSR-Tree, SR-Tree and A-Tree at 256 dimensions with k = 50; the abscissa is data set size and the ordinate is response time;

Fig. 6 shows the effect of data set size on KSR-Tree, SR-Tree and A-Tree at 128 dimensions with k = 100; the abscissa is data set size and the ordinate is response time;

Fig. 7 shows the effect of data set size on KSR-Tree, SR-Tree and A-Tree at 256 dimensions with k = 100; the abscissa is data set size and the ordinate is response time;

Fig. 8 shows the effect of the k value on KSR-Tree, SR-Tree and A-Tree at 128 dimensions with a data set size of 1,000,000; the abscissa is the k value and the ordinate is response time;

Fig. 9 shows the effect of the k value on KSR-Tree, SR-Tree and A-Tree at 256 dimensions with a data set size of 1,000,000; the abscissa is the k value and the ordinate is response time;

Fig. 10 shows the variation with k value at 128 dimensions; the abscissa is data set size and the ordinate is response time;

Fig. 11 shows the variation with k value at 256 dimensions; the abscissa is data set size and the ordinate is response time;

Fig. 12 shows the effect of different page sizes on KSR-Tree at 128 dimensions; the abscissa is data set size and the ordinate is response time;

Fig. 13 shows the effect of different page sizes on KSR-Tree at 256 dimensions; the abscissa is data set size and the ordinate is response time;

Fig. 14 shows the effect on response time of choosing different numbers of nodes after splitting, at 128 dimensions with a page size of 64 KB and data sets from 200,000 to 1,000,000; the abscissa is data set size and the ordinate is response time;

Fig. 15 shows the effect on response time of choosing different numbers of nodes after splitting, at 256 dimensions with a page size of 64 KB and data sets from 200,000 to 1,000,000; the abscissa is data set size and the ordinate is response time.

Detailed description of the invention

Embodiment. A spatial index method for large-scale multimedia data: a splitting algorithm splits one node into three, m/M is chosen to be 30%-40%, and the data to be processed is then placed on the node with the lower occupancy for processing.

Here m is the lower limit on the number of objects in a node and M is the upper limit. Setting m to 30-40% of M is optimal: it ensures good query efficiency and insertion efficiency, so that good node utilization is maintained and the space of the node is fully used. The node with the lower occupancy is determined as follows: given two index nodes, each able to hold at most 10 records, if node A already holds 6 records and node B holds 8 records, then the occupancy of node A is lower.
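The two rules in this paragraph can be sketched as follows. The helper names `fill_bounds` and `least_occupied` are hypothetical, and integer percentages are used so that the bounds do not depend on floating-point rounding.

```python
def fill_bounds(M, low_pct=30, high_pct=40):
    """Smallest and largest integer m with low_pct% <= m/M <= high_pct%."""
    smallest = -(-low_pct * M // 100)  # ceiling division
    largest = high_pct * M // 100      # floor division
    return smallest, largest

def least_occupied(record_counts, capacity):
    """Name of the node with the lowest occupancy (records held / capacity)."""
    return min(record_counts, key=lambda name: record_counts[name] / capacity)

# The example from the text: each node holds at most 10 records, A holds 6 and B holds 8.
m_range = fill_bounds(10)
target = least_occupied({"A": 6, "B": 8}, capacity=10)
```

For M = 10 the admissible lower limits are m = 3 and m = 4, and the new data goes to node A.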

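The steps of the splitting algorithm are:

A. When the number of entries in a node reaches M+1 or M+2, first judge whether this is the node's first overflow. If so, apply the reinsertion algorithm and postpone splitting the node; otherwise, split the node;

B. If the node overflows again after reinsertion, the node must be split. From the M+1 or M+2 data objects in the node, arbitrarily select three objects as the initial cluster centers;

C. Calculate the distance from each data object to the three initial cluster centers, and assign each data object to the cluster center nearest to it;

D. Recalculate the mean of the data objects in each resulting group;

E. Iterate steps C and D until the new means equal the previous means or the change is less than a specified threshold; the result is taken as the three intermediate nodes after the split. The threshold is a small floating-point number greater than 0, set here to 0.0001.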

As shown in Fig. 1, the node structure of KSR-Tree is similar to that of SR-Tree, except that node splitting is "one into three". When the number of data objects in leaf node C reaches M+1, it is first judged whether reinsertion is needed. If not, the node is split, and after the split three new leaf nodes are produced under intermediate node 1. It is then judged whether intermediate node 1 needs to be split: if not, splitting stops; if so, node 1 is split into three new intermediate nodes. This process is repeated up to the root node.

In the structure diagram of KSR-Tree, if node C overflows, the data objects in node C are split "one into three", yielding nodes C, I and J. Node 1 does not overflow at this point, so splitting stops, giving the structure shown in Fig. 1.
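The upward propagation of "one into three" splits can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the `Node` class and `split_upward` are hypothetical names, the reinsertion check is omitted, and `partition_three` is a round-robin placeholder standing in for the clustering-based splitting algorithm.

```python
class Node:
    """Tree node whose entries are data objects (leaf) or child nodes (intermediate)."""
    def __init__(self, entries=None, parent=None):
        self.entries = entries if entries is not None else []
        self.parent = parent

def partition_three(entries):
    """Placeholder partition: deals the entries into three groups round-robin."""
    return [entries[i::3] for i in range(3)]

def split_upward(node, M):
    """Split overflowing nodes into three, moving toward the root; return the root."""
    while len(node.entries) > M:
        groups = partition_three(node.entries)
        parent = node.parent
        if parent is None:  # the root itself overflowed: grow the tree by one level
            parent = Node(entries=[node])
            node.parent = parent
        node.entries = groups[0]
        siblings = [Node(g, parent) for g in groups[1:]]
        for sibling in siblings:  # child nodes that moved need their new parent
            for entry in sibling.entries:
                if isinstance(entry, Node):
                    entry.parent = sibling
        parent.entries.extend(siblings)
        node = parent  # the parent gained two entries and may now overflow
    while node.parent is not None:
        node = node.parent
    return node

# Leaf C overflows (M = 4) under intermediate node 1, as in the Fig. 1 example.
node1 = Node()
leaf_c = Node([1, 2, 3, 4, 5], parent=node1)
node1.entries = [leaf_c]
root = split_upward(leaf_c, M=4)
```

After the call, node 1 holds three leaves (corresponding to C, I and J in the text) and does not itself overflow, so splitting stops there.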

Claims

1. A spatial index method for large-scale multimedia data, characterized in that: a splitting algorithm splits one node into three, m/M is chosen to be 30%-40%, and the data to be processed is then placed on the node with the lower occupancy for processing.

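2. The spatial index method for large-scale multimedia data according to claim 1, characterized in that the steps of the splitting algorithm are:

A. When the number of entries in a node reaches M+1 or M+2, first judge whether this is the node's first overflow. If so, apply the reinsertion algorithm and postpone splitting the node; otherwise, split the node;

B. If the node overflows again after reinsertion, the node must be split. From the M+1 or M+2 data objects in the node, arbitrarily select three objects as the initial cluster centers;

C. Calculate the distance from each data object to the three initial cluster centers, and assign each data object to the cluster center nearest to it;

D. Recalculate the mean of the data objects in each resulting group;

E. Iterate steps C and D until the new means equal the previous means or the change is less than a specified threshold; the result is taken as the three intermediate nodes after the split.

3. The spatial index method for large-scale multimedia data according to claim 2, characterized in that the threshold is 0.0001.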