CN108108770A - Mobile visual search framework based on CRBM and Fisher networks - Google Patents

Mobile visual search framework based on CRBM and Fisher networks Download PDF

Info

Publication number
CN108108770A
Authority
CN
China
Prior art keywords
network
layer
fisher
algorithm
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711493995.XA
Other languages
Chinese (zh)
Inventor
纪荣嵘
林贤明
黄晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN201711493995.XA priority Critical patent/CN108108770A/en
Publication of CN108108770A publication Critical patent/CN108108770A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • G06F18/2414Smoothing the distance, e.g. radial basis function networks [RBFN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N1/00Scanning, transmission or reproduction of documents or the like, e.g. facsimile transmission; Details thereof
    • H04N1/32Circuits or arrangements for control or supervision between transmitter and receiver or between image input and image output device, e.g. between a still-image camera and its memory or between a still-image camera and a printer device
    • H04N1/32101Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title
    • H04N1/32144Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title embedded in the image data, i.e. enclosed or integrated in the image, e.g. watermark, super-imposed logo or stamp
    • H04N1/32149Methods relating to embedding, encoding, decoding, detection or retrieval operations
    • H04N1/32203Spatial or amplitude domain methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N1/00Scanning, transmission or reproduction of documents or the like, e.g. facsimile transmission; Details thereof
    • H04N1/32Circuits or arrangements for control or supervision between transmitter and receiver or between image input and image output device, e.g. between a still-image camera and its memory or between a still-image camera and a printer device
    • H04N1/32101Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title
    • H04N1/32144Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title embedded in the image data, i.e. enclosed or integrated in the image, e.g. watermark, super-imposed logo or stamp
    • H04N1/32149Methods relating to embedding, encoding, decoding, detection or retrieval operations
    • H04N1/32267Methods relating to embedding, encoding, decoding, detection or retrieval operations combined with processing of the image

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

A mobile visual search framework based on a CRBM and a Fisher network, relating to image retrieval on mobile terminals. The framework includes: 1) constructing and training a continuous restricted Boltzmann machine (CRBM) network; 2) constructing and training a Fisher layer network. In the algorithm that aggregates local features into a global compact binary feature, the nonlinear dimensionality-reduction algorithm CRBM is used to find the essential subspace information of the non-Gaussian-distributed local features, and a Fisher-based network structure is used to aggregate the Fisher Vector, yielding a more discriminative global feature. A scalar quantization algorithm and a bit-adaptive algorithm produce a compact, adaptive feature, so the length of the transmitted image feature information can be selected adaptively according to the available mobile network bandwidth. In the retrieval stage, the global feature is used for coarse matching to obtain a candidate set, and the local features are used for a geometric consistency check to perform exact matching, so the framework adapts to large-scale image retrieval tasks.

Description

Mobile visual search framework based on CRBM and Fisher network
Technical Field
The invention relates to image retrieval on mobile terminals, and in particular to a mobile visual search framework based on a CRBM and a Fisher network.
Background
The International Telecommunication Union (ITU), a United Nations agency, publishes annual statistics on the number of users accessing mobile broadband worldwide. According to its 2015 and 2016 statistics, the number of users accessing the network through mobile devices grew from 3.2 billion in 2015 to 3.6 billion in 2016, a net increase of 400 million. By 2016, roughly 47% of the world's population was using mobile devices to access the internet. The ITU attributes this rapid growth to the fact that mobile broadband services had become available across 84% of the world, the result of significant global investment in broadband infrastructure. Meanwhile, with the development of mobile broadband, 3G and 4G networks have spread worldwide, and the emerging 4.5G and 5G networks are drawing more of the world's residents into the wave of the mobile internet. From Industry 3.0 to the present, and on toward Industry 4.0, the internet has drastically changed human life, and people now treat it as a basic tool of daily living. Government support for internet construction keeps raising network speeds and lowering network costs. At the same time, internet education has reached primary and secondary schools, so every student can receive information-technology courses and the younger generation grows up with the internet. In December 2016 the China Internet Network Information Center (CNNIC) issued its 39th Statistical Report on Internet Development in China. The report shows that the number of internet users in China had grown to 731 million, roughly the total population of Europe, and that internet penetration in China had reached 53.2%. The number of mobile broadband users reached 695 million, with a growth rate exceeding 10% for three consecutive years, and mobile devices continue to displace desktop and notebook computers. Although the demographic dividend is diminishing, the base of mobile internet users is very large. With this enormous base of mobile broadband users there is strong demand for mobile services and multimedia information, and more and more users are eager to experience future technologies, especially the user experience brought by mobile internet technology.
In the field of mobile device manufacturing, more and more leading IT manufacturers are competing to produce smartphones and tablet computers and frequently release new models, such as Huawei, OPPO, vivo, Xiaomi and ZTE in China. According to statistics from the market research firm IHS Technology on global handset market share and total sales in 2016, Chinese handset makers, together with OPPO, reached third and fourth place in the global ranking. These manufacturers' new mobile devices are equipped with multiple sensing components: cameras, GPS, gravity sensors, electronic compasses and other devices have become standard equipment and are continuously upgraded. Thanks to this hardware and the various sensors it carries, powerful applications running on the mobile terminal can quickly connect the real world with the information world, and users can conveniently obtain the multimedia information and network services they need in real time through the mobile network. It is safe to say that picture-based search will become one of the core technologies of future mobile internet applications. A picture usually contains a large amount of information; capturing an image of an object of interest with a mobile device is more convenient than typing text and can retrieve more information than a text search. To give a simple example, if a tourist is interested in a building, the tourist can take a photo of the scene through an application on the phone and search in real time, obtaining results that a text search could not reach without first knowing the building's name. The user acquires visual images together with GPS, electronic-compass and other sensor information from objects in the real world through the mobile intelligent terminal, transmits this information over the mobile internet to a large-scale visual database for retrieval of the related information, and finally the results are returned to the user side over the mobile internet.
Combined with other applications such as augmented reality (AR), mobile visual search has given rise to emerging service models. For example, a user takes a real-time picture of an object or a tool, the mobile visual search application identifies its basic information, and an augmented reality program then reproduces a three-dimensional geometric model of the object on the mobile terminal together with a dynamic demonstration of how it is used. If mobile visual search is combined with mobile location services, a user can obtain shopping-mall information, brand information, price information and location information for a brand simply by opening the camera of the mobile terminal. Or, at the scene of an emergency, investigators can photograph the accident site through the mobile visual search application and then obtain an augmented-reality presentation of the site's three-dimensional geometric structure through a mobile AR application in order to work out how to handle the situation.
Mobile visual search faces a number of technical challenges:
First, MVS [1] (Huang Xiaobin et al., a review of mobile visual search research abroad, Journal of Library Science in China, 2014, 40(3): 114-128) searches in a large-scale image database and therefore faces problems such as limited retrieval accuracy and long retrieval times. The database usually stores massive amounts of image data, of which only a small fraction is relevant to a given query; the rest is interference. This poses a significant challenge to MVS retrieval accuracy. Users demand real-time response from mobile search applications, so long retrieval times in a large-scale image database severely degrade the user experience. To address this, the core idea of mobile visual search algorithms is to quickly establish the relationship between a query image and the most relevant images in the database [2] (Gu Jia, Tang Sheng, Xie Hongtao, et al., A survey of mobile visual search, Journal of Computer-Aided Design & Computer Graphics, 2017, 29(6): 1007-1021). The visual features of an image therefore need strong discriminative power while remaining compact, so that compact-feature matching can quickly locate the relevant information in a massive database.
Second, current mobile network bandwidth is unstable and speeds are limited, so transmitting an image introduces significant delay. Because of wireless bandwidth constraints, uploading a whole query picture causes a large delay and greatly harms the user experience; it is therefore natural to transmit the feature information of a picture rather than the picture itself. Network bandwidth and signal strength differ from region to region, so the picture feature information must also be scalable to different bandwidths, and fast, accurate matching of feature information of different lengths becomes equally important.
Third, the computing power and storage capacity of mobile device hardware are limited. Although both keep improving, they remain constrained, so real-time visual feature extraction is a challenge for more complex feature extraction and multitasking scenarios. The visual feature extraction algorithm on the mobile terminal must therefore have low time and space complexity and extract compact visual features, while still guaranteeing the discriminative power of the features and the retrieval accuracy and meeting the real-time requirement.
Combining the above challenges, visual features for mobile visual search need to be discriminative, compact and scalable, and at the same time the visual algorithm must have low complexity.
In the face of these technical challenges, the most effective mobile visual search framework at present is the Compact Descriptors for Visual Search (CDVS) standard formulated by the Moving Picture Experts Group (MPEG) [3] (IEEE Transactions on Image Processing, 2016, 25(1): 179), whose goal is a bitstream syntax standard for interoperable image retrieval applications. In that framework, however, the linear dimensionality-reduction algorithm PCA (principal component analysis) is used to reduce the dimension of the local SIFT features, whose statistics are non-Gaussian, which greatly damages the local feature information; and the traditional EM algorithm is used to estimate the parameters of the Gaussian mixture model with K Gaussian components in the global Fisher Vector feature, but the setting of the initial parameters strongly affects the convergence of the EM algorithm, which easily converges to a local optimum.
Disclosure of Invention
The invention aims to provide a mobile visual search framework based on a CRBM and a Fisher network, addressing the above problems of the existing mobile visual search framework CDVS.
The invention comprises the following steps:
1) constructing and training a continuous restricted Boltzmann machine (CRBM) network;
2) constructing and training a Fisher layer network.
In step 1), the specific method for constructing and training the continuous restricted Boltzmann machine network is as follows:
(1) Construction: a 3-layer continuous restricted Boltzmann machine network is built, in which the first layer has 128 units, the second layer has 64 units and the third layer has 32 units; for each pair of adjacent layers, the former layer serves as the visible units and the latter layer as the hidden units. The visible units and hidden units are fully connected, with connection weights {w}. A continuous restricted Boltzmann machine (CRBM) adds a zero-mean Gaussian-noise continuous stochastic component inside the sigmoid activation of the visible-layer units of an RBM network; its structure is otherwise the same as an RBM, consisting of a visible layer and a hidden layer whose units are connected across layers. Information flows in both directions when the network is trained and used, and the weights in the two directions are equal, i.e. w_ij = w_ji. Let s_j denote the output of neuron j and {s_i} the states of the input neurons, with the hidden-layer state h_j represented by s_j and the visible-layer state v_i represented by s_i; then:

s_j = φ_j( Σ_i w_ij·s_i + n_j )

where the noise component n_j = σ·N_j(0,1), N_j(0,1) is a Gaussian random variable with mean 0 and variance 1, and the constant σ scales it, so that n_j has the probability distribution

p(n_j) = (1/√(2πσ²))·exp(−n_j²/(2σ²));

φ_j(·) is the sigmoid-shaped function

φ_j(x) = θ_L + (θ_H − θ_L)·1/(1 + exp(−a_j·x)),

where θ_L and θ_H are its lower and upper asymptotes and the parameter a_j controls its slope. As a_j grows from small to large, the unit transitions smoothly from a noise-free deterministic state to a binary stochastic state; if a_j keeps the sigmoid effectively linear over the noise range, then s_j follows a Gaussian distribution whose mean is determined by the weighted input Σ_i w_ij·s_i and whose variance is σ².
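As an illustration of the unit activation defined above, the following minimal sketch (an assumption-laden example rather than the patent's own code; the parameters theta_L, theta_H, a and sigma mirror the asymptotes, slope parameters and noise level just described) computes the hidden states of one CRBM layer:

    import numpy as np

    def crbm_activation(s_in, W, a, sigma=0.2, theta_L=0.0, theta_H=1.0, rng=None):
        """s_in: (n_visible,) input states; W: (n_visible, n_hidden) weights;
        a: (n_hidden,) slope-control parameters. Returns the hidden states s_j."""
        rng = np.random.default_rng() if rng is None else rng
        # Weighted input plus the zero-mean Gaussian noise component n_j = sigma * N_j(0, 1)
        x = s_in @ W + sigma * rng.standard_normal(W.shape[1])
        # Noisy sigmoid with lower/upper asymptotes theta_L, theta_H and slope a_j
        return theta_L + (theta_H - theta_L) / (1.0 + np.exp(-a * x))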
(2) Training:
The CRBM network parameters are trained with a minimizing-contrastive-divergence (MCD) weight-update algorithm, which requires only simple additions and multiplications and therefore keeps the computation small. The MCD training criterion updates the connection weights {w_ij} and the slope-control parameters {a_j} of the sigmoid function:

Δw_ij ∝ ⟨s_i·s_j⟩ − ⟨ŝ_i·ŝ_j⟩

where ŝ_j denotes the one-step sampled (reconstructed) state of neuron j and ⟨·⟩ denotes the mean over the training set. The simplified a_j update rule is:

Δa_j ∝ (1/a_j²)·(⟨s_j²⟩ − ⟨ŝ_j²⟩)
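The two update rules can be written out as the following sketch (illustrative only; the learning rates lr_w and lr_a are assumed, and the reconstructed states are taken to come from one step of sampling as in contrastive divergence):

    import numpy as np

    def mcd_update(S, S_hat, H, H_hat, W, a, lr_w=1e-3, lr_a=1e-3):
        """S, S_hat: (batch, n_visible) data and one-step reconstructed visible states;
        H, H_hat: (batch, n_hidden) hidden states driven by S and S_hat."""
        batch = S.shape[0]
        # Delta w_ij proportional to <s_i s_j> - <s_hat_i s_hat_j>, averaged over the batch
        W += lr_w * (S.T @ H - S_hat.T @ H_hat) / batch
        # Delta a_j proportional to (1 / a_j^2) * (<s_j^2> - <s_hat_j^2>)
        a += (lr_a / a**2) * ((H**2).mean(axis=0) - (H_hat**2).mean(axis=0))
        return W, a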
(3) Supervised fine-tuning:
The weights obtained after the CRBM is trained with the contrastive divergence algorithm (MCD) are already close to the global optimum, but to make the network more robust a back-propagation (BP) algorithm is used for fine-tuning. The desired output target {V'_i} is set equal to the input data {V_i}; the error between the model output and the input is used to adjust every weight gradient until the error converges. The parameters to be adjusted are the connection weights between layers, the bias weights of each layer, and the sigmoid slope parameters a_j. The objective function is

J(W,b,a) = (1/2)·‖F_W,b,a(x) − x‖²

where x is the network input data value and F_W,b,a(x) is the network output value. For each output neuron i of layer L (the output layer), the residual is

δ_i^(L) = −(x_i − F_W,b,a(x)_i)·f′(z_i^(L)).

For l = L−1, ..., 2, the residual of the i-th neuron node in layer l is

δ_i^(l) = ( Σ_j w_ij^(l)·δ_j^(l+1) )·f′(z_i^(l)),

where f(z_i) is the neuron activation function

f(z) = θ_L + (θ_H − θ_L)·1/(1 + exp(−z)).
For each layer l = L−1, ..., 2, the partial derivatives with respect to the connection weight parameters, the bias parameters and the slope-control parameters are:

∇_{W^l} J(W,b,a) = δ^(l+1)·(s^l)^T·a^l

∇_{b^l} J(W,b,a) = δ^(l+1)·a^l

∇_{a^l} J(W,b,a) = δ^(l+1)·(h^(l+1))^T

where:

h^l = W^(l−1)·s^(l−1) + b^(l−1) + σ·N(0,1)

z^l = a^(l−1)·h^l

s^l = f(z^l)
The gradients above are the gradient update for a single sample in the data set; to train on the whole data set one simply sums the per-sample gradients and takes their average. After the gradient values of all parameters are obtained, each parameter is optimized with a quasi-Newton optimization algorithm.
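A minimal sketch of this fine-tuning loop, under the assumption that a per-sample loss-and-gradient routine model_loss_and_grad is available and that SciPy's L-BFGS-B stands in for the unspecified quasi-Newton method, looks as follows:

    import numpy as np
    from scipy.optimize import minimize

    def batch_objective(theta, samples, model_loss_and_grad):
        """Mean reconstruction loss and mean gradient over the whole training set."""
        losses, grads = zip(*(model_loss_and_grad(theta, x) for x in samples))
        return float(np.mean(losses)), np.mean(grads, axis=0)

    def finetune(theta0, samples, model_loss_and_grad, max_iter=200):
        # L-BFGS-B is a limited-memory quasi-Newton optimizer available in SciPy.
        result = minimize(batch_objective, theta0,
                          args=(samples, model_loss_and_grad),
                          jac=True, method="L-BFGS-B",
                          options={"maxiter": max_iter})
        return result.x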
In step 2), the specific method for constructing and training the Fisher layer network is as follows:
The Gaussian mixture model is simplified with two assumptions:
(1) every Gaussian component in the GMM has equal weight, i.e. ω_k = 1;
(2) u_k(x) is simplified to the form

u_k(x) = exp( −(1/2)·‖(x − μ_k)/σ_k‖² ),

which is equivalent to assuming that all covariance matrices have the same determinant value. The simplified posterior γ_j(k) is

γ_j(k) = u_k(x_j) / Σ_n u_n(x_j).

Suppose w_k = 1/σ_k and b_k = −μ_k; the final Fisher layer then has the form

g_μk = (1/t)·Σ_j γ_j(k)·( w_k ⊙ (x_j + b_k) )

g_σk = (1/(t·√2))·Σ_j γ_j(k)·( (w_k ⊙ (x_j + b_k))² − 1 )

where ⊙ denotes an element-wise operation; γ_j(k) is a softmax function and w_k, b_k are the parameters of the k-th Gaussian component of the GMM. γ_j(k) contains the common computation term w_n ⊙ (x_ij + b_n), which is differentiable, and the remaining computations are linear or squaring operations, which are also differentiable, so the parameters can be learned through the back-propagation algorithm.
Because the simplified Fisher Vector algorithm is composed of these differentiable operations, the gradient of the error function with respect to all weights and bias values can be computed during network training by gradient descent with error back-propagation. In the large-scale image retrieval problem, the adaptive global binary feature acts in the first stage of the retrieval process: Hamming-distance matching with the global feature is performed in the server-side database to obtain a candidate set. The cross-entropy loss function is chosen:

L = −Σ_i Σ_c [ y_ic·log σ(s_ic) + (1 − y_ic)·log(1 − σ(s_ic)) ]

where s_i = [s_i1, ..., s_iC]^T is the score vector of image X_i; y_i = [y_i1, ..., y_iC]^T is the label vector; C is the number of classes in the dataset; and σ(x) is the sigmoid function, i.e.:

σ(x) = 1/(1 + exp(−x)).
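To make the simplified Fisher layer concrete, the following sketch (shapes and normalization constants are illustrative assumptions, not taken verbatim from the patent) computes the soft assignments γ_j(k) and the aggregated mean- and variance-gradient vectors for a set of reduced local descriptors:

    import numpy as np

    def fisher_layer(X, W, B):
        """X: (t, d) local descriptors; W, B: (K, d) per-component parameters
        with w_k = 1/sigma_k and b_k = -mu_k. Returns the concatenated
        mean/variance gradient vectors of length 2*K*d."""
        t = X.shape[0]
        # Common term w_k ⊙ (x_j + b_k) for every descriptor j and component k: (t, K, d)
        Y = W[None, :, :] * (X[:, None, :] + B[None, :, :])
        # Soft assignment gamma_j(k) via a softmax over -0.5 * ||.||^2
        logits = -0.5 * np.sum(Y ** 2, axis=2)          # (t, K)
        logits -= logits.max(axis=1, keepdims=True)      # numerical stability
        gamma = np.exp(logits)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # Aggregated mean- and variance-gradient statistics per component: (K, d)
        g_mu = np.einsum('tk,tkd->kd', gamma, Y) / t
        g_var = np.einsum('tk,tkd->kd', gamma, Y ** 2 - 1.0) / (t * np.sqrt(2.0))
        return np.concatenate([g_mu.ravel(), g_var.ravel()])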
the invention has the following advantages:
the invention uses a nonlinear dimension reduction algorithm CRBM to reduce the dimension of the local features of one image, and improves the effect of the aggregated global features by reducing the loss of the dimension reduction algorithm to the local feature information; and meanwhile, a Fisher network based on learning is adopted to generate more efficient Fisher Vector aggregation characteristics.
The invention provides a lightweight, efficient image retrieval system that can be deployed on a mobile terminal. In the algorithm that aggregates local features into a global compact binary feature, the nonlinear dimensionality-reduction algorithm CRBM is used to find the essential subspace information of the non-Gaussian-distributed local features, and a Fisher-based network structure is used to aggregate the Fisher Vector, yielding a more discriminative global feature. A scalar quantization algorithm and a bit-adaptive algorithm produce compact, adaptive features, so the length of the transmitted image feature information can be selected adaptively according to the mobile terminal's network bandwidth. In the retrieval stage, the global feature is used for coarse matching to obtain a candidate set, and the local features are used for a geometric consistency check to perform exact matching, so the method adapts well to large-scale image retrieval tasks.
Drawings
FIG. 1 is a diagram showing the structure of a Fisher network according to the present invention.
Fig. 2 is a global binary compact feature aggregation flow diagram.
Detailed Description
The trained CRBM network and Fisher network are used to aggregate the global compact binary feature with the following algorithm:
1) Input:
a) the offline-trained GMM model;
b) the set of local SIFT features of image X, {x_j, j = 1, ..., t}.
2) For each local feature x_j in the SIFT set {x_j, j = 1, ..., t} of image X:
3) use the continuous restricted Boltzmann machine to reduce x_j from 128 dimensions to 32 dimensions;
4) end of loop.
5) For each Gaussian component i:
6) for each local SIFT descriptor j:
7) compute the posterior probability γ_j(i) of local feature x_j with respect to the i-th Gaussian component;
8) end of inner loop;
9) aggregate over all local features the Gaussian mean-gradient vector g_μi and the variance-gradient vector g_σi (each 32-dimensional, i = 1, ..., 512);
10) end of loop.
11) Apply the SCFV scalar quantization method to the aggregated Fisher vector to obtain the binary global feature.
12) Apply the bit-adaptive algorithm to obtain compact descriptors g corresponding to 512-byte, 1 KB, 2 KB, 4 KB, 8 KB and 16 KB bitstreams.
Output: the Fisher 0/1 binarized scalable compact descriptor.
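The pipeline above can be outlined with the following sketch; crbm_reduce, the sign-based quantization and the truncation step are hypothetical stand-ins for the trained CRBM, the SCFV scalar quantizer and the bit-adaptive algorithm, and fisher_layer refers to the earlier Fisher layer sketch:

    import numpy as np

    def aggregate_global_descriptor(sift_descriptors, crbm_reduce, fisher_layer,
                                    fisher_params,
                                    target_bytes=(512, 1024, 2048, 4096, 8192, 16384)):
        W, B = fisher_params                                   # (K, 32) each, K = 512
        # Steps 2-4: reduce every 128-D SIFT descriptor to 32 dimensions with the CRBM.
        X = np.stack([crbm_reduce(x) for x in sift_descriptors])      # (t, 32)
        # Steps 5-10: posteriors and aggregated mean/variance gradients (Fisher layer).
        fv = fisher_layer(X, W, B)                             # (2 * K * 32,)
        # Step 11: scalar quantization to a binary global feature
        # (a simple sign-based stand-in for SCFV quantization).
        binary_fv = (fv > 0).astype(np.uint8)
        # Step 12: bit-adaptive selection, approximated here by truncating the
        # bit string to the requested bitstream lengths.
        return {nbytes: binary_fv[:nbytes * 8] for nbytes in target_bytes}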
The retrieval process is as follows:
after the local features extracted from the query image and the binary scalable compact global features, the retrieval stage is divided into two steps. The first step is as follows: matching a candidate set for the binary scalable compact global features of the image in a server database by using a Hamming distance; the second step: the candidate images are matched exactly using a geometric consistency check on the local features in the candidate set. And returns the search result.
The Fisher network structure of the invention is shown in Fig. 1; the computation performed by each module is as follows:
(1) Module 1 computes y_ijk_1 = w_k ⊙ (x_ij + b_k); for the feature vector x_ij it outputs {y_ij1_1, y_ij2_1, ..., y_ij512_1}, one value per Gaussian component.
(2) Module 2 computes y_ijk_2 = (y_ijk_1)², the element-wise square of {y_ij1_1, y_ij2_1, ..., y_ij512_1}, and outputs {y_ij1_2, y_ij2_2, ..., y_ij512_2}.
(3) Module 3 applies its formula (shown in Fig. 1) and outputs {y_ij1_3, y_ij2_3, ..., y_ij512_3}.
(4) Module 4 applies its formula (shown in Fig. 1) and outputs {y_ij1_4, y_ij2_4, ..., y_ij512_4}.
(5) Module 5 applies its formula (shown in Fig. 1) and outputs the posteriors {γ_j(1), γ_j(2), ..., γ_j(k)}.
(6) Module 6 computes y_ijk_5 = y_ijk_2 − 1 and outputs {y_ij1_5, y_ij2_5, ..., y_ij512_5}.
(7) Module 7 applies its formula (shown in Fig. 1) and outputs {y_ij1_6, y_ij2_6, ..., y_ij512_6}.
(8) Module 8 applies its aggregation formula (shown in Fig. 1) and produces its aggregated output vector.
(9) Module 9 applies its aggregation formula (shown in Fig. 1) and produces its aggregated output vector.

Claims (3)

1. A mobile visual search framework based on a CRBM and a Fisher network, characterized by comprising the following steps:
1) constructing and training a continuous restricted Boltzmann machine network;
2) constructing and training a Fisher layer network.
2. The CRBM and Fisher network-based mobile visual search framework of claim 1, wherein in step 1), the specific method for constructing and training the continuous restricted Boltzmann machine network is as follows:
(1) Construction: a 3-layer continuous restricted Boltzmann machine network is built, in which the first layer has 128 units, the second layer has 64 units and the third layer has 32 units; for each pair of adjacent layers, the former layer serves as the visible units and the latter layer as the hidden units. The visible units and hidden units are fully connected, with connection weights {w}. A continuous restricted Boltzmann machine adds a zero-mean Gaussian-noise continuous stochastic component inside the sigmoid activation of the visible-layer units of an RBM network; its structure is otherwise the same as an RBM, consisting of a visible layer and a hidden layer whose units are connected across layers. Information can flow in both directions when the network is trained and used, and the weights in the two directions are equal, i.e. w_ij = w_ji. Let s_j denote the output of neuron j and {s_i} the states of the input neurons, with the hidden-layer state h_j represented by s_j and the visible-layer state v_i represented by s_i; then:

s_j = φ_j( Σ_i w_ij·s_i + n_j )

where the noise component n_j = σ·N_j(0,1), N_j(0,1) is a Gaussian random variable with mean 0 and variance 1, and the constant σ scales it, so that n_j has the probability distribution

p(n_j) = (1/√(2πσ²))·exp(−n_j²/(2σ²));

φ_j(·) is the sigmoid-shaped function

φ_j(x) = θ_L + (θ_H − θ_L)·1/(1 + exp(−a_j·x)),

where θ_L and θ_H are its lower and upper asymptotes and the parameter a_j controls its slope. As a_j grows from small to large, the unit transitions smoothly from a noise-free deterministic state to a binary stochastic state; if a_j keeps the sigmoid effectively linear over the noise range, then s_j follows a Gaussian distribution whose mean is determined by the weighted input Σ_i w_ij·s_i and whose variance is σ².
(2) Training:
The CRBM network parameters are trained with a minimizing-contrastive-divergence (MCD) weight-update algorithm, which requires only simple additions and multiplications. The MCD training criterion updates the connection weights {w_ij} and the slope-control parameters {a_j} of the sigmoid function:

Δw_ij ∝ ⟨s_i·s_j⟩ − ⟨ŝ_i·ŝ_j⟩

where ŝ_j denotes the one-step sampled (reconstructed) state of neuron j and ⟨·⟩ denotes the mean over the training set. The simplified a_j update rule is:

Δa_j ∝ (1/a_j²)·(⟨s_j²⟩ − ⟨ŝ_j²⟩)
(3) Supervised fine-tuning:
The weights obtained after the CRBM is trained with the contrastive divergence algorithm are already close to the global optimum, and a back-propagation algorithm is used for fine-tuning. The desired output target {V'_i} is set equal to the input data {V_i}; the error between the model output and the input is used to adjust every weight gradient until the error converges. The parameters to be adjusted are the connection weights between layers, the bias weights of each layer, and the sigmoid slope parameters a_j. The objective function is

J(W,b,a) = (1/2)·‖F_W,b,a(x) − x‖²

where x is the network input data value and F_W,b,a(x) is the network output value. For each output neuron i of the L-th (output) layer, the residual is

δ_i^(L) = −(x_i − F_W,b,a(x)_i)·f′(z_i^(L)).

For l = L−1, ..., 2, the residual of the i-th neuron node in layer l is

δ_i^(l) = ( Σ_j w_ij^(l)·δ_j^(l+1) )·f′(z_i^(l)),

where f(z_i) is the neuron activation function

f(z) = θ_L + (θ_H − θ_L)·1/(1 + exp(−z)).

For each layer l = L−1, ..., 2, the partial derivatives with respect to the connection weight parameters, the bias parameters and the slope-control parameters are:

∇_{W^l} J(W,b,a) = δ^(l+1)·(s^l)^T·a^l

∇_{b^l} J(W,b,a) = δ^(l+1)·a^l

∇_{a^l} J(W,b,a) = δ^(l+1)·(h^(l+1))^T

where:

h^l = W^(l−1)·s^(l−1) + b^(l−1) + σ·N(0,1)

z^l = a^(l−1)·h^l

s^l = f(z^l)

The gradients above are the gradient update for a single sample in the data set; to train on the whole data set one simply sums the per-sample gradients and takes their average. After the gradient values of all parameters are obtained, each parameter is optimized with a quasi-Newton optimization algorithm.
3. The CRBM and Fisher network-based mobile visual search framework of claim 1, wherein in step 2), the specific method for constructing and training the Fisher layer network is as follows:
The Gaussian mixture model is simplified with two assumptions:
(1) every Gaussian component in the GMM has equal weight, i.e. ω_k = 1;
(2) u_k(x) is simplified to the form

u_k(x) = exp( −(1/2)·‖(x − μ_k)/σ_k‖² ),

which is equivalent to assuming that all covariance matrices have the same determinant value. The simplified posterior γ_j(k) is

γ_j(k) = u_k(x_j) / Σ_n u_n(x_j).

Suppose w_k = 1/σ_k and b_k = −μ_k; the final Fisher layer then has the form

g_μk = (1/t)·Σ_j γ_j(k)·( w_k ⊙ (x_j + b_k) )

g_σk = (1/(t·√2))·Σ_j γ_j(k)·( (w_k ⊙ (x_j + b_k))² − 1 )

where ⊙ denotes an element-wise operation; γ_j(k) is a softmax function and w_k, b_k are the parameters of the k-th Gaussian component of the GMM. γ_j(k) contains the common computation term w_n ⊙ (x_ij + b_n), which is differentiable, and the remaining computations are linear or squaring operations, which are also differentiable, so the parameters can be learned through the back-propagation algorithm.
Because the simplified Fisher Vector algorithm is composed of these differentiable operations, the gradient of the error function with respect to all weights and bias values can be computed during network training by gradient descent with error back-propagation. CDVS mainly addresses large-scale image retrieval and image matching; in the large-scale image retrieval problem, the adaptive global binary feature acts in the first stage of the retrieval process: Hamming-distance matching with the global feature is performed in the server-side database to obtain a candidate set. The cross-entropy loss function is chosen:

L = −Σ_i Σ_c [ y_ic·log σ(s_ic) + (1 − y_ic)·log(1 − σ(s_ic)) ]

where s_i = [s_i1, ..., s_iC]^T is the score vector of image X_i; y_i = [y_i1, ..., y_iC]^T is the label vector; C is the number of classes in the dataset; and σ(x) is the sigmoid function, i.e.:

σ(x) = 1/(1 + exp(−x)).
CN201711493995.XA 2017-12-31 2017-12-31 Moving-vision search framework based on CRBM and Fisher networks Pending CN108108770A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711493995.XA CN108108770A (en) 2017-12-31 2017-12-31 Moving-vision search framework based on CRBM and Fisher networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711493995.XA CN108108770A (en) 2017-12-31 2017-12-31 Moving-vision search framework based on CRBM and Fisher networks

Publications (1)

Publication Number Publication Date
CN108108770A true CN108108770A (en) 2018-06-01

Family

ID=62215223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711493995.XA Pending CN108108770A (en) 2017-12-31 2017-12-31 Moving-vision search framework based on CRBM and Fisher networks

Country Status (1)

Country Link
CN (1) CN108108770A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108920727A (en) * 2018-08-03 2018-11-30 厦门大学 Compact visual in vision retrieval describes sub- deep neural network and generates model
CN109657800A (en) * 2018-11-30 2019-04-19 清华大学深圳研究生院 Intensified learning model optimization method and device based on parametric noise
CN112926273A (en) * 2021-04-13 2021-06-08 中国人民解放***箭军工程大学 Method for predicting residual life of multivariate degradation equipment
CN113780301A (en) * 2021-07-26 2021-12-10 天津大学 Self-adaptive denoising machine learning application method for defending against attack

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101383023A (en) * 2008-10-22 2009-03-11 西安交通大学 Neural network short-term electric load prediction based on sample dynamic organization and temperature compensation
CN102521607A (en) * 2011-11-30 2012-06-27 西安交通大学 Near-optimal skin-color detection method under Gaussian frame
CN102708383A (en) * 2012-05-21 2012-10-03 广州像素数据技术开发有限公司 System and method for detecting living face with multi-mode contrast function
US20140198998A1 (en) * 2013-01-14 2014-07-17 Samsung Electronics Co., Ltd. Novel criteria for gaussian mixture model cluster selection in scalable compressed fisher vector (scfv) global descriptor
CN106405640A (en) * 2016-08-26 2017-02-15 中国矿业大学(北京) Automatic microseismic signal arrival time picking method based on depth belief neural network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101383023A (en) * 2008-10-22 2009-03-11 西安交通大学 Neural network short-term electric load prediction based on sample dynamic organization and temperature compensation
CN102521607A (en) * 2011-11-30 2012-06-27 西安交通大学 Near-optimal skin-color detection method under Gaussian frame
CN102708383A (en) * 2012-05-21 2012-10-03 广州像素数据技术开发有限公司 System and method for detecting living face with multi-mode contrast function
US20140198998A1 (en) * 2013-01-14 2014-07-17 Samsung Electronics Co., Ltd. Novel criteria for gaussian mixture model cluster selection in scalable compressed fisher vector (scfv) global descriptor
CN106405640A (en) * 2016-08-26 2017-02-15 中国矿业大学(北京) Automatic microseismic signal arrival time picking method based on depth belief neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN HUANG et al.: "DEEP-BASED FISHER VECTOR FOR MOBILE VISUAL SEARCH", 2017 IEEE International Conference on Image Processing *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108920727A (en) * 2018-08-03 2018-11-30 厦门大学 Compact visual in vision retrieval describes sub- deep neural network and generates model
CN109657800A (en) * 2018-11-30 2019-04-19 清华大学深圳研究生院 Intensified learning model optimization method and device based on parametric noise
CN112926273A (en) * 2021-04-13 2021-06-08 中国人民解放***箭军工程大学 Method for predicting residual life of multivariate degradation equipment
CN113780301A (en) * 2021-07-26 2021-12-10 天津大学 Self-adaptive denoising machine learning application method for defending against attack
CN113780301B (en) * 2021-07-26 2023-06-27 天津大学 Self-adaptive denoising machine learning application method for defending against attack

Similar Documents

Publication Publication Date Title
US20210201147A1 (en) Model training method, machine translation method, computer device, and storage medium
US12008810B2 (en) Video sequence selection method, computer device, and storage medium
US20210342643A1 (en) Method, apparatus, and electronic device for training place recognition model
CN108108770A (en) Moving-vision search framework based on CRBM and Fisher networks
WO2022199504A1 (en) Content identification method and apparatus, computer device and storage medium
CN116797684B (en) Image generation method, device, electronic equipment and storage medium
KR102604306B1 (en) Image table extraction method, apparatus, electronic device and storage medium
US20230401833A1 (en) Method, computer device, and storage medium, for feature fusion model training and sample retrieval
CN113962965B (en) Image quality evaluation method, device, equipment and storage medium
CN113254684B (en) Content aging determination method, related device, equipment and storage medium
CN113569129A (en) Click rate prediction model processing method, content recommendation method, device and equipment
CN110555102A (en) media title recognition method, device and storage medium
KR20220018633A (en) Image retrieval method and device
CN117635275B (en) Intelligent electronic commerce operation commodity management platform and method based on big data
CN110855487A (en) Network user similarity management method, device and storage medium
CN111988668B (en) Video recommendation method and device, computer equipment and storage medium
CN111695323B (en) Information processing method and device and electronic equipment
CN113420179A (en) Semantic reconstruction video description method based on time sequence Gaussian mixture hole convolution
CN111898658B (en) Image classification method and device and electronic equipment
CN110688508B (en) Image-text data expansion method and device and electronic equipment
CN114298961A (en) Image processing method, device, equipment and storage medium
CN113822291A (en) Image processing method, device, equipment and storage medium
CN115269901A (en) Method, device and equipment for generating extended image
CN114329236A (en) Data processing method and device
CN112287697A (en) Method for accelerating running speed of translation software in small intelligent mobile equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180601