CN112465024A - Image pattern mining method based on feature clustering - Google Patents

Image pattern mining method based on feature clustering

Info

Publication number
CN112465024A
CN112465024A (application CN202011353678.XA)
Authority
CN
China
Prior art keywords
network
layer
clustering
feature
pictures
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011353678.XA
Other languages
Chinese (zh)
Inventor
梁雪峰
王倩楠
朱照延
石惠文
周颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202011353678.XA priority Critical patent/CN112465024A/en
Publication of CN112465024A publication Critical patent/CN112465024A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an image pattern mining method based on feature clustering, which mainly solves the problem that the visual patterns mined by the prior art cannot be both discriminative and frequent. The scheme is: acquire pictures and divide them into a training set and a test set; train an AlexNet network with the training set; screen pictures from the trained network for mining visual representations; obtain high-level relevance features by layer-wise relevance propagation through the network; cluster the high-level relevance features to obtain frequent relevance features; and back-propagate the relevance features to obtain visual representations that are both discriminative and frequent. The method converts the pattern mining task into a classification task to improve discriminability, improves frequency by density clustering of the relevance features, and back-propagates the layer-wise relevance to the original image to locate its representative regions and obtain the visual patterns, improving both the discriminability and the frequency of the mined visual patterns. It can be used for representing image patterns in natural scenes and tourism.

Description

Image pattern mining method based on feature clustering
Technical Field
The invention belongs to the technical field of image processing, and further relates to an image pattern mining method that can be used for representing image patterns in natural scenes and tourism.
Background
Feature clustering refers to clustering the features learned by a neural network. As an unsupervised learning algorithm, clustering ensures that similar samples are close together in feature space while dissimilar samples are far apart. Clustering algorithms automatically group similar samples into one category: samples are divided into different categories according to their pairwise similarity, different similarity measures yield different clustering results, and a commonly used measure is the Euclidean distance.
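As an illustration of the Euclidean-distance similarity underlying such clustering, the following minimal numpy sketch computes a pairwise distance matrix; the sample count, dimensionality, and random data are illustrative assumptions, not values from the invention.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.random((100, 256))              # 100 samples, 256-D feature vectors (toy data)
diff = features[:, None, :] - features[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))  # (100, 100) pairwise Euclidean distance matrix
nearest = distances.argsort(axis=1)[:, 1]      # nearest neighbor of each sample (column 0 is the sample itself)
```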
Pattern mining is an important topic in data mining research and underlies many important data mining tasks such as association rules, correlation analysis, sequential patterns, causality, episodes, partial periodicity, and emerging patterns. Frequent patterns therefore have wide applications, such as shopping-basket data analysis, cross-marketing, web page prefetching, and personalized websites.
In recent years, how to mine visual patterns from large numbers of photos has become an important problem, and several works have studied it. Early work mainly used hand-crafted methods to extract image features. For example, David G. Lowe proposed the SIFT feature in the 1999 article "Object recognition from local scale-invariant features", and Carl Doersch et al. used the HOG descriptor in the 2015 article "What makes Paris look like Paris?". The local features extracted by these methods have limited ability to represent the semantic information of an image. Later, researchers used convolutional neural networks, whose features capture hierarchical and high-level semantic representations of images, and exploited these features to mine visual patterns in photographs.
Et al, in its published paper "Mining mid-level visual patterns with deep CNN activities" (2017 IJCV conference paper), propose a pattern Mining method based on convolutional neural network CNN and association rules. The method uses a convolutional neural network for feature representation and association rules for mining visual patterns. Because the discriminative information is at the position where the CNN activation value response is large, the discriminative performance of the mode is ensured by extracting the top K activation value index, and then association rule mining is carried out after the discrete indexes are converted into transactions, so that the visual mode which is frequently discriminated is obtained. The method has the disadvantages that some judgment information is lost in the mode of dividing the image into the image blocks, and the occupied memory is excessive.
Wei Zhang et al., in the paper "Binarized mode seeking for scalable visual pattern discovery" (CVPR, 2017), propose a method based on binary mode seeking. Images are fed into a VGG19 network to extract the 4096-dimensional FC7 features, the features are mapped from Euclidean space to a binary space to reduce feature storage, the images are clustered with a mean-shift algorithm to establish frequency, and a contrast set is introduced to find frequent and discriminative images. Its disadvantage is that it can only find frequent and discriminative images; it cannot localize the patterns within them.
Yang L et al, in their published paper "Learning discrete visual elements using part-based connected neural network" (2018 neuro-computing conference paper), propose to use the hierarchical abstraction principle of convolutional neural networks and maximum threshold analysis to add part-level structure in the network, where the structure is composed of conv, SPP, and Relu, and locate patterns with discriminant in the image by using unsupervised maximum threshold analysis method. The disadvantage of this method is that the frequency of the found patterns cannot be guaranteed.
Hongzhi Li et al, in their published paper "Pattern: Visual Pattern mining with deep neural network" (the 2018 ICMR conference paper), propose to use the filter of the last convolutional layer of the convolutional neural network to find the Visual pattern. The implementation scheme is as follows: firstly, the last convolutional layer of a pre-trained Alexnet network is connected with a global maximum pooling layer, 256-dimensional vectors are generated, 20 output neurons of a full connection layer are set, the previous parameters are fixed, only the full connection layer is trained, and the number of the output neurons is also the number of visual patterns; and then finding the first three convolution kernels with the largest contribution corresponding to each visual mode, and performing deconvolution on feature maps of the three convolution kernels to find the position of the visual mode corresponding to the original image. The method has the disadvantages that without theoretical basis, the found visual mode only comes from the maximum pooling layer, and the visual mode cannot be guaranteed to frequently appear in the image data set.
Disclosure of Invention
Aiming at the shortcomings of the prior art, the object of the invention is to provide an image pattern mining method based on feature clustering that finds visual patterns in tourism data that are both discriminative and frequent.
The technical idea for achieving this object is: find discriminative pictures by designing an image classification task; find visual representations with frequency by a density clustering algorithm; and locate the visual patterns in an image by layer-wise relevance propagation (LRP).
According to the technical idea, the specific implementation of the invention comprises the following steps:
(1) acquiring pictures and dividing the pictures into a training set and a test set:
(1a) acquire 20 classes of picture data, about 100,000 pictures in total;
(1b) select 1,000 pictures from each class, 20,000 pictures in total, as the test set for mining visual representations, and use the remaining pictures as the training set for training the convolutional neural network.
(2) Training a classification network AlexNet:
(2a) scaling the pictures in the training set to 227 x 227;
(2b) input the scaled pictures into an AlexNet network and train it until convergence, obtaining a fine-tuned classification network, i.e. the AlexNet model, which comprises a feature extraction part and a classification part; the feature extraction part comprises 5 convolutional layers, and the classification part comprises a first fully connected layer, a second fully connected layer, and an output layer.
(3) The screening pictures are used to mine the visual representation:
(3a) set a screening threshold T_p;
(3b) input the test set into the trained AlexNet classification model, and take the pictures whose output-layer value exceeds the threshold T_p as discriminative pictures.
(4) Acquiring related characteristics of a high layer:
(4a) using layer-wise relevance propagation, back-propagate the network output to the second fully connected layer of the AlexNet model, obtaining high-level relevance features of the discriminative pictures for feature clustering.
(5) Obtaining the related characteristics with frequency:
(5a) cluster the high-level relevance features with a density-based clustering algorithm;
(5b) select the 20 highest-density relevance feature vectors in each cluster of the clustering result, thereby obtaining high-level relevance features with frequency.
(6) Obtaining a visual representation with discriminability and frequency:
(6a) following layer-wise relevance propagation, continue back-propagating through the 5 convolutional layers of the AlexNet model until the original image at the input layer is reached;
(6b) the regions of the original image corresponding to these high-level relevance features are the visual patterns mined by the invention, which are both discriminative and frequent.
Compared with the prior art, the invention has the following advantages:
First, by using layer-wise relevance propagation, the invention can locate discriminative visual patterns in images;
second, by clustering the relevance features in feature space with density-based clustering, it finds frequent visual patterns;
third, by combining layer-wise relevance propagation with density-based relevance feature clustering, it finds visual patterns that are both discriminative and frequent, overcoming the prior-art limitation of finding only discriminative or only frequent visual patterns.
Experiments show that both the discriminability and the frequency of the visual patterns mined by the invention are higher than those of other state-of-the-art methods.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
Fig. 2 shows ten visual representations mined from five classes of pictures using the present invention;
Fig. 3 shows visual representations mined from one class of pictures using the present invention and four other state-of-the-art methods.
Detailed Description
The embodiment and effects of the present invention will be further described with reference to fig. 1.
Referring to fig. 1, the specific steps of this embodiment are as follows.
Step 1, obtaining pictures and dividing the pictures into a training set and a testing set.
1.1) acquire 20 classes of picture data from the TripAdvisor website, about 100,000 pictures in total;
1.2) select 1,000 pictures from each class, 20,000 pictures in total, as the test set for mining visual representations, and use the remaining pictures as the training set for training the convolutional neural network.
Step 2, training the classification network AlexNet.
2.1) scaling the pictures in the training set to 227 x 227;
2.2) input the scaled pictures into the AlexNet network for iterative training:
2.2.1) choose the cross-entropy loss as the loss function, choose Adam as the optimizer, and set the number of neurons in the network output layer to 20;
2.2.2) initialize the AlexNet network parameters, setting the initial iteration count K = 1 and the learning rate L = 1e-3;
2.2.3) compute the network loss L_CE using the cross-entropy loss function:

$$L_{CE} = -\frac{1}{n}\sum_{j=1}^{n}\sum_{i=1}^{m} y_{ji}\,\log \hat{y}_{ji}$$

where m is the number of classes, n is the number of images per class, y_ji is the image label, and ŷ_ji is the output of the network;
2.2.4) judge whether the loss L_CE is still decreasing: if so, add 1 to K and return to 2.2.3); otherwise, when the loss begins to oscillate and no longer decreases, stop training and save the AlexNet network model at that point.
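A minimal PyTorch sketch of this training step is given below, assuming torchvision's AlexNet with its output layer sized to 20 classes, Adam with learning rate 1e-3, and cross-entropy loss as stated above. The FakeData stand-in for the TripAdvisor training set, the batch size, the iteration cap, and the simple loss-based stopping test are illustrative assumptions, not the invention's exact settings.

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# FakeData is a stand-in for the 80,000-picture training set (assumption).
train_set = datasets.FakeData(size=64, image_size=(3, 227, 227), num_classes=20,
                              transform=transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(train_set, batch_size=16, shuffle=True)

model = models.alexnet(num_classes=20).to(device)   # output layer sized to 20 classes
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()                   # the L_CE of step 2.2.3

prev_loss = float("inf")
for k in range(100):                                # iteration counter K (cap is an assumption)
    epoch_loss = 0.0
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images.to(device)), labels.to(device))
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    if epoch_loss >= prev_loss:                     # loss no longer decreasing: stop (simplified test)
        break
    prev_loss = epoch_loss

torch.save(model.state_dict(), "alexnet_finetuned.pth")
```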
Step 3, screening pictures for mining visual representations.
3.1) set the screening threshold T_p;
3.2) input the test set into the trained AlexNet classification model, and take the pictures whose output-layer value exceeds the threshold T_p as discriminative pictures.
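A hedged sketch of this screening step, continuing the training sketch above (it reuses model and device from there): pictures whose maximum softmax output exceeds T_p are kept as discriminative. The value T_p = 0.9 and the FakeData test loader are illustrative assumptions; the patent does not fix T_p numerically.

```python
import torch
import torch.nn.functional as F
from torchvision import datasets, transforms

# FakeData is a stand-in for the 20,000-picture test set (assumption).
test_set = datasets.FakeData(size=64, image_size=(3, 227, 227), num_classes=20,
                             transform=transforms.ToTensor())
test_loader = torch.utils.data.DataLoader(test_set, batch_size=16)

T_p = 0.9                                        # illustrative threshold (assumption)
model.eval()
kept = []
with torch.no_grad():
    for images, _ in test_loader:
        probs = F.softmax(model(images.to(device)), dim=1)
        confidence = probs.max(dim=1).values     # output-layer response per picture
        kept.extend(img for img, c in zip(images, confidence) if c.item() > T_p)
```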
Step 4, acquiring high-level relevance features.
Using layer-wise relevance propagation, the test-set pictures whose outputs exceed the threshold T_p continue to propagate backwards through the model:
The i-th neuron of layer l in the network maps a set of inputs x_i to an output x_j through the weights w_ij and an activation function h(·):

$$x_j = h\Big(\sum_i x_i w_{ij} + b_j\Big)$$

and the relevance R_i of an input neuron x_i is computed from all the relevances R_j of the outputs x_j; back-propagation proceeds according to the following formula:

$$R_i^{(l)} = \sum_j \frac{x_i^{(l)}\, w_{ij}^{(l,l+1)}}{\sum_k x_k^{(l)}\, w_{kj}^{(l,l+1)}}\, R_j^{(l+1)}$$

where R_i^(l) is the relevance of the i-th neuron in layer l, R_j^(l+1) is the relevance of the j-th neuron in layer l+1, x_k^(l) is the activation of the k-th neuron in layer l, w_kj^(l,l+1) is the weight between the j-th neuron in layer l+1 and the k-th neuron in layer l, and w_ij^(l,l+1) is the weight between the j-th neuron in layer l+1 and the i-th neuron in layer l.
Following this back-propagation rule, the test-set pictures whose network output exceeds the threshold T_p are back-propagated to the second fully connected layer of the AlexNet model, and the relevance features of that layer are saved for density clustering.
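The back-propagation rule above can be sketched for a single fully connected layer as follows; the numpy implementation, the epsilon stabiliser, and the toy shapes are assumptions added for illustration and numerical safety.

```python
import numpy as np

def lrp_linear(x, w, relevance_upper, eps=1e-9):
    """Redistribute layer-(l+1) relevances R_j onto layer-l inputs x_i.

    x: (d_in,) activations of layer l
    w: (d_in, d_out) weights from layer l to layer l+1
    relevance_upper: (d_out,) relevances R_j of layer l+1
    """
    z = x[:, None] * w                           # contributions x_i * w_ij
    s = z.sum(axis=0)                            # denominators sum_k x_k * w_kj
    s = s + eps * np.where(s >= 0, 1.0, -1.0)    # stabiliser to avoid division by zero (assumption)
    return (z / s * relevance_upper[None, :]).sum(axis=1)

rng = np.random.default_rng(0)
x = rng.random(4096)                             # e.g. second-fully-connected-layer activations (toy)
w = rng.standard_normal((4096, 20))
R_out = np.zeros(20)
R_out[3] = 1.0                                   # start relevance at one output class
R_in = lrp_linear(x, w, R_out)                   # relevances of the 4096 input neurons
```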
Step 5, obtaining relevance features with frequency.
5.1) clustering the high-level related features by using a density-based clustering algorithm:
5.1.1) set the parameter radius r = 0.35 and the minimum number of points m = 20;
5.1.2) in the relevance feature space, mark each relevance feature vector:
if the number of points within the radius of a feature point is greater than the minimum number of points m, the feature point is marked as a core point;
if a feature point contains fewer than the minimum number of points within its radius but contains a core point, it is marked as a boundary point;
if a feature point contains fewer than the minimum number of points within its radius and contains no core point, it is marked as a noise point.
5.1.3) connect the core points with the boundary points belonging to them to form clusters; each cluster is one class of mined visual patterns;
5.2) select the 20 highest-density relevance feature vectors in each cluster of the clustering result, thereby obtaining high-level relevance features with frequency.
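A sketch of this clustering step using scikit-learn's DBSCAN, whose eps and min_samples match the stated radius r = 0.35 and minimum point count m = 20; the synthetic feature matrix and the use of the within-radius neighbor count as the density measure are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
features = rng.random((2000, 64))                # toy stand-in for the relevance features

labels = DBSCAN(eps=0.35, min_samples=20).fit_predict(features)

# Density of each point: number of neighbors within radius r (assumed density measure).
nn = NearestNeighbors(radius=0.35).fit(features)
neighbor_lists = nn.radius_neighbors(features, return_distance=False)
density = np.array([len(n) for n in neighbor_lists])

top20 = {}
for c in set(labels) - {-1}:                     # label -1 marks noise points
    idx = np.where(labels == c)[0]
    top20[c] = idx[np.argsort(-density[idx])[:20]]   # 20 densest members per cluster
```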
Step 6, obtaining visual representations that are both discriminative and frequent.
Following layer-wise relevance propagation, continue back-propagating through the 5 convolutional layers of the AlexNet model until the original image at the input layer is reached;
the regions of the original image corresponding to these high-level relevance features are the mined visual patterns, which are both discriminative and frequent.
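As an illustration of this final localization, the sketch below thresholds an input-layer relevance map and returns the bounding box of the high-relevance region; the percentile threshold and the bounding-box heuristic are assumptions, since the patent does not specify how relevance values are converted into regions.

```python
import numpy as np

def locate_pattern(relevance_map, percentile=90):
    """Bounding box (y0, y1, x0, x1) of the high-relevance region of an LRP map."""
    mask = relevance_map > np.percentile(relevance_map, percentile)
    ys, xs = np.nonzero(mask)
    return ys.min(), ys.max(), xs.min(), xs.max()

heatmap = np.random.default_rng(0).random((227, 227))  # toy stand-in for input-layer relevance
print(locate_pattern(heatmap))
```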
The effect of the present invention is further illustrated by the following simulations:
the four existing methods used in the simulation experiment are:
Class 1: the pattern mining method proposed by Yao Li et al. in "Mining mid-level visual patterns with deep CNN activations", IJCV, vol. 121, no. 3, pp. 344-364, 2017, abbreviated MDPM.
Class 2: the pattern mining method proposed by Wei Zhang et al. in "Binarized mode seeking for scalable visual pattern discovery", CVPR, 2017, pp. 3864-3872, abbreviated CBMS.
Class 3: the pattern mining method proposed by Lingxiao Yang et al. in "Learning discriminative visual elements using part-based convolutional neural network", Neurocomputing, vol. 316, pp. 135-143, 2018, abbreviated P-CNN.
Class 4: the pattern mining method proposed by Hongzhi Li et al. in "PatternNet: Visual pattern mining with deep neural network", ICMR, 2018, pp. 291-299, abbreviated PatternNet.
1. Simulation experiment conditions are as follows:
the hardware platform of the simulation experiment of the invention is as follows: the Dall computer has the CPU model of Intel (R) E5-2603, the frequency of 1.60GHz, the GPU model of GeForce GTX 2080 and the video memory 11G.
The software platform of the simulation experiment of the invention is as follows: ubuntu 18.0 system, Python 3.6, pyrroch 1.2.0.
The simulation experiment of the invention uses 20 types of picture data of input images with more than 10 ten thousand pictures, wherein each type of data exceeds 3500 pictures, and the pictures are divided into two parts for experiment: firstly, 20000 pictures in the test set: each type has 1000 pictures for mining visual representation; and secondly, training the convolutional neural network model by using 8 ten thousand pictures in the training set.
2. Simulation content and analysis of results:
Simulation experiment 1: pattern mining is performed on five classes of picture data with the DRFC method of the invention under the above simulation conditions; the results are shown in Fig. 2, where:
Fig. 2(a) is from The Little Mermaid class, Fig. 2(b) from the Santa Justa Lift class, Fig. 2(c) from the Lisbon District Central Port class, Fig. 2(d) from the Merlion Park Singapore class, and Fig. 2(e) from the Kiyomizu-dera Temple class.
As can be seen from Fig. 2, the visual representations mined by the invention contain representative targets and represent these different categories of data well. Notably, more than one visual representation may be mined per category. Representative representations of a statue, a tram, and a pagoda are shown in Figs. 2(a), 2(c), and 2(e) respectively, while Figs. 2(b) and 2(d) each contain two types of visual representations for one sight, marked by red boxes (black boxes in the corresponding grayscale figure) and green boxes (gray boxes), respectively: Fig. 2(b) shows two different viewing angles, looking up at the tower and looking down from its top, and Fig. 2(d) shows the statue by day and by night.
Simulation experiment 2: the present invention and the four existing methods MDPM, CBMS, P-CNN, and PatternNet are used to perform pattern mining on one class of picture data under the above simulation conditions; the results are shown in Fig. 3.
Fig. 3(a) shows the visual pattern mining experiment of the prior-art MDPM method on one class of picture data under the above simulation conditions.
Fig. 3(b) shows the same experiment with the prior-art CBMS method.
Fig. 3(c) shows the same experiment with the prior-art P-CNN method.
Fig. 3(d) shows the same experiment with the prior-art PatternNet method.
Fig. 3(e) shows the same experiment with the DRFC method of the present invention.
Comparing the panels of Fig. 3, the mining result of the MDPM method is the worst (marked with a blue box; a gray box in the corresponding grayscale figure), because mining visual patterns with image blocks can lose parts of the representative objects. CBMS only finds frequent images, not the visual patterns within them (marked with a yellow box; a white box in the grayscale figure). P-CNN and PatternNet can find visual patterns, but some of the mined patterns contain no target or an incomplete target (marked with red boxes; black boxes in the grayscale figure). In contrast, the invention finds consistent instances of the visual representation.
Simulation experiment 3: the discriminability of the mined visual patterns is evaluated with the present invention and the four existing methods MDPM, CBMS, P-CNN, and PatternNet under the above simulation conditions. All results are listed in Table 1:

Table 1. Comparison of pattern classification accuracy between the invention and the prior-art mining methods in the simulation experiments

Method         MDPM    CBMS    P-CNN   PatternNet   Invention
Accuracy (%)   84.08   94.75   96.75   90.00        99.54
As can be seen from Table 1, the average accuracy of the method is 99.54%, higher than all four prior-art methods, showing that the visual patterns obtained by the invention have high discriminability. MDPM has the lowest accuracy because it samples the image into image blocks, which loses discriminative information.
Simulation experiment 4: the frequency of the mined visual patterns is evaluated with the present invention and the four existing methods MDPM, CBMS, P-CNN, and PatternNet under the above simulation conditions:
The frequency rate FR is calculated with the following formula, and all results are listed in Table 2:

$$FR = \frac{1}{N_w}\sum_{w=1}^{N_w} \frac{1}{N_u N}\sum_{u=1}^{N_u}\sum_{i=1}^{N} \mathbb{1}\big[\cos(f_i^w, g_u^w) > T_f\big]$$

where cos(·,·) is the cosine similarity, f_i^w and g_u^w are both taken from the feature maps of the last convolutional layer of the network, f_i^w is the feature map of one of the photos from the w-th attraction, g_u^w is the feature map of a visual representation mined from the w-th attraction, N_w, N_u, and N are respectively the number of sights, the number of visual representations, and the number of photos per sight, and T_f is a similarity threshold.
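A numpy sketch of this frequency-rate computation, under the reconstruction above: for each sight, the fraction of photo/representation feature pairs whose cosine similarity exceeds T_f is computed, then averaged over sights. The shapes, random features, and T_f = 0.5 are illustrative assumptions.

```python
import numpy as np

def frequency_rate(photo_feats, pattern_feats, T_f=0.5):
    """photo_feats: per-sight list of (N, d) photo features;
    pattern_feats: per-sight list of (N_u, d) mined-representation features."""
    per_sight = []
    for F, G in zip(photo_feats, pattern_feats):
        Fn = F / np.linalg.norm(F, axis=1, keepdims=True)
        Gn = G / np.linalg.norm(G, axis=1, keepdims=True)
        sim = Fn @ Gn.T                          # pairwise cosine similarities
        per_sight.append((sim > T_f).mean())     # fraction of pairs above T_f
    return float(np.mean(per_sight))

rng = np.random.default_rng(0)
photos = [rng.random((100, 256)) for _ in range(5)]    # 5 sights, 100 photos each (toy)
patterns = [rng.random((20, 256)) for _ in range(5)]   # 20 representations per sight (toy)
print(frequency_rate(photos, patterns))
```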
Table 2. Comparison of the frequency rate (FR) of patterns mined by the different methods at different cosine similarity thresholds T_f

(The values of Table 2 appear only as an image in the source publication and are not reproduced here.)
As can be seen from Table 2, the frequency rate FR of the visual representations mined by the invention is the highest at all thresholds T_f. Although MDPM uses a frequent pattern mining algorithm, its result is the worst; CBMS uses a mean-shift algorithm to find frequent images, and its result is lower than those of P-CNN and the invention; PatternNet and P-CNN focus on mining discriminative patterns, which are also relatively frequent.
The above simulation experiments show that the DRFC method proposed by the invention, based on feature clustering, solves the problem of mining visual patterns from massive photos. Compared with existing work that mines patterns that are only frequent or only discriminative, the visual patterns mined by the method are both discriminative and frequent. The classification and frequency-rate experiments also show that the method achieves higher accuracy and frequency rate than the four other state-of-the-art methods, demonstrating its effectiveness for visual pattern mining tasks.

Claims (4)

1. An image pattern mining method based on feature clustering is characterized by comprising the following steps:
(1) acquire 20 classes of picture data, about 100,000 pictures in total; select 1,000 pictures from each class, 20,000 pictures in total, as the test set for mining visual representations, and use the remaining pictures as the training set for training the convolutional neural network;
(2) scale the pictures in the training set to 227 × 227, input the scaled pictures into an AlexNet network, and train it until convergence, obtaining a fine-tuned classification network, i.e. the AlexNet model, which comprises a feature extraction part and a classification part; the feature extraction part comprises 5 convolutional layers, and the classification part comprises a first fully connected layer, a second fully connected layer, and an output layer;
(3) set a screening threshold T_p, input the test set into the AlexNet classification model, and take the pictures whose output-layer value exceeds the threshold T_p as discriminative pictures;
(4) using layer-wise relevance propagation, continue back-propagating through the model the test-set pictures from (3) whose outputs exceed the threshold T_p, until the second fully connected layer of the AlexNet model is reached, obtaining high-level relevance features of the discriminative pictures for feature clustering;
(5) perform feature clustering on the high-level relevance features obtained in (4) with a density-based clustering algorithm, and select the 20 highest-density relevance feature vectors in each cluster of the clustering result, obtaining high-level relevance features with frequency;
(6) following layer-wise relevance propagation, continue back-propagating the frequent high-level relevance features obtained in (5) through the 5 convolutional layers of the AlexNet model until the original image at the input layer is reached; the regions of the original image corresponding to these high-level relevance features are the mined visual patterns, which are both discriminative and frequent.
2. The method of claim 1, wherein the scaled pictures in (2) are input into the AlexNet network to train the network, implemented as follows:
(2a) initialize the AlexNet network parameters, setting the initial iteration count K = 1 and the learning rate L = 1e-3;
(2b) compute the network loss L_CE using the cross-entropy loss function:

$$L_{CE} = -\frac{1}{n}\sum_{j=1}^{n}\sum_{i=1}^{m} y_{ji}\,\log \hat{y}_{ji}$$

where m is the number of classes, n is the number of images per class, y_ji is the image label, and ŷ_ji is the output of the network;
(2c) judge whether the loss L_CE is still decreasing: if so, add 1 to K and return to step (2b); otherwise, when the loss begins to oscillate and no longer decreases, stop training and save the AlexNet network model at that point.
3. The method of claim 1, wherein the layer-wise relevance propagation in (4) continues to back-propagate through the model the test-set pictures from (3) whose outputs exceed the threshold T_p, implemented as follows:
(4a) the i-th neuron of layer l in the network maps a set of inputs x_i to an output x_j through the weights w_ij and an activation function h(·):

$$x_j = h\Big(\sum_i x_i w_{ij} + b_j\Big)$$

and the relevance R_i of an input neuron x_i is computed from all the relevances R_j of the outputs x_j; back-propagation proceeds according to the following formula:

$$R_i^{(l)} = \sum_j \frac{x_i^{(l)}\, w_{ij}^{(l,l+1)}}{\sum_k x_k^{(l)}\, w_{kj}^{(l,l+1)}}\, R_j^{(l+1)}$$

where R_i^(l) is the relevance of the i-th neuron in layer l, R_j^(l+1) is the relevance of the j-th neuron in layer l+1, x_k^(l) is the activation of the k-th neuron in layer l, w_kj^(l,l+1) is the weight between the j-th neuron in layer l+1 and the k-th neuron in layer l, and w_ij^(l,l+1) is the weight between the j-th neuron in layer l+1 and the i-th neuron in layer l;
(4b) following this back-propagation rule, back-propagate the test-set pictures whose network output exceeds the threshold T_p to the second fully connected layer of the AlexNet model, and save the relevance features of that layer for density clustering.
4. The method of claim 1, wherein the density-based clustering of the relevance features in (5) is implemented as follows:
(5a) set the parameter radius r = 0.35 and the minimum number of points m = 20;
(5b) in the relevance feature space, mark each relevance feature vector: if the number of points within the radius of a feature point is greater than the minimum number of points m, the feature point is marked as a core point; if a feature point contains fewer than the minimum number of points within its radius but contains a core point, it is marked as a boundary point; if a feature point contains fewer than the minimum number of points within its radius and contains no core point, it is marked as a noise point;
(5c) connect the core points with the boundary points belonging to them to form clusters; each cluster is one class of mined visual patterns.
CN202011353678.XA 2020-11-26 2020-11-26 Image pattern mining method based on feature clustering Pending CN112465024A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011353678.XA CN112465024A (en) 2020-11-26 2020-11-26 Image pattern mining method based on feature clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011353678.XA CN112465024A (en) 2020-11-26 2020-11-26 Image pattern mining method based on feature clustering

Publications (1)

Publication Number Publication Date
CN112465024A true CN112465024A (en) 2021-03-09

Family

ID=74808918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011353678.XA Pending CN112465024A (en) 2020-11-26 2020-11-26 Image pattern mining method based on feature clustering

Country Status (1)

Country Link
CN (1) CN112465024A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112950601A (en) * 2021-03-11 2021-06-11 成都微识医疗设备有限公司 Method, system and storage medium for screening pictures for esophageal cancer model training

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034224A (en) * 2018-07-16 2018-12-18 西安电子科技大学 Hyperspectral classification method based on double branching networks
CN110210555A (en) * 2019-05-29 2019-09-06 西南交通大学 Rail fish scale hurt detection method based on deep learning
CN110689081A (en) * 2019-09-30 2020-01-14 中国科学院大学 Weak supervision target classification and positioning method based on bifurcation learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034224A (en) * 2018-07-16 2018-12-18 西安电子科技大学 Hyperspectral classification method based on double branching networks
CN110210555A (en) * 2019-05-29 2019-09-06 西南交通大学 Rail fish scale hurt detection method based on deep learning
CN110689081A (en) * 2019-09-30 2020-01-14 中国科学院大学 Weak supervision target classification and positioning method based on bifurcation learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ALEX KRIZHEVSKY et al.: "ImageNet Classification with Deep Convolutional Neural Networks", Communications of the ACM *
ALEXANDER BINDER et al.: "Layer-wise Relevance Propagation for Neural Networks with Local Renormalization Layers", arXiv:1604.00825v1 *
QIANNAN WANG et al.: "Deep Relevance Feature Clustering for Discovering Visual Representation of Tourism Destination", PRCV 2020 *
FU JIAYUN et al.: "Review and progress of the spatial density clustering pattern mining method DBSCAN", Science of Surveying and Mapping (测绘科学) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112950601A (en) * 2021-03-11 2021-06-11 成都微识医疗设备有限公司 Method, system and storage medium for screening pictures for esophageal cancer model training
CN112950601B (en) * 2021-03-11 2024-01-09 成都微识医疗设备有限公司 Picture screening method, system and storage medium for esophageal cancer model training

Similar Documents

Publication Publication Date Title
CN107679250B (en) Multi-task layered image retrieval method based on deep self-coding convolutional neural network
CN106126581B (en) Cartographical sketching image search method based on deep learning
Yu et al. Exploiting the complementary strengths of multi-layer CNN features for image retrieval
CN104036012B (en) Dictionary learning, vision bag of words feature extracting method and searching system
Kadam et al. Detection and localization of multiple image splicing using MobileNet V1
CN105528575B (en) Sky detection method based on Context Reasoning
CN111680176A (en) Remote sensing image retrieval method and system based on attention and bidirectional feature fusion
CN108897791B (en) Image retrieval method based on depth convolution characteristics and semantic similarity measurement
Zheng et al. Differential Learning: A Powerful Tool for Interactive Content-Based Image Retrieval.
CN109492589A (en) The recognition of face working method and intelligent chip merged by binary features with joint stepped construction
Liu et al. Texture classification in extreme scale variations using GANet
Taheri et al. Effective features in content-based image retrieval from a combination of low-level features and deep Boltzmann machine
Zhao et al. An angle structure descriptor for image retrieval
Dong et al. Multilayer convolutional feature aggregation algorithm for image retrieval
Naiemi et al. Scene text detection using enhanced extremal region and convolutional neural network
Guo Research on sports video retrieval algorithm based on semantic feature extraction
CN112465024A (en) Image pattern mining method based on feature clustering
Maheswari et al. Facial expression analysis using local directional stigma mean patterns and convolutional neural networks
CN110674334B (en) Near-repetitive image retrieval method based on consistency region deep learning features
Chen et al. Large-scale indoor/outdoor image classification via expert decision fusion (edf)
JP4302799B2 (en) Document search apparatus, method, and recording medium
John et al. A multi-modal cbir framework with image segregation using autoencoders and deep learning-based pseudo-labeling
Patil et al. Improving the efficiency of image and video forgery detection using hybrid convolutional neural networks
CN116108217A (en) Fee evasion vehicle similar picture retrieval method based on depth hash coding and multitask prediction
Huang et al. Automatic image annotation using multi-object identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (Application publication date: 20210309)