CN114612675A - Visual saliency detection method and system based on multilayer non-local network - Google Patents


Info

Publication number
CN114612675A
CN114612675A (application CN202011337545.3A)
Authority
CN
China
Prior art keywords
convolution
layer
unit
convolutional layer
expansion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011337545.3A
Other languages
Chinese (zh)
Inventor
崔子冠
沈婷婷
王淑菲
张一帆
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202011337545.3A priority Critical patent/CN114612675A/en
Publication of CN114612675A publication Critical patent/CN114612675A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a visual saliency detection method and system based on a multilayer non-local network. The method acquires an image data set to be detected and inputs it into a pre-trained saliency detection model to obtain a set of saliency maps; the pre-trained saliency detection model is a multilayer non-local network model trained with minimization of a loss function as the objective. The advantages are that the multilayer non-local network enlarges the receptive field of the model: unlike repeated recurrent operations, the non-local module captures long-range dependencies directly by computing the interaction between any two pixels, unconstrained by the spatial distance between them, which improves efficiency and yields better results. During training, a loss function composed of relative entropy, normalized scanpath saliency, and structural similarity is adopted, so that the trained optimal model is more comprehensive and general.

Description

Visual saliency detection method and system based on multilayer non-local network
Technical Field
The invention relates to a visual saliency detection method and system based on a multilayer non-local network, and belongs to the technical field of saliency detection in image processing.
Background
The human visual system can quickly locate and attend to interesting content in images. Research on visual selective attention mechanisms that mimic this ability has made image saliency detection a research hotspot in recent years, with extremely important applications in target tracking, navigation, image quality assessment, face recognition, and video coding optimization.
Saliency detection aims to identify the most conspicuous objects in an input image. Since its development began around 1998, and taking 2014 as the dividing line, salient object detection can be split into two eras: traditional methods and deep learning methods.
Over the last two decades, traditional salient object detection methods have fallen into two groups according to the kind of visual cues they use and how they use them: first, those based on block-level versus region-level visual cues; second, those using only internal cues provided by the image itself versus those that also introduce external cues such as user annotations. In summary, traditional methods rely on large amounts of saliency prior information and mainly on hand-crafted features; such features may fail to describe complex image scenes and object structures and cannot adapt to new scenes and objects, so their generalization ability is poor, and saliency detection based on traditional methods has therefore hit a bottleneck.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a visual saliency detection method and system based on a multilayer non-local network.
In order to solve the above technical problems, the present invention provides a visual saliency detection method based on a multilayer non-local network, comprising:
acquiring an image data set to be detected;
inputting the image data set to be detected into a pre-trained saliency detection model to obtain a set of saliency maps, wherein the pre-trained saliency detection model is a multilayer non-local network model trained with minimization of a loss function as the objective.
Further, the multilayer non-local network model is a multilayer non-local neural network based on the VGG16 architecture.
Further, the multilayer non-local neural network based on the architecture of the VGG16 comprises an input layer, an encoder module, a multi-scale feature module, a decoder module and an output layer which are connected in sequence;
the encoder module comprises a first convolution unit, a second convolution unit, a third convolution unit, a fourth convolution unit and a fifth convolution unit which are connected in sequence; the first convolution unit and the second convolution unit are respectively composed of a convolution layer with convolution kernels of 3 x 3 and a maximum pooling layer which are sequentially connected, and the third convolution unit, the fourth convolution unit and the fifth convolution unit are respectively composed of a convolution layer with convolution kernels of 3 x 3, a maximum pooling layer and a non-local module layer which are sequentially connected;
the multi-scale feature module comprises a first convolutional layer, second, third, and fourth dilated convolutional layers, a mean layer, and a fifth convolutional layer connected in sequence; the first convolutional layer has 1 × 1 convolution kernels; the second, third, and fourth dilated convolutional layers have 3 × 3 convolution kernels with dilation rates of 4, 8, and 12, respectively; the fifth convolutional layer has 1 × 1 convolution kernels; the mean layer takes the output of the fourth dilated convolutional layer, averages it while keeping the input size unchanged, and feeds the result to the fifth convolutional layer;
the decoder module comprises a first deconvolution unit, a second deconvolution unit, and a third deconvolution unit connected in sequence; each of the three deconvolution units consists of an upsampling layer and a convolutional layer with 3 × 3 convolution kernels.
Further, the loss function is:
Loss = (1/N) Σ_{i=1}^{N} (KL_i - NSS_i - SSIM_i)

where N is the total number of images in the data set, i is the index of the ith image, and KL_i, NSS_i, and SSIM_i are the relative entropy, normalized scanpath saliency, and structural similarity of the ith image, respectively (the published formula appears only as an image; the form above is reconstructed from the description, with the relative entropy minimized and the other two terms maximized at equal weight).
A visual saliency detection system based on a multilayer non-local network, comprising:
an acquisition module for acquiring an image data set to be detected; and
a processing module for inputting the image data set to be detected into a pre-trained saliency detection model to obtain a set of saliency maps, wherein the pre-trained saliency detection model is a multilayer non-local network model trained with minimization of a loss function as the objective.
Further, the processing module includes a model building module for building a multi-layer non-local neural network based on the architecture of VGG 16.
Further, the model building module is used for building an input layer, an encoder module, a multi-scale feature module, a decoder module and an output layer which are connected in sequence;
the encoder module comprises a first convolution unit, a second convolution unit, a third convolution unit, a fourth convolution unit and a fifth convolution unit which are sequentially connected; the first convolution unit and the second convolution unit are respectively composed of a convolution layer with convolution kernels of 3 x 3 and a maximum pooling layer which are connected in sequence, and the third convolution unit, the fourth convolution unit and the fifth convolution unit are respectively composed of a convolution layer with convolution kernels of 3 x 3, a maximum pooling layer and a non-local module layer which are connected in sequence;
the multi-scale feature module comprises a first convolutional layer, second, third, and fourth dilated convolutional layers, a mean layer, and a fifth convolutional layer connected in sequence; the first convolutional layer has 1 × 1 convolution kernels; the second, third, and fourth dilated convolutional layers have 3 × 3 convolution kernels with dilation rates of 4, 8, and 12, respectively; the fifth convolutional layer has 1 × 1 convolution kernels; the mean layer takes the output of the fourth dilated convolutional layer, averages it while keeping the input size unchanged, and feeds the result to the fifth convolutional layer;
the decoder module comprises a first deconvolution unit, a second deconvolution unit, and a third deconvolution unit connected in sequence; each of the three deconvolution units consists of an upsampling layer and a convolutional layer with 3 × 3 convolution kernels.
Further, the processing module comprises a loss function determining module for determining a loss function as follows:
Loss = (1/N) Σ_{i=1}^{N} (KL_i - NSS_i - SSIM_i)

where N is the total number of images in the data set, i is the index of the ith image, and KL_i, NSS_i, and SSIM_i are the relative entropy, normalized scanpath saliency, and structural similarity of the ith image, respectively (the published formula appears only as an image; the form above is reconstructed from the description, with the relative entropy minimized and the other two terms maximized at equal weight).
The invention achieves the following beneficial effects:
1. The multilayer non-local network of the method enlarges the receptive field of the model. Unlike repeated recurrent operations, the non-local module captures long-range dependencies directly by computing the interaction between any two pixels, unconstrained by the spatial distance between them, which improves efficiency and yields better results.
2. During training, a loss function composed of relative entropy, normalized scanpath saliency, and structural similarity is adopted, so that the trained optimal model is more comprehensive and general.
Drawings
FIG. 1 is a flow chart of a method for detecting visual saliency based on a multilayer non-local network according to the present invention;
FIG. 2 is a schematic diagram of a multi-layer non-local network;
fig. 3 compares the saliency maps obtained by different algorithms on the SALICON test set and the MIT1003 test set according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
A visual saliency detection method based on a multilayer non-local network, as shown in FIG. 1, includes the following four steps:
step 1) constructing a multilayer non-local network on the basis of VGG 16.
In an implementation of the invention, the proposed multilayer non-local neural network is an improvement on the VGG16 architecture. The multilayer non-local network mainly comprises an encoder module, a multi-scale feature module, and a decoder module connected in sequence. The encoder module uses convolutional layers at different levels to retain fine features and appends a non-local module at the end of the third, fourth, and fifth convolution units to process global information; the multi-scale feature module captures multi-scale image information with five convolutional layers of different dilation factors; the decoder module uses a combination of three upsampling layers and convolutional layers to restore the feature map to the resolution of the original image.
As shown in fig. 2, the multilayer non-local network comprises an input layer, a first convolution unit, a second convolution unit, a third convolution unit, a fourth convolution unit, a fifth convolution unit, a multi-scale feature module, a decoder module, and an output layer, which are connected in sequence.
The first convolution unit includes two convolution layers of 3 × 3 × 64 and one maximum pooling layer connected in sequence.
The second convolution unit includes two convolution layers of 3 × 3 × 128 and one maximum pooling layer connected in sequence.
The third convolution unit includes three convolution layers of 3 × 3 × 256, a maximum pooling layer, and a non-local module layer, which are connected in sequence.
The fourth convolution unit includes three convolution layers of 3 × 3 × 512, a maximum pooling layer, and a non-local module layer, which are connected in sequence.
The fifth convolution unit includes three convolution layers of 3 × 3 × 512, a maximum pooling layer, and a non-local module layer, which are connected in sequence.
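Collecting the five units above into data makes the correspondence with the VGG16 backbone easy to check; this is a descriptive sketch for illustration, not code from the patent:

```python
# Encoder configuration as described above:
# (number of conv layers, output channels, has a trailing non-local module).
ENCODER_UNITS = [
    (2, 64,  False),
    (2, 128, False),
    (3, 256, True),
    (3, 512, True),
    (3, 512, True),
]

total_convs = sum(n for n, _, _ in ENCODER_UNITS)        # 13 conv layers, as in VGG16
non_local_count = sum(nl for _, _, nl in ENCODER_UNITS)  # 3 non-local modules
print(total_convs, non_local_count)  # 13 3
```

The 13 convolutional layers match VGG16's convolutional backbone, with the three non-local modules as the additions described in the text.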
The multi-scale feature module of the multilayer non-local network applies atrous (dilated) convolutions with different sampling rates to a given input in parallel, which amounts to capturing image information at multiple scales. The module comprises a first convolutional layer, second, third, and fourth dilated convolutional layers, a mean layer, and a fifth convolutional layer connected in sequence. The first convolutional layer has 256 1 × 1 convolution kernels; the second, third, and fourth dilated convolutional layers each have 256 3 × 3 convolution kernels, with dilation rates of 4, 8, and 12, respectively; the fifth convolutional layer has 256 1 × 1 convolution kernels. The mean layer takes the output of the fourth dilated convolutional layer, averages it while keeping the input size unchanged, and feeds the result to the fifth convolutional layer.
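As an illustrative aside (not from the patent text), the coverage of these dilated convolutions can be checked in a few lines: a k × k kernel with dilation rate r spans an effective extent of k + (k - 1)(r - 1) pixels, so rates 4, 8, and 12 give progressively larger receptive fields at constant parameter count.

```python
def effective_extent(kernel_size: int, rate: int) -> int:
    """Spatial extent covered by a dilated convolution kernel: the taps are
    spread `rate` pixels apart, so a k-tap kernel spans k + (k-1)*(rate-1)."""
    return kernel_size + (kernel_size - 1) * (rate - 1)

# Extents for the module's three dilated 3 x 3 convolutions.
extents = {rate: effective_extent(3, rate) for rate in (4, 8, 12)}
print(extents)  # {4: 9, 8: 17, 12: 25}
```

A plain 3 × 3 convolution would need many stacked layers to cover a 25 × 25 neighborhood; the dilated variants do it with the same nine weights each.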
The decoder module comprises a first deconvolution unit, a second deconvolution unit and a third deconvolution unit which are connected in sequence. Wherein the first deconvolution unit is composed of an upsampling layer and a convolutional layer composed of 128 3 × 3 convolution kernels, the second deconvolution unit is composed of an upsampling layer and a convolutional layer composed of 64 3 × 3 convolution kernels, and the third deconvolution unit is composed of an upsampling layer and a convolutional layer composed of 32 3 × 3 convolution kernels.
In the implementation of the invention, the improvement point of the multilayer non-local network to the VGG16 network is as follows:
(1) non-local modules are embedded on the basis of the VGG16 network, so that the salient features of the image can be captured by combining the local information and the global information of the image.
(2) The three non-local modules are not embedded into the network as a single block; instead, they are placed after the third, fourth, and fifth convolution units of the encoder module, respectively, which enlarges the receptive field of the model.
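To make the non-local operation concrete, the sketch below implements a simplified embedded-Gaussian non-local block in NumPy on a flattened feature map. This is a hedged illustration, not the patent's implementation: the weight matrices stand in for the 1 × 1 convolutions of the published formulation, and the shapes are chosen for the example.

```python
import numpy as np

def non_local_block(x, w_theta, w_phi, w_g, w_out):
    """Simplified embedded-Gaussian non-local operation.
    x: (num_positions, channels) flattened feature map."""
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g
    scores = theta @ phi.T                       # interaction between every pixel pair
    scores -= scores.max(axis=1, keepdims=True)  # stabilize the softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # normalize over all positions
    y = attn @ g                                 # aggregate global context per position
    return x + y @ w_out                         # residual connection back to the input

rng = np.random.default_rng(0)
positions, channels, inner = 16, 8, 4
x = rng.standard_normal((positions, channels))
w_theta, w_phi, w_g = (rng.standard_normal((channels, inner)) for _ in range(3))
w_out = rng.standard_normal((inner, channels))
out = non_local_block(x, w_theta, w_phi, w_g, w_out)
```

Because every position attends to every other position in a single step, dependencies between distant pixels are captured without stacking many local convolutions, which is the receptive-field enlargement the text describes.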
And 2) taking the minimum value of the loss function as a target training model.
In an implementation of the invention, the training set of the Saliency in Context (SALICON) data set is fed into the multilayer non-local network, with minimization of the loss function as the objective; the learning rate is set to 0.00001 and the number of iterations to 1000, and the proposed multilayer non-local network is trained to obtain the saliency detection model. The loss function is defined as:
Loss = (1/N) Σ_{i=1}^{N} (KL_i - NSS_i - SSIM_i)

where N is the total number of images in the data set, i is the index of the ith image, and KL_i, NSS_i, and SSIM_i are the relative entropy, normalized scanpath saliency, and structural similarity of the ith image, respectively (the published formula appears only as an image; the form above is reconstructed from the description, with the relative entropy minimized and the other two terms maximized at equal weight).
Relative entropy is typically used to measure the distance between two distributions; for saliency maps, it is the distance between the ground-truth saliency distribution and the predicted distribution. Normalized scanpath saliency is defined as the mean predicted saliency at the locations of human eye fixation points. Image structural similarity (SSIM) is an index measuring how similar two images are and is used mainly for image quality assessment; it is applied to saliency evaluation here because SSIM is computed over a local window that slides pixel by pixel until a local similarity index has been obtained at every position, after which the average is taken as the global structural similarity index.
The loss function formed by the integration of the three indexes more comprehensively feeds back the quality of the model.
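The three components can be sketched in NumPy as follows. This is a hedged illustration, not the patent's code: the SSIM term is computed over the whole map rather than over sliding windows, the fixation-map convention and the sign choices are our assumptions, and the equal 1:1:1 weighting follows the description.

```python
import numpy as np

EPS = 1e-7

def kl_div(gt, pred):
    """Relative entropy between saliency maps normalized into distributions."""
    p = gt / (gt.sum() + EPS)
    q = pred / (pred.sum() + EPS)
    return float(np.sum(p * np.log(EPS + p / (q + EPS))))

def nss(pred, fixations):
    """Mean of the standardized prediction at human fixation points."""
    z = (pred - pred.mean()) / (pred.std() + EPS)
    return float(z[fixations > 0].mean())

def ssim_whole(a, b, c1=1e-4, c2=9e-4):
    """SSIM over the whole map; the full index averages sliding local windows."""
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (a.var() + b.var() + c2)
    return float(num / den)

def combined_loss(gt, pred, fixations):
    # KL should shrink while NSS and SSIM should grow, hence the signs.
    return kl_div(gt, pred) - nss(pred, fixations) - ssim_whole(gt, pred)
```

A perfect prediction drives the KL term toward 0 and the SSIM term toward 1, so minimizing the combined loss pushes the model toward all three goals at once.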
The improvement in this implementation of the invention is as follows:
The proposed loss function consists of three indexes measuring the agreement between two saliency maps, namely relative entropy, normalized scanpath saliency, and image structural similarity, weighted in a 1:1:1 ratio. This enhances the applicability of the model, and the loss function formed from the three indexes together gives comprehensive feedback on model quality.
And 3) inputting the image data set to be detected into the significance detection model to obtain a significance image set.
After the optimal model is obtained in step 2, the SALICON test set and the MIT1003 test set are used as test data, and prediction is performed with the trained model to obtain the saliency image set. Fig. 3 compares the saliency maps produced by the algorithms on these test sets in this implementation; from left to right it shows the original image, the saliency map of the saliency attentive model with an Oxford Visual Geometry Group (VGG) backbone (SAM-VGG), the saliency map of the saliency attentive model with a residual network backbone (SAM-ResNet), the saliency map of the present invention, and the ground-truth heat map (GT). The first three original images are taken from the SALICON test set and the last from the MIT1003 test set.
Step 4) Evaluate the obtained saliency image set with the following metrics: relative entropy (KLD), normalized scanpath saliency (NSS), correlation coefficient (CC), and area under the ROC curve (AUC-J).
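Of these metrics, the correlation coefficient is the simplest to state precisely: it is the Pearson linear correlation between the predicted and ground-truth maps. The sketch below is illustrative (the function name and epsilon are ours, not the patent's):

```python
import numpy as np

def cc(pred, gt):
    """Pearson linear correlation coefficient between a predicted saliency
    map and the ground-truth density map (+1 = perfect linear agreement,
    0 = no linear relation, -1 = perfect inverse relation)."""
    p = pred - pred.mean()
    g = gt - gt.mean()
    denom = np.sqrt((p ** 2).sum() * (g ** 2).sum()) + 1e-12
    return float((p * g).sum() / denom)

ramp = np.linspace(0.0, 1.0, 16).reshape(4, 4)
print(cc(ramp, 2.0 * ramp + 1.0))  # affine transform of the same map -> ~1.0
```

CC is invariant to affine rescaling of either map, which is why it complements distribution-sensitive metrics such as KLD.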
Table 1 compares the proposed multilayer non-local network saliency detection method with other methods on AUC-J, KLD, and CC.
TABLE 1
(The contents of Table 1 appear only as images in the original publication and are not reproduced here.)
As can be seen from Table 1, the results of the method of the present invention are superior to those of other methods in the three indexes of AUC-J, KLD and CC, because the multi-layer non-local network combines global information and local information to capture significance characteristics, and combines a composite loss function to train a model, which has better comprehensiveness and universality.
Correspondingly, the invention also provides a visual saliency detection system based on the multilayer non-local network, which comprises:
the acquisition module is used for acquiring an image data set to be detected;
the processing module is used for inputting the image data set to be detected into a pre-trained significance detection model to obtain a significance image set; the pre-trained significance detection model is a multilayer non-local network model obtained by training with the minimum value of the loss function as a target.
The processing module includes a model building module for building a multi-layer non-local neural network based on the architecture of VGG 16.
The model building module is used for building an input layer, an encoder module, a multi-scale feature module, a decoder module and an output layer which are connected in sequence;
the encoder module comprises a first convolution unit, a second convolution unit, a third convolution unit, a fourth convolution unit and a fifth convolution unit which are sequentially connected; the first convolution unit and the second convolution unit are respectively composed of a convolution layer with convolution kernels of 3 x 3 and a maximum pooling layer which are connected in sequence, and the third convolution unit, the fourth convolution unit and the fifth convolution unit are respectively composed of a convolution layer with convolution kernels of 3 x 3, a maximum pooling layer and a non-local module layer which are connected in sequence;
the multi-scale feature module comprises a first convolutional layer, second, third, and fourth dilated convolutional layers, a mean layer, and a fifth convolutional layer connected in sequence; the first convolutional layer has 1 × 1 convolution kernels; the second, third, and fourth dilated convolutional layers have 3 × 3 convolution kernels with dilation rates of 4, 8, and 12, respectively; the fifth convolutional layer has 1 × 1 convolution kernels; the mean layer takes the output of the fourth dilated convolutional layer, averages it while keeping the input size unchanged, and feeds the result to the fifth convolutional layer;
the decoder module comprises a first deconvolution unit, a second deconvolution unit, and a third deconvolution unit connected in sequence; each of the three deconvolution units consists of an upsampling layer and a convolutional layer with 3 × 3 convolution kernels.
The processing module comprises a loss function determination module for determining a loss function as follows:
Loss = (1/N) Σ_{i=1}^{N} (KL_i - NSS_i - SSIM_i)

where N is the total number of images in the data set, i is the index of the ith image, and KL_i, NSS_i, and SSIM_i are the relative entropy, normalized scanpath saliency, and structural similarity of the ith image, respectively (the published formula appears only as an image; the form above is reconstructed from the description, with the relative entropy minimized and the other two terms maximized at equal weight).
According to the method, a multilayer non-local network is constructed on the basis of the VGG16 network, enlarging the receptive field of the model. Unlike repeated recurrent operations, the non-local module captures long-range dependencies directly by computing the interaction between any two pixels, unconstrained by the spatial distance between them, which improves efficiency and yields better results. During training, a loss function composed of relative entropy, normalized scanpath saliency, and structural similarity is adopted, so that the trained optimal model is more comprehensive and general.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (8)

1. A visual saliency detection method based on a multilayer non-local network, characterized by comprising:
acquiring an image data set to be detected; and
inputting the image data set to be detected into a pre-trained saliency detection model to obtain a set of saliency maps, wherein the pre-trained saliency detection model is a multilayer non-local network model trained with minimization of a loss function as the objective.
2. The visual saliency detection method based on a multilayer non-local network according to claim 1, characterized in that the multilayer non-local network model is a multilayer non-local neural network based on the VGG16 architecture.
3. The visual saliency detection method based on a multilayer non-local network according to claim 2, characterized in that the VGG16-based multilayer non-local neural network comprises an input layer, an encoder module, a multi-scale feature module, a decoder module, and an output layer connected in sequence;
the encoder module comprises a first convolution unit, a second convolution unit, a third convolution unit, a fourth convolution unit and a fifth convolution unit which are connected in sequence; the first convolution unit and the second convolution unit are respectively composed of a convolution layer with convolution kernels of 3 x 3 and a maximum pooling layer which are sequentially connected, and the third convolution unit, the fourth convolution unit and the fifth convolution unit are respectively composed of a convolution layer with convolution kernels of 3 x 3, a maximum pooling layer and a non-local module layer which are sequentially connected;
the multi-scale feature module comprises a first convolutional layer, a second expansion convolutional layer, a third expansion convolutional layer, a fourth expansion convolutional layer, a mean layer and a fifth convolutional layer connected in sequence; the first convolutional layer has a 1 × 1 kernel; the second, third and fourth expansion convolutional layers have 3 × 3 kernels with expansion (dilation) rates of 4, 8 and 12, respectively; the fifth convolutional layer has a 1 × 1 kernel; the mean layer takes the output of the fourth expansion convolutional layer as input, computes the mean of that input while keeping the input size unchanged, and feeds the result to the fifth convolutional layer;
the decoder module comprises a first deconvolution unit, a second deconvolution unit and a third deconvolution unit connected in sequence; the first deconvolution unit, the second deconvolution unit and the third deconvolution unit each consist of an upsampling layer and a convolutional layer with a 3 × 3 kernel.
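A minimal sketch (not part of the patent) of how the claimed layer shapes compose: the five encoder units each halve the spatial size via pooling, while the expansion (dilated) convolutions in the multi-scale module preserve it. The 224 × 224 input size, "same" padding for the 3 × 3 convolutions, and 2 × 2 stride-2 max pooling are assumptions the claim leaves unspecified.

```python
def conv2d_out(n, k, s=1, p=0, d=1):
    """Output spatial size of a square conv/pool layer (floor convention)."""
    return (n + 2 * p - d * (k - 1) - 1) // s + 1

def encoder_out(n):
    """Five convolution units: 3x3 conv ("same" padding) + 2x2 stride-2 max pool."""
    for _ in range(5):
        n = conv2d_out(n, k=3, p=1)   # 3x3 conv, size preserved
        n = conv2d_out(n, k=2, s=2)   # 2x2 max pool, size halved
    return n

def same_padding(k, d):
    """Padding that keeps the size unchanged for a stride-1 dilated conv."""
    return d * (k - 1) // 2

feat = encoder_out(224)               # 224 / 2**5 = 7
# The expansion convolutions keep this size for rates 4, 8 and 12:
dilated = [conv2d_out(feat, k=3, p=same_padding(3, r), d=r) for r in (4, 8, 12)]
print(feat, dilated)                  # 7 [7, 7, 7]
```

Dilation enlarges the receptive field (an effective kernel of d·(k−1)+1) without shrinking the feature map, which is why three rates can be stacked in sequence here.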
4. The visual saliency detection method based on a multilayer non-local network according to claim 1, characterized in that the loss function is:
[Equation image FDA0002797702590000021 — loss function over the N images, combining KL_i, NSS_i and SSIM_i]
where N is the total number of images in the dataset, i is the index of the ith image, and KL_i, NSS_i and SSIM_i are the relative entropy (KL divergence), normalized scanpath saliency and structural similarity of the ith image, respectively.
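The three per-image terms named in claim 4 can be sketched as follows. The exact weighting is given only in the patent's equation image and is not reproduced here; the `combined_loss` below assumes a hypothetical unweighted form in which KL divergence is minimized while NSS and SSIM, being similarity scores, enter with a negative sign — a common convention, not the patent's formula. Maps are flattened to 1-D lists for brevity.

```python
import math

def kl_divergence(pred, gt, eps=1e-8):
    """Relative entropy KL(gt || pred) between two sum-normalized saliency maps."""
    sp, sg = sum(pred), sum(gt)
    return sum((g / sg) * math.log((g / sg + eps) / (p / sp + eps))
               for p, g in zip(pred, gt))

def nss(pred, fixation_idx):
    """Normalized scanpath saliency: mean z-scored prediction at fixation points."""
    mu = sum(pred) / len(pred)
    sd = math.sqrt(sum((p - mu) ** 2 for p in pred) / len(pred)) or 1.0
    return sum((pred[i] - mu) / sd for i in fixation_idx) / len(fixation_idx)

def combined_loss(kl_list, nss_list, ssim_list):
    """Hypothetical unweighted combination: minimize KL, maximize NSS and SSIM."""
    n = len(kl_list)
    return sum(k - s - m for k, s, m in zip(kl_list, nss_list, ssim_list)) / n
```

For identical maps KL is zero, and a high NSS or SSIM lowers the loss, so gradient descent pushes the predicted map toward the ground-truth distribution on all three criteria at once.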
5. A visual saliency detection system based on a multilayer non-local network, characterized by comprising:
an acquisition module, configured to acquire an image data set to be detected;
a processing module, configured to input the image data set to be detected into a pre-trained saliency detection model to obtain a set of saliency maps; the pre-trained saliency detection model is a multilayer non-local network model trained with the objective of minimizing the loss function.
6. The visual saliency detection system based on a multilayer non-local network according to claim 5, characterized in that the processing module comprises a model construction module for constructing a multilayer non-local neural network based on the VGG16 architecture.
7. The visual saliency detection system based on a multilayer non-local network according to claim 6, characterized in that the model construction module is configured to construct an input layer, an encoder module, a multi-scale feature module, a decoder module and an output layer connected in sequence;
the encoder module comprises a first convolution unit, a second convolution unit, a third convolution unit, a fourth convolution unit and a fifth convolution unit connected in sequence; the first convolution unit and the second convolution unit each consist of a convolutional layer with a 3 × 3 kernel and a max pooling layer connected in sequence, and the third convolution unit, the fourth convolution unit and the fifth convolution unit each consist of a convolutional layer with a 3 × 3 kernel, a max pooling layer and a non-local module layer connected in sequence;
the multi-scale feature module comprises a first convolutional layer, a second expansion convolutional layer, a third expansion convolutional layer, a fourth expansion convolutional layer, a mean layer and a fifth convolutional layer connected in sequence; the first convolutional layer has a 1 × 1 kernel; the second, third and fourth expansion convolutional layers have 3 × 3 kernels with expansion (dilation) rates of 4, 8 and 12, respectively; the fifth convolutional layer has a 1 × 1 kernel; the mean layer takes the output of the fourth expansion convolutional layer as input, computes the mean of that input while keeping the input size unchanged, and feeds the result to the fifth convolutional layer;
the decoder module comprises a first deconvolution unit, a second deconvolution unit and a third deconvolution unit connected in sequence; the first deconvolution unit, the second deconvolution unit and the third deconvolution unit each consist of an upsampling layer and a convolutional layer with a 3 × 3 kernel.
8. The visual saliency detection system based on a multilayer non-local network according to claim 5, characterized in that the processing module comprises a loss function determination module for determining the following loss function:
[Equation image FDA0002797702590000031 — loss function over the N images, combining KL_i, NSS_i and SSIM_i]
where N is the total number of images in the dataset, i is the index of the ith image, and KL_i, NSS_i and SSIM_i are the relative entropy (KL divergence), normalized scanpath saliency and structural similarity of the ith image, respectively.
CN202011337545.3A 2020-11-25 2020-11-25 Visual saliency detection method and system based on multilayer non-local network Pending CN114612675A (en)


Publications (1)

Publication Number Publication Date
CN114612675A true CN114612675A (en) 2022-06-10


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018028255A1 (en) * 2016-08-11 2018-02-15 深圳市未来媒体技术研究院 Image saliency detection method based on adversarial network
WO2019136591A1 (en) * 2018-01-09 2019-07-18 深圳大学 Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network
CN110555434A (en) * 2019-09-03 2019-12-10 浙江科技学院 method for detecting visual saliency of three-dimensional image through local contrast and global guidance
CN111340046A (en) * 2020-02-18 2020-06-26 上海理工大学 Visual saliency detection method based on feature pyramid network and channel attention



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination