CN114612675A - Visual saliency detection method and system based on multilayer non-local network - Google Patents


Info

Publication number
CN114612675A
CN114612675A (application CN202011337545.3A)
Authority
CN
China
Prior art keywords
convolution
layer
unit
convolutional layer
expansion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011337545.3A
Other languages
Chinese (zh)
Inventor
崔子冠
沈婷婷
王淑菲
张一帆
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202011337545.3A priority Critical patent/CN114612675A/en
Publication of CN114612675A publication Critical patent/CN114612675A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a visual saliency detection method and system based on a multilayer non-local network. The method acquires an image data set to be detected and inputs it into a pre-trained saliency detection model to obtain a set of saliency maps; the pre-trained saliency detection model is a multilayer non-local network model trained with minimization of a loss function as the objective. The advantages are that the multilayer non-local network enlarges the receptive field of the model: unlike repeated recurrent operations, the non-local module captures long-range dependencies directly by computing the interaction between any two pixels, unconstrained by the spatial distance between them, which improves efficiency and yields better results. During training, a loss function composed of relative entropy, normalized scanpath saliency, and structural similarity is adopted, so that the trained optimal model is more comprehensive and general.

Description

Visual saliency detection method and system based on multilayer non-local network
Technical Field
The invention relates to a visual saliency detection method and system based on a multilayer non-local network, and belongs to the technical field of saliency detection in image processing.
Background
The human visual system can quickly locate and attend to interesting content in images. Research on visual selective attention mechanisms that mimic this ability has made image saliency detection a research hotspot in recent years, with extremely important applications in target tracking, navigation, image quality assessment, face recognition, and video coding optimization.
Saliency detection aims to identify the most conspicuous objects in an input image. Since its development began around 1998, and taking 2014 as the dividing line, salient object detection can be split into two eras: traditional methods and deep learning methods.
Over the last two decades, traditional salient object detection methods have fallen into two groups according to the kind of visual cues they use and how they use them: first, those based on block-level versus region-level visual cues; second, those using only internal cues provided by the image itself versus those that also introduce external cues such as user annotations. In summary, traditional methods rely on large amounts of saliency prior information and mainly on hand-crafted features; such features may fail to describe complex image scenes and object structures and cannot adapt to new scenes and objects, so their generalization ability is poor, and saliency detection based on traditional methods has therefore hit a bottleneck.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a visual saliency detection method and system based on a multilayer non-local network.
In order to solve the above technical problems, the present invention provides a visual saliency detection method based on a multilayer non-local network, comprising:
acquiring an image data set to be detected;
inputting the image data set to be detected into a pre-trained saliency detection model to obtain a set of saliency maps, wherein the pre-trained saliency detection model is a multilayer non-local network model trained with minimization of a loss function as the objective.
Further, the multilayer non-local network model is a multilayer non-local neural network based on the VGG16 architecture.
Further, the multilayer non-local neural network based on the architecture of the VGG16 comprises an input layer, an encoder module, a multi-scale feature module, a decoder module and an output layer which are connected in sequence;
the encoder module comprises a first convolution unit, a second convolution unit, a third convolution unit, a fourth convolution unit and a fifth convolution unit which are connected in sequence; the first convolution unit and the second convolution unit are respectively composed of a convolution layer with convolution kernels of 3 x 3 and a maximum pooling layer which are sequentially connected, and the third convolution unit, the fourth convolution unit and the fifth convolution unit are respectively composed of a convolution layer with convolution kernels of 3 x 3, a maximum pooling layer and a non-local module layer which are sequentially connected;
the multi-scale feature module comprises a first convolutional layer, second, third, and fourth dilated convolutional layers, a mean layer, and a fifth convolutional layer connected in sequence; the first convolutional layer has 1 × 1 convolution kernels; the second, third, and fourth dilated convolutional layers have 3 × 3 convolution kernels with dilation rates of 4, 8, and 12, respectively; the fifth convolutional layer has 1 × 1 convolution kernels; the mean layer takes the output of the fourth dilated convolutional layer, averages it while keeping the input size unchanged, and feeds the result to the fifth convolutional layer;
the decoder module comprises a first deconvolution unit, a second deconvolution unit, and a third deconvolution unit connected in sequence; each of the three deconvolution units consists of an upsampling layer and a convolutional layer with 3 × 3 convolution kernels.
Further, the loss function is:
Loss = (1/N) Σ_{i=1}^{N} (KL_i - NSS_i - SSIM_i)

where N is the total number of images in the data set, i is the index of the ith image, and KL_i, NSS_i, and SSIM_i are the relative entropy, normalized scanpath saliency, and structural similarity of the ith image, respectively (the published formula appears only as an image; the form above is reconstructed from the description, with the relative entropy minimized and the other two terms maximized at equal weight).
A visual saliency detection system based on a multilayer non-local network, comprising:
an acquisition module for acquiring an image data set to be detected; and
a processing module for inputting the image data set to be detected into a pre-trained saliency detection model to obtain a set of saliency maps, wherein the pre-trained saliency detection model is a multilayer non-local network model trained with minimization of a loss function as the objective.
Further, the processing module includes a model building module for building a multi-layer non-local neural network based on the architecture of VGG 16.
Further, the model building module is used for building an input layer, an encoder module, a multi-scale feature module, a decoder module and an output layer which are connected in sequence;
the encoder module comprises a first convolution unit, a second convolution unit, a third convolution unit, a fourth convolution unit and a fifth convolution unit which are sequentially connected; the first convolution unit and the second convolution unit are respectively composed of a convolution layer with convolution kernels of 3 x 3 and a maximum pooling layer which are connected in sequence, and the third convolution unit, the fourth convolution unit and the fifth convolution unit are respectively composed of a convolution layer with convolution kernels of 3 x 3, a maximum pooling layer and a non-local module layer which are connected in sequence;
the multi-scale feature module comprises a first convolutional layer, second, third, and fourth dilated convolutional layers, a mean layer, and a fifth convolutional layer connected in sequence; the first convolutional layer has 1 × 1 convolution kernels; the second, third, and fourth dilated convolutional layers have 3 × 3 convolution kernels with dilation rates of 4, 8, and 12, respectively; the fifth convolutional layer has 1 × 1 convolution kernels; the mean layer takes the output of the fourth dilated convolutional layer, averages it while keeping the input size unchanged, and feeds the result to the fifth convolutional layer;
the decoder module comprises a first deconvolution unit, a second deconvolution unit, and a third deconvolution unit connected in sequence; each of the three deconvolution units consists of an upsampling layer and a convolutional layer with 3 × 3 convolution kernels.
Further, the processing module comprises a loss function determining module for determining a loss function as follows:
Loss = (1/N) Σ_{i=1}^{N} (KL_i - NSS_i - SSIM_i)

where N is the total number of images in the data set, i is the index of the ith image, and KL_i, NSS_i, and SSIM_i are the relative entropy, normalized scanpath saliency, and structural similarity of the ith image, respectively (the published formula appears only as an image; the form above is reconstructed from the description, with the relative entropy minimized and the other two terms maximized at equal weight).
The invention achieves the following beneficial effects:
1. The multilayer non-local network of the method enlarges the receptive field of the model. Unlike repeated recurrent operations, the non-local module captures long-range dependencies directly by computing the interaction between any two pixels, unconstrained by the spatial distance between them, which improves efficiency and yields better results.
2. During training, a loss function composed of relative entropy, normalized scanpath saliency, and structural similarity is adopted, so that the trained optimal model is more comprehensive and general.
Drawings
FIG. 1 is a flow chart of a method for detecting visual saliency based on a multilayer non-local network according to the present invention;
FIG. 2 is a schematic diagram of a multi-layer non-local network;
fig. 3 compares the saliency maps obtained by different algorithms on the SALICON test set and the MIT1003 test set according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
A visual saliency detection method based on a multilayer non-local network, as shown in FIG. 1, includes the following four steps:
step 1) constructing a multilayer non-local network on the basis of VGG 16.
In an implementation of the invention, the proposed multilayer non-local neural network is an improvement on the VGG16 architecture. The multilayer non-local network mainly comprises an encoder module, a multi-scale feature module, and a decoder module connected in sequence. The encoder module uses convolutional layers at different levels to retain fine features and appends a non-local module at the end of the third, fourth, and fifth convolution units to process global information; the multi-scale feature module captures multi-scale image information with five convolutional layers of different dilation factors; the decoder module uses a combination of three upsampling layers and convolutional layers to restore the feature map to the resolution of the original image.
As shown in fig. 2, the multilayer non-local network comprises an input layer, a first convolution unit, a second convolution unit, a third convolution unit, a fourth convolution unit, a fifth convolution unit, a multi-scale feature module, a decoder module, and an output layer, which are connected in sequence.
The first convolution unit includes two convolution layers of 3 × 3 × 64 and one maximum pooling layer connected in sequence.
The second convolution unit includes two convolution layers of 3 × 3 × 128 and one maximum pooling layer connected in sequence.
The third convolution unit includes three convolution layers of 3 × 3 × 256, a maximum pooling layer, and a non-local module layer, which are connected in sequence.
The fourth convolution unit includes three convolution layers of 3 × 3 × 512, a maximum pooling layer, and a non-local module layer, which are connected in sequence.
The fifth convolution unit includes three convolution layers of 3 × 3 × 512, a maximum pooling layer, and a non-local module layer, which are connected in sequence.
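Collecting the five units above into data makes the correspondence with the VGG16 backbone easy to check; this is a descriptive sketch for illustration, not code from the patent:

```python
# Encoder configuration as described above:
# (number of conv layers, output channels, has a trailing non-local module).
ENCODER_UNITS = [
    (2, 64,  False),
    (2, 128, False),
    (3, 256, True),
    (3, 512, True),
    (3, 512, True),
]

total_convs = sum(n for n, _, _ in ENCODER_UNITS)        # 13 conv layers, as in VGG16
non_local_count = sum(nl for _, _, nl in ENCODER_UNITS)  # 3 non-local modules
print(total_convs, non_local_count)  # 13 3
```

The 13 convolutional layers match VGG16's convolutional backbone, with the three non-local modules as the additions described in the text.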
The multi-scale feature module of the multilayer non-local network applies atrous (dilated) convolutions with different sampling rates to a given input in parallel, which amounts to capturing image information at multiple scales. The module comprises a first convolutional layer, second, third, and fourth dilated convolutional layers, a mean layer, and a fifth convolutional layer connected in sequence. The first convolutional layer has 256 1 × 1 convolution kernels; the second, third, and fourth dilated convolutional layers each have 256 3 × 3 convolution kernels, with dilation rates of 4, 8, and 12, respectively; the fifth convolutional layer has 256 1 × 1 convolution kernels. The mean layer takes the output of the fourth dilated convolutional layer, averages it while keeping the input size unchanged, and feeds the result to the fifth convolutional layer.
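As an illustrative aside (not from the patent text), the coverage of these dilated convolutions can be checked in a few lines: a k × k kernel with dilation rate r spans an effective extent of k + (k - 1)(r - 1) pixels, so rates 4, 8, and 12 give progressively larger receptive fields at constant parameter count.

```python
def effective_extent(kernel_size: int, rate: int) -> int:
    """Spatial extent covered by a dilated convolution kernel: the taps are
    spread `rate` pixels apart, so a k-tap kernel spans k + (k-1)*(rate-1)."""
    return kernel_size + (kernel_size - 1) * (rate - 1)

# Extents for the module's three dilated 3 x 3 convolutions.
extents = {rate: effective_extent(3, rate) for rate in (4, 8, 12)}
print(extents)  # {4: 9, 8: 17, 12: 25}
```

A plain 3 × 3 convolution would need many stacked layers to cover a 25 × 25 neighborhood; the dilated variants do it with the same nine weights each.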
The decoder module comprises a first deconvolution unit, a second deconvolution unit and a third deconvolution unit which are connected in sequence. Wherein the first deconvolution unit is composed of an upsampling layer and a convolutional layer composed of 128 3 × 3 convolution kernels, the second deconvolution unit is composed of an upsampling layer and a convolutional layer composed of 64 3 × 3 convolution kernels, and the third deconvolution unit is composed of an upsampling layer and a convolutional layer composed of 32 3 × 3 convolution kernels.
In the implementation of the invention, the improvement point of the multilayer non-local network to the VGG16 network is as follows:
(1) non-local modules are embedded on the basis of the VGG16 network, so that the salient features of the image can be captured by combining the local information and the global information of the image.
(2) The three non-local modules are not embedded into the network as a single block; instead, they are placed after the third, fourth, and fifth convolution units of the encoder module, respectively, which enlarges the receptive field of the model.
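To make the non-local operation concrete, the sketch below implements a simplified embedded-Gaussian non-local block in NumPy on a flattened feature map. This is a hedged illustration, not the patent's implementation: the weight matrices stand in for the 1 × 1 convolutions of the published formulation, and the shapes are chosen for the example.

```python
import numpy as np

def non_local_block(x, w_theta, w_phi, w_g, w_out):
    """Simplified embedded-Gaussian non-local operation.
    x: (num_positions, channels) flattened feature map."""
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g
    scores = theta @ phi.T                       # interaction between every pixel pair
    scores -= scores.max(axis=1, keepdims=True)  # stabilize the softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # normalize over all positions
    y = attn @ g                                 # aggregate global context per position
    return x + y @ w_out                         # residual connection back to the input

rng = np.random.default_rng(0)
positions, channels, inner = 16, 8, 4
x = rng.standard_normal((positions, channels))
w_theta, w_phi, w_g = (rng.standard_normal((channels, inner)) for _ in range(3))
w_out = rng.standard_normal((inner, channels))
out = non_local_block(x, w_theta, w_phi, w_g, w_out)
```

Because every position attends to every other position in a single step, dependencies between distant pixels are captured without stacking many local convolutions, which is the receptive-field enlargement the text describes.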
And 2) taking the minimum value of the loss function as a target training model.
In an implementation of the invention, the training set of the Saliency in Context (SALICON) data set is fed into the multilayer non-local network, with minimization of the loss function as the objective; the learning rate is set to 0.00001 and the number of iterations to 1000, and the proposed multilayer non-local network is trained to obtain the saliency detection model. The loss function is defined as:
Loss = (1/N) Σ_{i=1}^{N} (KL_i - NSS_i - SSIM_i)

where N is the total number of images in the data set, i is the index of the ith image, and KL_i, NSS_i, and SSIM_i are the relative entropy, normalized scanpath saliency, and structural similarity of the ith image, respectively (the published formula appears only as an image; the form above is reconstructed from the description, with the relative entropy minimized and the other two terms maximized at equal weight).
Relative entropy is typically used to measure the distance between two distributions; for saliency maps, it is the distance between the ground-truth saliency distribution and the predicted distribution. Normalized scanpath saliency is defined as the mean predicted saliency at the locations of human eye fixation points. Image structural similarity (SSIM) is an index measuring how similar two images are and is used mainly for image quality assessment; it is applied to saliency evaluation here because SSIM is computed over a local window that slides pixel by pixel until a local similarity index has been obtained at every position, after which the average is taken as the global structural similarity index.
The loss function formed by the integration of the three indexes more comprehensively feeds back the quality of the model.
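The three components can be sketched in NumPy as follows. This is a hedged illustration, not the patent's code: the SSIM term is computed over the whole map rather than over sliding windows, the fixation-map convention and the sign choices are our assumptions, and the equal 1:1:1 weighting follows the description.

```python
import numpy as np

EPS = 1e-7

def kl_div(gt, pred):
    """Relative entropy between saliency maps normalized into distributions."""
    p = gt / (gt.sum() + EPS)
    q = pred / (pred.sum() + EPS)
    return float(np.sum(p * np.log(EPS + p / (q + EPS))))

def nss(pred, fixations):
    """Mean of the standardized prediction at human fixation points."""
    z = (pred - pred.mean()) / (pred.std() + EPS)
    return float(z[fixations > 0].mean())

def ssim_whole(a, b, c1=1e-4, c2=9e-4):
    """SSIM over the whole map; the full index averages sliding local windows."""
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (a.var() + b.var() + c2)
    return float(num / den)

def combined_loss(gt, pred, fixations):
    # KL should shrink while NSS and SSIM should grow, hence the signs.
    return kl_div(gt, pred) - nss(pred, fixations) - ssim_whole(gt, pred)
```

A perfect prediction drives the KL term toward 0 and the SSIM term toward 1, so minimizing the combined loss pushes the model toward all three goals at once.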
The improvement in this implementation of the invention is as follows:
The proposed loss function consists of three indexes measuring the agreement between two saliency maps, namely relative entropy, normalized scanpath saliency, and image structural similarity, weighted in a 1:1:1 ratio. This enhances the applicability of the model, and the loss function formed from the three indexes together gives comprehensive feedback on model quality.
And 3) inputting the image data set to be detected into the significance detection model to obtain a significance image set.
After the optimal model is obtained in step 2, the SALICON test set and the MIT1003 test set are used as test data, and prediction is performed with the trained model to obtain the saliency image set. Fig. 3 compares the saliency maps produced by the algorithms on these test sets in this implementation; from left to right it shows the original image, the saliency map of the saliency attentive model with an Oxford Visual Geometry Group (VGG) backbone (SAM-VGG), the saliency map of the saliency attentive model with a residual network backbone (SAM-ResNet), the saliency map of the present invention, and the ground-truth heat map (GT). The first three original images are taken from the SALICON test set and the last from the MIT1003 test set.
Step 4) Evaluate the obtained saliency image set with the following metrics: relative entropy (KLD), normalized scanpath saliency (NSS), correlation coefficient (CC), and area under the ROC curve (AUC-J).
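Of these metrics, the correlation coefficient is the simplest to state precisely: it is the Pearson linear correlation between the predicted and ground-truth maps. The sketch below is illustrative (the function name and epsilon are ours, not the patent's):

```python
import numpy as np

def cc(pred, gt):
    """Pearson linear correlation coefficient between a predicted saliency
    map and the ground-truth density map (+1 = perfect linear agreement,
    0 = no linear relation, -1 = perfect inverse relation)."""
    p = pred - pred.mean()
    g = gt - gt.mean()
    denom = np.sqrt((p ** 2).sum() * (g ** 2).sum()) + 1e-12
    return float((p * g).sum() / denom)

ramp = np.linspace(0.0, 1.0, 16).reshape(4, 4)
print(cc(ramp, 2.0 * ramp + 1.0))  # affine transform of the same map -> ~1.0
```

CC is invariant to affine rescaling of either map, which is why it complements distribution-sensitive metrics such as KLD.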
Table 1 compares the proposed multilayer non-local network saliency detection method with other methods on AUC-J, KLD, and CC.
TABLE 1
(The contents of Table 1 appear only as images in the original publication and are not reproduced here.)
As can be seen from Table 1, the results of the method of the present invention are superior to those of other methods in the three indexes of AUC-J, KLD and CC, because the multi-layer non-local network combines global information and local information to capture significance characteristics, and combines a composite loss function to train a model, which has better comprehensiveness and universality.
Correspondingly, the invention also provides a visual saliency detection system based on the multilayer non-local network, which comprises:
the acquisition module is used for acquiring an image data set to be detected;
the processing module is used for inputting the image data set to be detected into a pre-trained significance detection model to obtain a significance image set; the pre-trained significance detection model is a multilayer non-local network model obtained by training with the minimum value of the loss function as a target.
The processing module includes a model building module for building a multi-layer non-local neural network based on the architecture of VGG 16.
The model building module is used for building an input layer, an encoder module, a multi-scale feature module, a decoder module and an output layer which are connected in sequence;
the encoder module comprises a first convolution unit, a second convolution unit, a third convolution unit, a fourth convolution unit and a fifth convolution unit which are sequentially connected; the first convolution unit and the second convolution unit are respectively composed of a convolution layer with convolution kernels of 3 x 3 and a maximum pooling layer which are connected in sequence, and the third convolution unit, the fourth convolution unit and the fifth convolution unit are respectively composed of a convolution layer with convolution kernels of 3 x 3, a maximum pooling layer and a non-local module layer which are connected in sequence;
the multi-scale feature module comprises a first convolutional layer, second, third, and fourth dilated convolutional layers, a mean layer, and a fifth convolutional layer connected in sequence; the first convolutional layer has 1 × 1 convolution kernels; the second, third, and fourth dilated convolutional layers have 3 × 3 convolution kernels with dilation rates of 4, 8, and 12, respectively; the fifth convolutional layer has 1 × 1 convolution kernels; the mean layer takes the output of the fourth dilated convolutional layer, averages it while keeping the input size unchanged, and feeds the result to the fifth convolutional layer;
the decoder module comprises a first deconvolution unit, a second deconvolution unit, and a third deconvolution unit connected in sequence; each of the three deconvolution units consists of an upsampling layer and a convolutional layer with 3 × 3 convolution kernels.
The processing module comprises a loss function determination module for determining a loss function as follows:
Loss = (1/N) Σ_{i=1}^{N} (KL_i - NSS_i - SSIM_i)

where N is the total number of images in the data set, i is the index of the ith image, and KL_i, NSS_i, and SSIM_i are the relative entropy, normalized scanpath saliency, and structural similarity of the ith image, respectively (the published formula appears only as an image; the form above is reconstructed from the description, with the relative entropy minimized and the other two terms maximized at equal weight).
According to the method, a multilayer non-local network is constructed on the basis of the VGG16 network, enlarging the receptive field of the model. Unlike repeated recurrent operations, the non-local module captures long-range dependencies directly by computing the interaction between any two pixels, unconstrained by the spatial distance between them, which improves efficiency and yields better results. During training, a loss function composed of relative entropy, normalized scanpath saliency, and structural similarity is adopted, so that the trained optimal model is more comprehensive and general.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (8)

1. A visual saliency detection method based on a multilayer non-local network, characterized by comprising:
acquiring an image data set to be detected; and
inputting the image data set to be detected into a pre-trained saliency detection model to obtain a set of saliency maps, wherein the pre-trained saliency detection model is a multilayer non-local network model trained with minimization of a loss function as the objective.
2. The visual saliency detection method based on a multilayer non-local network according to claim 1, characterized in that the multilayer non-local network model is a multilayer non-local neural network based on the VGG16 architecture.
3. The visual saliency detection method based on a multilayer non-local network according to claim 2, characterized in that the VGG16-based multilayer non-local neural network comprises an input layer, an encoder module, a multi-scale feature module, a decoder module, and an output layer connected in sequence;
the encoder module comprises a first convolution unit, a second convolution unit, a third convolution unit, a fourth convolution unit and a fifth convolution unit which are connected in sequence; the first convolution unit and the second convolution unit are respectively composed of a convolution layer with convolution kernels of 3 x 3 and a maximum pooling layer which are sequentially connected, and the third convolution unit, the fourth convolution unit and the fifth convolution unit are respectively composed of a convolution layer with convolution kernels of 3 x 3, a maximum pooling layer and a non-local module layer which are sequentially connected;
the multi-scale feature module comprises a first convolutional layer, a second expansion convolutional layer, a third expansion convolutional layer, a fourth expansion convolutional layer, a mean layer and a fifth convolutional layer connected in sequence; the first convolutional layer has a 1 × 1 kernel; the second, third and fourth expansion convolutional layers have 3 × 3 kernels with expansion (dilation) rates of 4, 8 and 12, respectively; the fifth convolutional layer has a 1 × 1 kernel; the mean layer takes the output of the fourth expansion convolutional layer as input, computes the mean of that input while keeping the input size unchanged, and feeds the result to the fifth convolutional layer;
the decoder module comprises a first deconvolution unit, a second deconvolution unit and a third deconvolution unit connected in sequence; the first deconvolution unit, the second deconvolution unit and the third deconvolution unit each consist of an upsampling layer and a convolutional layer with a 3 × 3 kernel.
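A minimal sketch (not part of the patent) of how the claimed layer shapes compose: the five encoder units each halve the spatial size via pooling, while the expansion (dilated) convolutions in the multi-scale module preserve it. The 224 × 224 input size, "same" padding for the 3 × 3 convolutions, and 2 × 2 stride-2 max pooling are assumptions the claim leaves unspecified.

```python
def conv2d_out(n, k, s=1, p=0, d=1):
    """Output spatial size of a square conv/pool layer (floor convention)."""
    return (n + 2 * p - d * (k - 1) - 1) // s + 1

def encoder_out(n):
    """Five convolution units: 3x3 conv ("same" padding) + 2x2 stride-2 max pool."""
    for _ in range(5):
        n = conv2d_out(n, k=3, p=1)   # 3x3 conv, size preserved
        n = conv2d_out(n, k=2, s=2)   # 2x2 max pool, size halved
    return n

def same_padding(k, d):
    """Padding that keeps the size unchanged for a stride-1 dilated conv."""
    return d * (k - 1) // 2

feat = encoder_out(224)               # 224 / 2**5 = 7
# The expansion convolutions keep this size for rates 4, 8 and 12:
dilated = [conv2d_out(feat, k=3, p=same_padding(3, r), d=r) for r in (4, 8, 12)]
print(feat, dilated)                  # 7 [7, 7, 7]
```

Dilation enlarges the receptive field (an effective kernel of d·(k−1)+1) without shrinking the feature map, which is why three rates can be stacked in sequence here.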
4. The visual saliency detection method based on a multilayer non-local network according to claim 1, characterized in that the loss function is:
[Equation image FDA0002797702590000021 — loss function over the N images, combining KL_i, NSS_i and SSIM_i]
where N is the total number of images in the dataset, i is the index of the ith image, and KL_i, NSS_i and SSIM_i are the relative entropy (KL divergence), normalized scanpath saliency and structural similarity of the ith image, respectively.
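The three per-image terms named in claim 4 can be sketched as follows. The exact weighting is given only in the patent's equation image and is not reproduced here; the `combined_loss` below assumes a hypothetical unweighted form in which KL divergence is minimized while NSS and SSIM, being similarity scores, enter with a negative sign — a common convention, not the patent's formula. Maps are flattened to 1-D lists for brevity.

```python
import math

def kl_divergence(pred, gt, eps=1e-8):
    """Relative entropy KL(gt || pred) between two sum-normalized saliency maps."""
    sp, sg = sum(pred), sum(gt)
    return sum((g / sg) * math.log((g / sg + eps) / (p / sp + eps))
               for p, g in zip(pred, gt))

def nss(pred, fixation_idx):
    """Normalized scanpath saliency: mean z-scored prediction at fixation points."""
    mu = sum(pred) / len(pred)
    sd = math.sqrt(sum((p - mu) ** 2 for p in pred) / len(pred)) or 1.0
    return sum((pred[i] - mu) / sd for i in fixation_idx) / len(fixation_idx)

def combined_loss(kl_list, nss_list, ssim_list):
    """Hypothetical unweighted combination: minimize KL, maximize NSS and SSIM."""
    n = len(kl_list)
    return sum(k - s - m for k, s, m in zip(kl_list, nss_list, ssim_list)) / n
```

For identical maps KL is zero, and a high NSS or SSIM lowers the loss, so gradient descent pushes the predicted map toward the ground-truth distribution on all three criteria at once.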
5. A visual saliency detection system based on a multilayer non-local network, characterized by comprising:
an acquisition module, configured to acquire an image data set to be detected;
a processing module, configured to input the image data set to be detected into a pre-trained saliency detection model to obtain a set of saliency maps; the pre-trained saliency detection model is a multilayer non-local network model trained with the objective of minimizing the loss function.
6. The visual saliency detection system based on a multilayer non-local network according to claim 5, characterized in that the processing module comprises a model construction module for constructing a multilayer non-local neural network based on the VGG16 architecture.
7. The visual saliency detection system based on a multilayer non-local network according to claim 6, characterized in that the model construction module is configured to construct an input layer, an encoder module, a multi-scale feature module, a decoder module and an output layer connected in sequence;
the encoder module comprises a first convolution unit, a second convolution unit, a third convolution unit, a fourth convolution unit and a fifth convolution unit connected in sequence; the first convolution unit and the second convolution unit each consist of a convolutional layer with a 3 × 3 kernel and a max pooling layer connected in sequence, and the third convolution unit, the fourth convolution unit and the fifth convolution unit each consist of a convolutional layer with a 3 × 3 kernel, a max pooling layer and a non-local module layer connected in sequence;
the multi-scale feature module comprises a first convolutional layer, a second expansion convolutional layer, a third expansion convolutional layer, a fourth expansion convolutional layer, a mean layer and a fifth convolutional layer connected in sequence; the first convolutional layer has a 1 × 1 kernel; the second, third and fourth expansion convolutional layers have 3 × 3 kernels with expansion (dilation) rates of 4, 8 and 12, respectively; the fifth convolutional layer has a 1 × 1 kernel; the mean layer takes the output of the fourth expansion convolutional layer as input, computes the mean of that input while keeping the input size unchanged, and feeds the result to the fifth convolutional layer;
the decoder module comprises a first deconvolution unit, a second deconvolution unit and a third deconvolution unit connected in sequence; the first deconvolution unit, the second deconvolution unit and the third deconvolution unit each consist of an upsampling layer and a convolutional layer with a 3 × 3 kernel.
8. The visual saliency detection system based on a multilayer non-local network according to claim 5, characterized in that the processing module comprises a loss function determination module for determining the following loss function:
[Equation image FDA0002797702590000031 — loss function over the N images, combining KL_i, NSS_i and SSIM_i]
where N is the total number of images in the dataset, i is the index of the ith image, and KL_i, NSS_i and SSIM_i are the relative entropy (KL divergence), normalized scanpath saliency and structural similarity of the ith image, respectively.
CN202011337545.3A 2020-11-25 2020-11-25 Visual saliency detection method and system based on multilayer non-local network Pending CN114612675A (en)


Publications (1)

Publication Number Publication Date
CN114612675A true CN114612675A (en) 2022-06-10


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018028255A1 (en) * 2016-08-11 2018-02-15 深圳市未来媒体技术研究院 Image saliency detection method based on adversarial network
WO2019136591A1 (en) * 2018-01-09 2019-07-18 深圳大学 Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network
CN110555434A (en) * 2019-09-03 2019-12-10 浙江科技学院 method for detecting visual saliency of three-dimensional image through local contrast and global guidance
CN111340046A (en) * 2020-02-18 2020-06-26 上海理工大学 Visual saliency detection method based on feature pyramid network and channel attention



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination