CN113724261A

CN113724261A - Fast image composition method based on convolutional neural network

Info

Publication number: CN113724261A
Application number: CN202110920914.XA
Authority: CN
Inventors: 倪志彬; 何震宇; 梁淇奥; 蒋新科; 向芝莹; 周啸宇; 石爻; 李顺; 左健甫; 杨若辰; 吴世涵; 张恩华; 吉雪莲; 常世晴; 罗佳源; 陈攀宇; 王瑞锦
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2021-08-11
Filing date: 2021-08-11
Publication date: 2021-11-30

Abstract

The invention discloses a fast image composition method based on a convolutional neural network, comprising the following steps: step 1: training a view evaluation model whose architecture is based on a twin (Siamese) neural network; step 2: deploying the trained view evaluation model as a teacher model to score candidate image anchor boxes, and using the teacher's scores to train a view suggestion model as a student model that outputs the same score ranking for the same anchor boxes; step 3: extracting multi-scale features through an object detection network; step 4: passing the extracted features through a fully connected neural network layer to output an anchor box; step 5: cropping the original input image according to the anchor box obtained in step 4 to obtain a new composition. The invention trains a model to find well-composed views, is robust, generates the processed view in a very short time, and can be widely applied to image cropping, image thumbnailing, image retargeting, and real-time view suggestion.

Description

Fast image composition method based on convolutional neural network

Technical Field

The invention relates to the field of image processing, in particular to a fast image composition method based on a convolutional neural network.

Background

Early cropping methods explicitly designed various hand-crafted features based on photographic knowledge (e.g., the rule of thirds and central composition). With the development of deep learning, many researchers have turned to developing cropping methods in a data-driven manner, and the release of several benchmark datasets for comparison has greatly facilitated progress in related research.

However, obtaining the best candidate crop remains extremely difficult, mainly for three reasons. 1) The potential of image saliency information has not been fully exploited. Previous saliency-based cropping methods focused on preserving the most important content in the best crop, but overlooked how the salient region and the best crop overlap when the bounding rectangle of the salient region lies near the boundary of the source image. Moreover, saliency information is used only to generate candidate crops and is not used again in subsequent cropping modules. 2) The potential region pairs (region of interest (ROI) and region of discard (ROD)) and their underlying regularities are not well represented. In general, a pair-wise cropping method explicitly forms a pair of source images and feeds it into an automatic cropping model, but such methods often perform poorly because the choice of the source image pair depends heavily on detail and is uncertain. 3) Traditional metrics for evaluating cropping methods are unreliable and inaccurate. In some cases, intersection over union (IoU) and boundary displacement error (BDE) are not sufficient to reflect the subjectively perceived performance of a cropping method.

In the field of image processing, deep learning has brought revolutionary changes to machine learning and significant improvements on a wide variety of complex tasks. In recent years, with the dramatic increase in image data volumes, many researchers have worked on training deep neural networks (DNNs) in a distributed manner. Distributed training generally uses data-parallel stochastic gradient descent (SGD): the training examples are partitioned across workers, each worker computes gradients on its own data, all gradients are aggregated through all-reduce or a parameter server to update the model parameters, and the updated parameters are sent back to all workers for the next iteration. Many applications benefit from training a model to find well-composed views, such as image cropping, image thumbnailing, view recommendation, and autonomous photography. Image cropping, which aims to find the crop with the best aesthetic quality, is an important technique widely used in image post-processing, visual recommendation, and image selection. When a large number of images must be cropped, manual cropping becomes laborious. Automated image cropping has therefore attracted increasing attention in the research community and industry in recent years.
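The data-parallel scheme described above can be illustrated with a toy, single-process simulation. This is only a sketch under stated assumptions: the workers are simulated as list shards, the model is a single weight fitted to y = 3x, and the helper names `local_gradient`, `all_reduce_mean` and `train` are hypothetical rather than taken from any library.

```python
def local_gradient(w, shard):
    """Mean gradient of 0.5 * (w*x - y)^2 over one worker's data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the same average."""
    return sum(values) / len(values)


def train(shards, lr=0.05, steps=200):
    w = 0.0  # all workers start from the same parameter value
    for _ in range(steps):
        # Each worker computes a gradient on its own shard (in parallel in practice).
        grads = [local_gradient(w, shard) for shard in shards]
        # The aggregated gradient gives an identical update on every worker.
        w -= lr * all_reduce_mean(grads)
    return w


# Two simulated workers, each holding part of the data for y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w_final = train(shards)
```

Because every worker applies the same averaged gradient, the parameter copies stay synchronized without exchanging the raw training data.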

Patent application CN202110400578.6 discloses a saliency-aware image cropping method, device, computing device and storage medium, which address the problems in the prior art of insufficient use of image saliency information and possible overfitting of the model.

However, assigning anchor boxes to ground-truth views based on an overlap measure makes a composition model difficult to train, because slightly adjusting a view typically produces a large difference in composition quality. Moreover, the annotations are not exhaustive, so most anchor boxes are not annotated.

Meanwhile, unlike in the object detection setting, unannotated anchor boxes cannot be assumed to be negative samples.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a fast image composition method based on a convolutional neural network.

The purpose of the invention is realized by the following technical scheme:

A fast image composition method based on a convolutional neural network comprises the following steps:

step 1: training a view evaluation model whose architecture is based on a twin (Siamese) neural network;

step 2: deploying the trained view evaluation model as a teacher model to score candidate image anchor boxes, and using the teacher's scores to train a view suggestion model as a student model that outputs the same score ranking for the same anchor boxes;

step 3: extracting multi-scale features through an object detection network;

step 4: passing the extracted features through a fully connected neural network layer to output an anchor box;

step 5: cropping the original input image according to the anchor box obtained in step 4 to obtain a new composition.

Further, the step 1 comprises the following substeps:

step 101: two sub-networks are adopted; each sub-network receives one input, maps it to a high-dimensional feature space, and outputs the corresponding representation;

step 102: the similarity of the two inputs is compared by calculating the distance between the two representations.

Further, the distance between the two representations is the Euclidean distance.
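As a minimal sketch of substeps 101 and 102, the two branches can share one set of weights and be compared by Euclidean distance. The single linear layer with ReLU used here is only a toy stand-in for the convolutional sub-network, and `embed`, `shared_weights` and the example vectors are hypothetical.

```python
import math


def embed(x, weights):
    """Toy shared sub-network: one linear layer followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def euclidean_distance(a, b):
    """Euclidean distance between two representations of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# Both branches use the same weights, which is what makes the network "twin".
shared_weights = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
view_a = [0.2, 0.4, 0.1]
view_b = [0.9, 0.1, 0.3]

dist_ab = euclidean_distance(embed(view_a, shared_weights),
                             embed(view_b, shared_weights))
dist_aa = euclidean_distance(embed(view_a, shared_weights),
                             embed(view_a, shared_weights))
```

A smaller distance indicates that the two inputs are mapped to more similar representations; identical inputs give a distance of zero.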

Further, the loss function used to train the student model is:

wherein y represents the score output by the teacher model, q represents the score output by the student model, and n represents the number of output scores.
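The equation itself is not reproduced in this text. One form that is consistent with the symbol definitions above, and with the name "mean pairwise squared error" used later in the description, is given below. It is an assumed reconstruction for illustration, not the patent's own equation; the pair indices i and j are introduced here.

```latex
L(y, q) = \frac{1}{n(n-1)/2} \sum_{i=1}^{n} \sum_{j=i+1}^{n}
          \bigl( (y_i - y_j) - (q_i - q_j) \bigr)^2
```

Under this form, only the differences between scores matter, so the student is pushed to reproduce the teacher's relative ordering of the anchor boxes rather than its absolute score values.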

Further, the loss function migrates the knowledge owned by the teacher model to the student model during the training phase, and the parameters of the student model are continuously optimized through back propagation.

Further, the convolution kernel size of the view evaluation model is 3 × 3; the structure arranges convolutional layers and pooling layers alternately and increases the number of nonlinear transformation layers.

The beneficial effects of the invention are as follows: the invention trains a model to find well-composed views, is robust, generates the processed view in a very short time, and can be widely applied to image cropping, image thumbnailing, image retargeting, and real-time view suggestion.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.

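In this embodiment, as shown in FIG. 1, a fast image composition method based on a convolutional neural network includes the following steps:

step 1: training a view evaluation model whose architecture is based on a twin (Siamese) neural network;

step 2: deploying the trained view evaluation model as a teacher model to score candidate image anchor boxes, and using the teacher's scores to train a view suggestion model as a student model that outputs the same score ranking for the same anchor boxes;

step 3: extracting multi-scale features through an object detection network;

step 4: passing the extracted features through a fully connected neural network layer to output an anchor box;

step 5: cropping the original input image according to the obtained anchor box to obtain a new composition.

Further, step 1 comprises the following substeps:

step 101: two sub-networks are adopted; each sub-network receives one input, maps it to a high-dimensional feature space, and outputs the corresponding representation;

step 102: the similarity of the two inputs is compared by calculating the distance between the two representations.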

Further, the loss function used to train the student model is:

wherein y represents the score output by the teacher model, q represents the score output by the student model, and n represents the number of output scores.

The invention adopts a novel knowledge transfer framework to train a real-time, anchor-box-based view suggestion model. That is, the knowledge learned by the teacher model is transferred to the student model, so that the student model has few parameters and composes quickly while matching the teacher's results as closely as possible.

Unlike in object proposal networks, label assignment for the view proposals of the present solution is very challenging. First, assigning anchor boxes to ground-truth views based on an overlap measure makes a composition model difficult to train, because slightly adjusting a view typically produces a large difference in composition quality. Second, the annotations are not exhaustive, so most anchor boxes are not annotated; unlike in object detection, these unannotated boxes cannot be assumed to be negative samples.

A view evaluation model with a twin neural network architecture is trained as the teacher model: two sub-networks are used, each receiving one input, mapping it to a high-dimensional feature space, and outputting the corresponding representation. The similarity of the two inputs is compared by calculating the distance between the two representations, e.g. the Euclidean distance. The invention then deploys this model as the teacher to score candidate image anchor boxes, and uses the teacher's scores to train a view suggestion network, as the student model, to output the same anchor box score ranking. To train the student, the present invention proposes a mean pairwise squared error (MPSE) loss.

The loss function migrates the knowledge owned by the teacher model to the student model during the training phase, and the parameters of the student model are continuously optimized through back propagation.
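To make the idea of a mean pairwise squared error concrete, here is a minimal sketch in plain Python. It assumes the loss averages, over all pairs of anchor boxes, the squared difference between the teacher's score gap and the student's score gap; the exact equation is not reproduced in this text, so that form and the function name `mpse_loss` are assumptions.

```python
def mpse_loss(teacher_scores, student_scores):
    """Mean, over all pairs i < j, of ((y_i - y_j) - (q_i - q_j))^2."""
    n = len(teacher_scores)
    if n != len(student_scores) or n < 2:
        raise ValueError("need two score lists of equal length >= 2")
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            teacher_gap = teacher_scores[i] - teacher_scores[j]
            student_gap = student_scores[i] - student_scores[j]
            total += (teacher_gap - student_gap) ** 2
    return total / (n * (n - 1) / 2)


y = [0.9, 0.5, 0.1]           # teacher scores for three anchor boxes
q_shifted = [1.9, 1.5, 1.1]   # same ranking and gaps, different offset
q_reversed = [0.1, 0.5, 0.9]  # opposite ranking

loss_shifted = mpse_loss(y, q_shifted)
loss_reversed = mpse_loss(y, q_reversed)
```

With this form, a student whose scores differ from the teacher's only by a constant offset incurs essentially no loss, while a student that reverses the ranking is penalized heavily. In actual training the same quantity would be computed on tensors so that it can be back-propagated.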

In this embodiment, a very deep convolutional neural network for large-scale image composition is employed. A convolutional neural network is a feed-forward neural network that contains convolution computations and has a deep structure. Convolutional neural networks have representation learning ability and can perform translation-invariant classification of input information according to their hierarchical structure, so they are also called "translation-invariant artificial neural networks". A standard convolutional neural network is a hierarchical structure with five main kinds of layers: a data input layer, a convolution layer, a ReLU activation layer, a pooling layer, and a fully connected layer, with data passed from layer to layer.

Because a standard convolutional neural network suffers from problems such as vanishing and exploding gradients, the depth of the model is limited during training, and image features cannot be extracted well.

Therefore, the invention adopts a very deep convolutional neural network for large-scale image composition based on the VGG framework: larger convolution kernels are replaced by kernels of size 3 × 3, convolutional layers and pooling layers are arranged alternately, and the number of nonlinear transformation layers is increased. This greatly reduces the number of parameters the model must train, speeds up model training and inference, and improves generalization.

The model is applied in Scissorhands, and tests show that it improves both composition quality and cropping speed. The VGG framework builds on a standard convolutional neural network: three stacked 3 × 3 convolution kernels replace one 7 × 7 kernel, and two stacked 3 × 3 kernels replace one 5 × 5 kernel.
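As a rough check of why stacked 3 × 3 kernels save parameters (three in place of one 7 × 7, two in place of one 5 × 5, as in the VGG design), the following sketch counts weights and receptive fields. It assumes stride-1 convolutions with the same number of input and output channels and ignores biases; the channel count of 64 and the function names are illustrative only.

```python
def stacked_conv_params(kernel, layers, channels):
    """Weights (no biases) of `layers` stacked kernel x kernel convolutions,
    each with `channels` input channels and `channels` output channels."""
    return layers * kernel * kernel * channels * channels


def receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return 1 + layers * (kernel - 1)


C = 64  # illustrative channel count
three_3x3 = stacked_conv_params(3, 3, C)  # 27 * C^2 weights
one_7x7 = stacked_conv_params(7, 1, C)    # 49 * C^2 weights
two_3x3 = stacked_conv_params(3, 2, C)    # 18 * C^2 weights
one_5x5 = stacked_conv_params(5, 1, C)    # 25 * C^2 weights
```

The stacks cover the same receptive field as the larger kernel with fewer weights, and each extra layer adds another nonlinearity between the convolutions.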

The model architecture and parameters are shown in Table 1.

TABLE 1 Model architecture and parameters

Table 2: Number of parameters (in millions).

Network	A, A-LRN	B	C	D	E
Number of parameters	133	133	134	138	144

The SSD is based on a convolutional network that produces a fixed-size set of bounding boxes together with scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step that generates the final detections. The early network layers are based on a standard architecture used for high-quality image classification (truncated before any classification layer), which the present invention calls the base network; auxiliary structures are then added to the network to generate detections with the key features. The SSD network can also improve the composition quality of Scissorhands, mainly because of one important idea of the network: a feature pyramid detects targets at multiple scales to improve detection accuracy. Specifically, (1) the higher the feature layer, the richer its semantic information; different feature layers exploit features at different levels, and detecting on several layers gives better results than detecting on the last layer alone; (2) from low to high feature layers the receptive field grows from small to large, so different feature layers help detect targets of different sizes. The SSD network adds several feature layers on top of the VGG16 base layers: FC7 in VGG16 is changed into a convolutional layer Conv7, and the feature layers Conv8, Conv9, Conv10 and Conv11 are added.
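The non-maximum suppression step mentioned above can be sketched in a few lines. This is a generic greedy version, not code from the invention; boxes are assumed to be (x1, y1, x2, y2) tuples, and the 0.5 IoU threshold is an illustrative choice.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

Here the second box overlaps the first almost completely and is suppressed, while the distant third box is kept.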

In this embodiment, taking an image as an example, the specific process of performing image cropping is as follows:

(1) Input image data normalization: the three channels of the input image are normalized using mean values of 0.486, 0.456 and 0.406 and standard deviations of 0.229, 0.224 and 0.225, respectively.

(2) Computing the box vertices: sub-images are obtained through the preset anchor boxes, the vertex coordinates of each box are calculated, and the obtained sub-images and box coordinates are stored.

(3) Acquiring the image and model parameters: the path of the image to be used and the parameters of the trained model are obtained.

(4) Data augmentation: the image data is augmented by converting the PIL image into a numpy array, resizing the image, shuffling the order of the images, randomly changing the gray values of the image, adding Gaussian noise to the image, distorting the image, and so on; data augmentation improves the robustness of the trained model. Each augmentation method can be implemented as a class (object) that is wrapped and called, and the augmentation functions are combined and wrapped in classes for calling, thereby enlarging the training sample set.

(5) Visualizing the cropped image: the predefined anchor boxes are obtained, together with the cropped images image_crops and the positions bboxes of the cropped images relative to the original image.
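Steps (1), (2) and (5) above can be sketched without any imaging library by treating an image as a nested list of RGB pixels in [0, 1]. Several things here are assumptions: the third mean is read as 0.406, anchors are represented as relative (x1, y1, x2, y2) coordinates, and the helper names are hypothetical. Only `image_crops` and `bboxes` are names taken from the text.

```python
# Means as stated in step (1); the commonly used ImageNet value for the
# first channel is 0.485, so 0.486 is kept here only to follow the text.
MEAN = (0.486, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize(image):
    """Per-channel (value - mean) / std for an H x W x 3 nested list."""
    return [[[(px[c] - MEAN[c]) / STD[c] for c in range(3)] for px in row]
            for row in image]


def anchor_to_bbox(anchor, width, height):
    """Convert a relative anchor (x1, y1, x2, y2) to pixel vertices."""
    x1, y1, x2, y2 = anchor
    return (int(round(x1 * width)), int(round(y1 * height)),
            int(round(x2 * width)), int(round(y2 * height)))


def crop(image, bbox):
    """Cut the sub-image covered by a pixel bounding box."""
    x1, y1, x2, y2 = bbox
    return [row[x1:x2] for row in image[y1:y2]]


def crops_from_anchors(image, anchors):
    """Return the cropped sub-images and their boxes in the original image."""
    height, width = len(image), len(image[0])
    bboxes = [anchor_to_bbox(a, width, height) for a in anchors]
    image_crops = [crop(image, b) for b in bboxes]
    return image_crops, bboxes


image = [[[0.5, 0.5, 0.5] for _ in range(8)] for _ in range(6)]  # 6 rows x 8 cols
anchors = [(0.0, 0.0, 0.5, 0.5), (0.25, 0.0, 1.0, 1.0)]  # hypothetical anchors
image_crops, bboxes = crops_from_anchors(image, anchors)
normalized = normalize(image)
```

In practice the anchor set would be the predefined one used by the trained model, and the crops would be taken from the original, unnormalized image.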

VGG: the classic VGG network is used with its pre-trained parameters; the rest of the architecture is the same as that of the VGG network, and during training the parameters of the last fully connected layer are activated and updated. Twin network: a classical twin network architecture is applied here.

Training of the model: the model is trained on the GPU, and multiple GPUs can be called for parallel training.
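Before training, the class-based data augmentation of step (4) above might be composed roughly as follows. This is a sketch under assumptions: images are small grayscale nested lists in [0, 1], only Gaussian noise, a random gray-value shift and shuffling are shown, and the class names `AddGaussianNoise`, `RandomBrightness` and `Compose` are hypothetical.

```python
import random


class AddGaussianNoise:
    """Add zero-mean Gaussian noise to every pixel, clipped to [0, 1]."""
    def __init__(self, sigma=0.05):
        self.sigma = sigma

    def __call__(self, image, rng):
        return [[min(1.0, max(0.0, v + rng.gauss(0.0, self.sigma)))
                 for v in row] for row in image]


class RandomBrightness:
    """Shift all gray values by one random offset, clipped to [0, 1]."""
    def __init__(self, max_delta=0.2):
        self.max_delta = max_delta

    def __call__(self, image, rng):
        delta = rng.uniform(-self.max_delta, self.max_delta)
        return [[min(1.0, max(0.0, v + delta)) for v in row] for row in image]


class Compose:
    """Apply a list of augmentation objects in order."""
    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, image, rng):
        for transform in self.transforms:
            image = transform(image, rng)
        return image


def augment_dataset(images, pipeline, copies, seed=0):
    """Return the originals plus `copies` shuffled augmented versions of each."""
    rng = random.Random(seed)
    augmented = [pipeline(img, rng) for img in images for _ in range(copies)]
    rng.shuffle(augmented)
    return images + augmented


images = [[[0.1, 0.5], [0.9, 0.3]], [[0.0, 1.0], [0.4, 0.6]]]
pipeline = Compose([RandomBrightness(0.2), AddGaussianNoise(0.05)])
training_set = augment_dataset(images, pipeline, copies=3)
```

Wrapping each augmentation in a callable class keeps the individual operations interchangeable, so the pipeline can be recombined without changing the training loop.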

(6) Checking whether the data is valid: checking whether the input data contains NaN or infinity values or is all zeros, checking whether batch normalization is effective, and obtaining the batch-normalized results of all input data (batch normalization is used to allow deeper models to be trained and to improve the robustness of the model).

(7) Saving the trained model: the model parameters of the current state, all parameters of the model, and the parameters of the best N models are saved separately.

(8) Generating image cropping annotations (i.e., generating anchor boxes): converting the image vectors into the torch.FloatTensor format so that the PyTorch library can be called to accelerate deep learning operations; returning a batch of training data for parallel training; and returning the image shape, dynamically updating the learning rate, setting the learning rate, and calculating the average accuracy.

A) Creating an output file;

B) generating the cropping labels and saving them (.txt format);

C) saving the cropping data (.json format).
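The validity check of step (6) and the output steps A) to C) could look roughly like the sketch below. The text does not specify the file layout, so the one-box-per-line .txt format, the JSON keys, the file naming and the function names are all assumptions made for illustration.

```python
import json
import math
import os
import tempfile


def is_valid(values):
    """Reject data that contains NaN or infinity, or that is all zeros."""
    if any(math.isnan(v) or math.isinf(v) for v in values):
        return False
    return any(v != 0 for v in values)


def save_crop_annotations(out_dir, image_name, bboxes, scores):
    """Write a .txt label file and a .json data file for one image."""
    os.makedirs(out_dir, exist_ok=True)  # A) create the output location
    stem = os.path.splitext(image_name)[0]

    txt_path = os.path.join(out_dir, stem + ".txt")
    with open(txt_path, "w", encoding="utf-8") as f:  # B) one box per line
        for (x1, y1, x2, y2), score in zip(bboxes, scores):
            f.write(f"{x1} {y1} {x2} {y2} {score:.4f}\n")

    json_path = os.path.join(out_dir, stem + ".json")
    with open(json_path, "w", encoding="utf-8") as f:  # C) crop data as JSON
        json.dump({"image": image_name,
                   "crops": [{"bbox": list(b), "score": s}
                             for b, s in zip(bboxes, scores)]}, f, indent=2)
    return txt_path, json_path


out_dir = tempfile.mkdtemp()
txt_path, json_path = save_crop_annotations(
    out_dir, "photo.jpg", [(0, 0, 4, 3)], [0.87])
```

Keeping both a line-oriented label file and a structured JSON file makes the results easy to inspect by hand and to load programmatically.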

The invention trains a model to find well-composed views, is robust, generates the processed view in a very short time, and can be widely applied to image cropping, image thumbnailing, image retargeting, and real-time view suggestion.

It should be noted that, for simplicity of description, the above-mentioned embodiments of the method are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the order of acts described, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and elements referred to are not necessarily required in this application.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a ROM, a RAM, etc.

The above disclosure describes only preferred embodiments of the present invention and is not to be taken as limiting the scope of the claims of the invention.

Claims

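1. A fast image composition method based on a convolutional neural network, characterized by comprising the following steps:

step 1: training a view evaluation model whose architecture is based on a twin (Siamese) neural network;

step 2: deploying the trained view evaluation model as a teacher model to score candidate image anchor boxes, and using the teacher's scores to train a view suggestion model as a student model that outputs the same score ranking for the same anchor boxes;

step 3: extracting multi-scale features through an object detection network;

step 4: passing the extracted features through a fully connected neural network layer to output an anchor box;

step 5: cropping the original input image according to the anchor box obtained in step 4 to obtain a new composition.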

2. The convolutional neural network-based fast image composition method as claimed in claim 1, wherein said step 1 comprises the following sub-steps:

3. The convolutional neural network-based fast image composition method as claimed in claim 2, wherein the distance between the two representations is the Euclidean distance.

4. The convolutional neural network-based fast image composition method as claimed in claim 1, wherein the loss function used for training the student model is:

5. The convolutional neural network-based fast image composition method as claimed in claim 4, wherein the loss function migrates the knowledge owned by the teacher model to the student model in the training phase, and the parameters of the student model are continuously optimized by back propagation.

6. The convolutional neural network-based fast image composition method as claimed in claim 1, wherein the convolution kernel size of the view evaluation model is 3 × 3; the structure arranges convolutional layers and pooling layers alternately and increases the number of nonlinear transformation layers.