CN111340088A - Image feature training method, model, device and computer storage medium - Google Patents

Image feature training method, model, device and computer storage medium

Info

Publication number
CN111340088A
Authority
CN
China
Prior art keywords
feature
training
layer
feature training
image
Prior art date
Legal status
Pending
Application number
CN202010107584.8A
Other languages
Chinese (zh)
Inventor
商琦
杜梓平
张莉
Current Assignee
Suzhou Industrial Park Institute of Services Outsourcing
Original Assignee
Suzhou Industrial Park Institute of Services Outsourcing
Priority date
Filing date
Publication date
Application filed by Suzhou Industrial Park Institute of Services Outsourcing
Priority to CN202010107584.8A
Publication of CN111340088A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/217: Validation; Performance evaluation; Active pattern learning techniques
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks


Abstract

The invention provides an image feature training method, model, device and computer storage medium for deep learning in the field of artificial intelligence. The image feature training method comprises the following steps: receiving a feature map of an image to be feature-trained as the input of the first feature training layer; setting the number of feature training layers; determining the input of the third and each subsequent feature training layer from the outputs of at least some non-adjacent preceding feature training layers, where the number of subsequent layers is related to the configured number of feature training layers; and performing at least one convolution operation on the input of each feature training layer to determine the output of that layer. The method fuses and superimposes the outputs of at least some preceding feature training layers as the input of the current feature training layer, compensating for the feature loss during feature training; by setting the number of feature training layers appropriately, the quality of image feature extraction and the prediction effect are kept at a high level.

Description

Image feature training method, model, device and computer storage medium
Technical Field
The invention relates to the technical field of image feature training, and more particularly to an image feature training method, model, apparatus, and computer storage medium.
Background
Image feature training is one of the most important research directions of machine learning and deep learning in the field of computer vision. It is a key link of feature training and is widely applied in derivative network models based on neural networks; typical derivative networks include AlexNet, VGG-Net, LeNet and GoogLeNet. The most common technical steps in the image feature training process are the convolution operation, the pooling operation and the fully-connected operation, where the convolution and pooling operations can be performed multiple times, repeatedly and in combination. For example, the original image is subjected to convolution, pooling, convolution and pooling operations in sequence, and then a fully-connected operation is performed, so that the feature information in the original image is learned.
In existing image feature training, image features are extracted by a multi-level stack of feature training layers: the input of the current feature training layer is the output of the previous layer, and the output of the current layer is the input of the next layer. With this design, in which the current feature training layer interacts only with its adjacent layers, at least part of the feature information of the non-adjacent feature training layers is inevitably lost, and the lost local feature information cannot be compensated in subsequent feature training layers, so the image feature training effect is poor.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the art described above.
To this end, the first object of the invention is to provide an image feature training method that fuses and adds the outputs of at least some preceding feature training layers as the input of the current feature training layer, to compensate for the feature learning loss during feature training, and that keeps the quality of image feature extraction and the prediction effect at a good level by setting the number of feature training layers reasonably.
The second object of the invention is to provide an image feature training apparatus which, mainly by providing an input determining module and an output determining module, fuses and adds the outputs of at least some preceding feature training layers as the input of the current feature training layer to compensate for the feature learning loss during feature training, and keeps the quality of image feature extraction and the prediction effect at a good level by setting the number of feature training layers reasonably.
The third object of the invention is to provide an image feature training model: at least one group of convolution and pooling operations is performed in sequence on the output of the last feature training layer determined by the image feature training method of the first object, giving the image feature training model and improving the quality of image feature extraction and the prediction effect.
The fourth object of the invention is to provide a computer storage medium storing computer instructions which, when executed, fuse and add the outputs of at least some preceding feature training layers as the input of the current feature training layer, compensating the feature loss during feature training and improving the quality of image feature extraction and the prediction effect.
In order to achieve the above object, an embodiment of a first aspect of the present invention provides an image feature training method, comprising: receiving a feature map of an image to be feature-trained as the input of the first feature training layer; setting the number of feature training layers; determining the input of the third and each subsequent feature training layer from the outputs of at least some non-adjacent preceding feature training layers, the number of subsequent feature training layers being related to the configured number of feature training layers; and performing at least one convolution operation on the input of each feature training layer to determine the output of that layer.
Preferably, each of the feature training layers is connected in sequence.
Preferably, the feature map is determined based on at least a preconfigured convolution kernel and sliding window interval, and contains local features of the image to be feature-trained.
Preferably, the type of the convolution kernel is determined at least according to the feature shape of the image to be feature-trained.
Preferably, the convolution kernel is a deformable convolution kernel including offset matrix parameters, which participate in image feature learning and can be trained and updated.
Preferably, the sliding window interval is determined according to a learning strategy comprising: setting a larger convolution kernel sliding window interval in regions close to the image edge, and a smaller convolution kernel sliding window interval in regions away from the image edge.
Preferably, determining the feature map based on at least the convolution kernel and the sliding window interval comprises: performing a convolution operation on the image with the convolution kernel, using the sliding window interval as the step size, to obtain the feature map.
Preferably, the number of layers of the set feature training layer is an integer greater than or equal to 2 and less than or equal to 15, so as to improve the effect of the image feature training.
Preferably, when the number of layers of the set feature training layer is 5, the effect of the image feature training is optimal.
Preferably, determining the input of the third and each subsequent feature training layer from the outputs of at least some non-adjacent preceding feature training layers comprises: setting the weight of each feature training layer according to its degree of influence on the subsequent feature training layers; assigning the weight of each feature training layer to the output of the corresponding non-adjacent preceding feature training layer to determine its effective output to the current feature training layer; and adding the effective outputs of the feature training layers to the current feature training layer together with the output of the previous feature training layer, and updating the input of the current feature training layer according to the addition result.
Preferably, the feature training layer weights participate in the image feature learning and can be trained and updated.
In order to achieve the above object, a second aspect of the present invention provides an image feature training apparatus, comprising: a feature map receiving module, configured to receive a feature map of an image to be feature-trained as the input of the first feature training layer; a layer number setting module, configured to set the number of feature training layers; an input determining module, configured to determine the inputs of the third and subsequent feature training layers at least from the outputs of some preceding feature training layers, the number of subsequent feature training layers being associated with the layer number setting module; and an output determining module, configured to perform at least one convolution operation on the input of each feature training layer to determine the output of that layer.
Preferably, the input of the second layer feature training layer is the output of the first layer feature training layer.
Preferably, the feature map is determined based on at least a preconfigured convolution kernel and sliding window interval, and contains local features of the image to be feature-trained.
Preferably, the type of the convolution kernel is determined at least according to the feature shape of the image to be feature-trained.
Preferably, the convolution kernel is a deformable convolution kernel including offset matrix parameters, which participate in image feature learning and can be trained and updated.
Preferably, the sliding window interval is determined according to a learning strategy comprising: setting a larger convolution kernel sliding window interval in regions close to the image edge, and a smaller convolution kernel sliding window interval in regions away from the image edge.
Preferably, determining the feature map based on at least the convolution kernel and the sliding window interval comprises: performing a convolution operation on the image with the convolution kernel, using the sliding window interval as the step size, to obtain the feature map.
Preferably, the number of layers of the set feature training layer is an integer greater than or equal to 2 and less than or equal to 15, so as to improve the effect of the image feature training.
Preferably, when the number of layers of the set feature training layer is 5, the effect of the image feature training is optimal.
Preferably, a convolution kernel determining module determines the convolution kernel at least according to the shape of the image features to be learned.
Preferably, the convolution kernel is a deformable convolution kernel including offset matrix parameters, which participate in the image feature learning and can be trained and updated.
Preferably, a sliding window interval determining module determines the sliding window interval according to a learning strategy comprising: setting a larger convolution kernel sliding window interval in regions close to the image edge, and a smaller convolution kernel sliding window interval in regions away from the image edge.
Preferably, a feature map obtaining module performs a convolution operation on the image with the convolution kernel, using the sliding window interval as the step size, to obtain the feature map.
Preferably, the input determining module comprises: a weight setting module, configured to set the weight of each feature training layer according to its degree of influence on the subsequent feature training layers; an effective output determining module, configured to assign the weights of the feature training layers to the outputs of the corresponding non-adjacent preceding feature training layers to determine the effective outputs of those layers to the current feature training layer; and an input adding module, configured to add the effective outputs to the output of the previous feature training layer and update the input of the current feature training layer with the addition result.
Preferably, the weights participate in the image feature learning and may be trained and updated.
In order to achieve the above object, a third aspect of the present invention provides an image feature training model: at least one group of convolution and pooling operations is performed in sequence on the output of the last feature training layer determined by any one of the image feature training methods described above, giving the image feature training model.
In order to achieve the above object, a fourth aspect of the present invention provides a computer storage medium storing computer readable instructions which, when executed by a computer, cause the computer to perform any one of the image feature training methods described above.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
The invention is further described with reference to the following figures and examples.
Drawings
Fig. 1 is a basic flow chart of an image feature training method according to an embodiment of the present invention.
Fig. 2 is the first basic schematic block diagram of the image feature training method according to an embodiment of the present invention.
Fig. 3 is the second basic schematic block diagram of the image feature training method according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a deformable convolution according to an embodiment of the present invention.
Fig. 5 is a block diagram of an image feature training apparatus according to an embodiment of the present invention.
FIG. 6 is a structural block diagram of an image feature training model according to an embodiment of the present invention.
FIG. 7 is a basic block diagram of an image feature training model according to an embodiment of the present invention.
Detailed Description
Example one
The convolutional neural network model performs well in the extraction, training and prediction of image features, and good image prediction accuracy can be achieved by training on a large number of task-specific data sets. In practical applications, however, the higher-order features produced when each convolutional layer convolves its input feature map usually cannot be guaranteed to be complete; in other words, the image features extracted by the convolution operation may suffer feature loss of varying degrees, so the extracted high-order features are incomplete and distorted and differ substantially from the corresponding features in the original image. As a result, the prediction of the model deviates greatly from the expected result and the accuracy is low.
The convolutional neural network is one of the representative neural networks in deep learning and has made many breakthroughs in image analysis and processing; ImageNet, for example, is a commonly used labelled image data set. Convolutional neural networks are mainly applied to technical scenarios such as image feature extraction and classification and scene recognition. Compared with traditional image processing algorithms, they avoid the complex image preprocessing stage, in particular manual image preprocessing; a target image can be predicted with high accuracy simply by performing sufficient supervised or unsupervised learning on the given labelled image data set.
With reference to fig. 1 to 3, a detailed description is given of an image feature training method provided by an embodiment of the present invention, which includes the following steps:
s1, receiving a feature map of the image to be feature-trained as an input of a first feature training layer;
the initial input data received by the image feature training model is a feature map obtained after convolution operation of an image to be feature trained, and it should be understood that an original image cannot be directly used as input of a first feature training layer of the image feature training model. The input of the first feature training layer is a feature map, and the premise is that the image to be feature learned needs to be initialized, wherein the initialization at least comprises one of image size adjustment, image channel number change, image filtering and image interpolation. And performing convolution operation on the initialized image to obtain a feature map corresponding to the image, wherein the feature map is used as the input of the first feature training layer.
The received image to be feature-trained can be a colour multi-channel image, such as a three-channel RGB image, or a single-channel grey-scale image. The feature training model restricts the size of the received image, e.g. to 26 by 26 pixels or 28 by 28 pixels; if the size of the received image does not match the size required by the model, the image size is calibrated after reception so that it matches the model requirement, which facilitates subsequent processing.
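For concreteness, the initialization and feature-map step might look like the following sketch; the use of PyTorch/torchvision, the file name, and the 28 by 28 single-channel target size are illustrative assumptions, not requirements of the method.

```python
# Illustrative sketch of the initialization described above (channel change,
# resizing/interpolation) and of the first convolution that produces the
# feature map; library, file name and sizes are assumptions.
import torch
from torchvision import transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),  # change the channel count
    transforms.Resize((28, 28)),                  # calibrate the image size
    transforms.ToTensor(),                        # HxWxC image -> CxHxW float tensor
])

image = Image.open("sample.png")       # hypothetical input image
x = preprocess(image).unsqueeze(0)     # add batch dimension: (1, 1, 28, 28)

# A first convolution turns the initialized image into the feature map that
# the first feature training layer receives.
first_conv = torch.nn.Conv2d(in_channels=1, out_channels=3, kernel_size=3, padding=1)
feature_map = first_conv(x)            # shape (1, 3, 28, 28)
```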
S2, setting the number of layers of the feature training layer;
the image feature training method or process, necessarily performs training and prediction operations in a limited number of feature training layers.
For example, a convolutional neural network model may be organized as follows:
first feature training layer: receiving the feature map;
second feature training layer: performing a convolution operation on the feature map from the first layer;
third feature training layer: performing a pooling operation on the convolution result from the second layer;
fourth feature training layer: performing a second convolution operation on the pooling result from the third layer;
fifth feature training layer: performing a second pooling operation on the second convolution result from the fourth layer;
sixth feature training layer: performing a fully-connected operation on the second pooling result from the fifth layer.
As in the above example, the feature training method comprises six layers, i.e. the number of image feature training layers is 6, and so on.
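As an illustrative sketch only (the channel counts, input size and class count are assumptions, not given by the patent), the six-layer example above maps onto a small PyTorch module:

```python
import torch.nn as nn

class SixLayerExample(nn.Module):
    """Six "feature training layers" as enumerated above; all sizes are
    illustrative assumptions."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)    # layer 2: convolution
        self.pool1 = nn.MaxPool2d(2)                              # layer 3: pooling
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)   # layer 4: second convolution
        self.pool2 = nn.MaxPool2d(2)                              # layer 5: second pooling
        self.fc = nn.Linear(16 * 7 * 7, 10)                       # layer 6: fully connected

    def forward(self, feature_map):                # layer 1: receive the feature map
        x = self.pool1(self.conv1(feature_map))
        x = self.pool2(self.conv2(x))
        return self.fc(x.flatten(1))               # assumes a 28 x 28 input
```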
It should be understood that setting the number of feature training layers within a reasonable range makes the feature training and prediction effect better and raises the prediction accuracy without adding much training resource; when the number takes a specific value, the prediction accuracy can reach the best level among models of the same type.
Table 1 below shows the mapping between the number of feature training layers and the average prediction accuracy; the convolution kernel used is 3 × 3 and the prediction target is handwritten digits. It can be seen that the number of feature training layers is directly related to the final average prediction accuracy: when it is set to an integer between 2 and 15 inclusive, a prediction accuracy above 85% is obtained and the prediction effect is good; when it is set to 5, the best average prediction accuracy is obtained; when it is set to 16 or more, obvious overfitting occurs, and the excessively large number of training parameters makes the algorithm too complex, which ultimately harms the prediction accuracy and efficiency.
TABLE 1 Mapping between the number of feature training layers and the average prediction accuracy

| Layers | Accuracy | Layers | Accuracy | Layers | Accuracy |
| ------ | -------- | ------ | -------- | ------ | -------- |
| 2 | 86.31% | 7 | 89.64% | 12 | 88.29% |
| 3 | 86.87% | 8 | 89.20% | 13 | 87.24% |
| 4 | 89.22% | 9 | 89.08% | 14 | 86.06% |
| 5 | 92.06% | 10 | 88.97% | 15 | 85.65% |
| 6 | 90.73% | 11 | 88.63% | 16 | 82.75% |
It should be understood that the configured number of feature training layers is tied to the prediction accuracy and prediction efficiency of the image feature training method or model. Choosing a good value range, or the optimal value, in deep learning practice reduces the number of parameters to be learned during training while preserving prediction accuracy, thereby further improving prediction efficiency; it is a very important and critical step in the training process.
S3, determining the input of the third and each subsequent feature training layer from the outputs of at least some non-adjacent preceding feature training layers, the number of subsequent feature training layers being related to the configured number of feature training layers;
after the number of layers of the feature training layer is determined, a training process can be performed. The training network on which the image feature training method is based consists of a first feature training layer, a second feature training layer, … … and an Nth feature training layer, where N is understood to be the number of feature training layers set in step S2. Assuming that the number of feature training layers set in step S2 is an optimal value of 5, the training method or model totally involves 5 feature training layers connected in sequence, which are a first feature training layer, a second feature training layer, a third feature training layer, a fourth feature training layer, and a fifth feature training layer. Each characteristic training layer maintains different degrees of connection relation with at least other characteristic training layers, and the connection relation is understood to mean that the output of the preamble characteristic training layer is connected with the input of other characteristic training layers which are associated with the output of the preamble characteristic training layer and have data transmission with each other, namely the preamble characteristic training layer and other characteristic training layers maintain data channels so as to complete different degrees of data transmission.
That the inputs of the third and subsequent feature training layers are determined by at least the outputs of some non-adjacent preceding feature training layers should be understood as follows: unlike the prior art described in the background, the input of each such layer is no longer connected only to the feature training layer immediately preceding it, but also remains connected to at least some preceding feature training layers that are not adjacent to it.
Assuming the number of feature training layers is 5, fig. 2 exemplarily shows the fully connected configuration corresponding to the image feature training method; box 1 denotes the first feature training layer and its internal processing, and so on:
the input of a third feature training layer is determined by the output of a first feature training layer and the output of a second feature training layer, the first feature training layer is not adjacent to the third feature training layer, and the second feature training layer is adjacent to the third feature training layer; the first feature training layer and the second feature training layer are both pre-order feature training layers of the third feature training layer, and the fourth feature training layer and the fifth feature training layer are both subsequent feature training layers of the third feature training layer.
The input of a fourth feature training layer is determined by the output of a first feature training layer, the output of a second feature training layer and the output of a third feature training layer, the first feature training layer and the second feature training layer are not adjacent to the fourth feature training layer, and the third feature training layer is adjacent to the fourth feature training layer; the first feature training layer, the second feature training layer and the third feature training layer are all preorder feature training layers of the fourth feature training layer, and the fifth feature training layer is a subsequent feature training layer of the fourth feature training layer.
The input of a fifth feature training layer is determined by the output of a first feature training layer, the output of a second feature training layer, the output of a third feature training layer and the output of a fourth feature training layer, wherein the first feature training layer, the second feature training layer and the third feature training layer are all not adjacent to the fifth feature training layer, and the fourth feature training layer is adjacent to the fifth feature training layer; the first feature training layer, the second feature training layer, the third feature training layer and the fourth feature training layer are all preorder feature training layers of the fifth feature training layer, the fifth feature training layer is the last feature training layer in the network architecture or model, and no subsequent feature training layer is provided.
It is easy to understand that fig. 2 exemplarily shows the fully connected network architecture corresponding to the image feature training method; the method and its steps apply equally to partially connected architectures, i.e. network architectures that are no longer completely "fully connected" as in fig. 2.
The "partial join" of the fourth feature training layer is exemplified as follows:
the input of the fourth feature training layer is determined by the output of the second feature training layer and the output of the third feature training layer together, but is not directly determined by the output of the first feature training layer; or the like, or, alternatively,
the input of the fourth feature training layer is determined by the output of the first feature training layer and the output of the third feature training layer together, and is not directly determined by the output of the second feature training layer.
The "partial join" of the fifth feature training layer is exemplified as follows:
the input of the fifth characteristic training layer is determined by the output of the first characteristic training layer and the output of the fourth characteristic training layer; or the input of the fifth characteristic training layer is determined by the output of the second characteristic training layer and the output of the fourth characteristic training layer; or the input of the fifth characteristic training layer is determined by the output of the third characteristic training layer and the output of the fourth characteristic training layer; or the like, or, alternatively,
the input of the fifth characteristic training layer is determined by the output of the first characteristic training layer, the output of the second characteristic training layer and the output of the fourth characteristic training layer; or the like, or, alternatively,
the input of the fifth characteristic training layer is determined by the output of the first characteristic training layer, the output of the third characteristic training layer and the output of the fourth characteristic training layer; or the like, or, alternatively,
the input of the fifth feature training layer is determined by the output of the second feature training layer, the output of the third feature training layer and the output of the fourth feature training layer.
That the number of subsequent feature training layers is related to the configured number of feature training layers should be understood as follows: the number of subsequent layers is controlled by the configured number of layers. If the number of feature training layers is 5 and the current layer is the third feature training layer, the subsequent layers are the fourth and fifth feature training layers; if the configured number is 7, the subsequent layers are the fourth, fifth, sixth and seventh feature training layers. The subsequent layers of the current feature training layer start from the layer immediately after it and continue in order to the last layer.
Assuming the current feature training layer is the m-th layer of the network architecture and the number of feature training layers is n, its subsequent feature training layers are the (m+1)-th, (m+2)-th, ..., and n-th (last) layers of the network architecture.
Further, determining the input of the third and each subsequent feature training layer from the outputs of at least some non-adjacent preceding feature training layers comprises:
S301, setting the weight of each feature training layer according to its degree of influence on the subsequent feature training layers;
for example, 5 layers of the feature training layer are set, see fig. 3, and so on.
The image characteristic training model comprises a layer 1 characteristic training layer L1Layer 2 feature training layer L2Layer 3 feature training layer L3Layer 4 feature training layer L4And a 5 th feature training layer L5。L1、L2、……、LnEach layer of (a) includes both data inputs and data outputs.
In order to reflect the influence degree of the image features extracted by different feature training layers on subsequent high-order features, a special weight matrix is respectively given to each layer of feature training layer:
L1to L2Is set as a12,L1To L3Is set as a13,L1To L4Is set as a14,L1To L5Is set as a15
L2To L3Is set as a23,L2To L4Is set as a24,L2To L5Is set as a25
L3To L4Is set as a34,L3To L5Is set as a35
L4To L5Is set as a45
Wherein, a12,a13,……,a45Has a value range of [0,1 ]]Including the boundary.
S302, correspondingly assigning the weight of each feature training layer to the output of the corresponding non-adjacent preceding feature training layer to determine the effective output of each feature training layer;
mixing L with1To L3Weight of a13To give L1Output o (L)1) To obtain o (L)1)*a13
Mixing L with1To L4Weight of a14To give L1Output o (L)1) To obtain o (L)1)*a14
Mixing L with1To L5Weight of a15To give L1Output o (L)1) To obtain o (L)1)*a15
Mixing L with2To L4Weight of a24To give L2Output o (L)2) To obtain o (L)2)*a24
Mixing L with2To L5Weight of a25To give L2Output o (L)2) To obtain o (L)2)*a25
Mixing L with3To L5Weight of a35To give L3Output o (L)3) To obtain o (L)3)*a35
In other words,
L3weight element a of13Only from L1Compensation of (2);
L4of the weight matrix element a14From L1Compensation of a24From L2Compensation of (2);
L5of the weight matrix element a15From L1Compensation of a25From L2Compensation of a35From L3Compensation of (2).
Therefore, the temperature of the molten metal is controlled,
mixing L with3Weight element a of13Giving a training layer L to non-adjacent features of a preamble1Output o (L)1) To obtain L1To L3Effective output o (L)1)*a13
Mixing L with4Weight element a of14Correspondingly endowing the preamble non-adjacent characteristic training layer L1Output o (L)1) To obtain L1To L4Effective delivery ofGo out o (L)1)*a14;a24Correspondingly endowing the preamble non-adjacent characteristic training layer L2Output o (L)2) To obtain L2To L4Effective output o (L)2)*a24
Mixing L with5Weight element a of15Correspondingly endowing the preamble non-adjacent characteristic training layer L1Output o (L)1) To obtain L1To L5Effective output o (L)1)*a15;a25Correspondingly endowing the preamble non-adjacent characteristic training layer L2Output o (L)2) To obtain L2To L5Effective output o (L)2)*a25;a35Correspondingly endowing the preamble non-adjacent characteristic training layer L3Output o (L)3) To obtain L3To L5Effective output o (L)3)*a35
S303, adding the effective output of each feature training layer to the current feature training layer together with the output of the previous feature training layer, and updating the input of the current feature training layer according to the addition result.
The addition, i.e. the superposition and fusion of multiple data streams, is understood as a logical summation operation: semantically, all the effective outputs reaching a layer are added together, so that multiple data are superimposed.
The input of L3 after addition is i(L3) = o(L2) + o(L1)*a13;
the input of L4 after addition is i(L4) = o(L3) + o(L1)*a14 + o(L2)*a24;
the input of L5 after addition is i(L5) = o(L4) + o(L1)*a15 + o(L2)*a25 + o(L3)*a35;
where i(L3), i(L4) and i(L5) denote the inputs of L3, L4 and L5 respectively, and so on.
It should be noted that when any of a12, a13, ..., a45 takes the boundary value 0, the feature training layer corresponding to that weight has no influence on the subsequent feature training layer; in that case the input of the subsequent layer is determined by the outputs of only part of the preceding feature training layers. It should be understood that the input of a subsequent feature training layer does not necessarily contain the outputs of all preceding feature training layers. When any of a12, a13, ..., a45 takes the boundary value 1, the feature training layer corresponding to that weight directly influences the subsequent feature training layer.
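Steps S301 to S303 can be sketched as follows; the learnable scalar weights mirror a12, ..., a45 above (indices are 0-based in the code), while the per-layer convolutions and channel counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

class WeightedFusionNet(nn.Module):
    """Sketch of steps S301-S303: layer k receives o(L_{k-1}) plus the weighted
    outputs of all non-adjacent preceding layers. Channel counts and kernel
    sizes are illustrative assumptions."""
    def __init__(self, num_layers=5, channels=8):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            for _ in range(num_layers)])
        # One trainable weight a_jk per non-adjacent pair (j, k); initialised
        # randomly in [0, 1) and trained and updated with the model.
        self.weights = nn.ParameterDict({
            f"a{j}{k}": nn.Parameter(torch.rand(()))
            for k in range(2, num_layers) for j in range(k - 1)})

    def forward(self, feature_map):
        outputs = []
        x = feature_map
        for k, conv in enumerate(self.layers):
            if k >= 2:
                # i(L_k) = o(L_{k-1}) + sum over non-adjacent j of a_jk * o(L_j)
                x = outputs[k - 1] + sum(
                    self.weights[f"a{j}{k}"] * outputs[j] for j in range(k - 1))
            x = torch.relu(conv(x))  # step S4: at least one convolution per layer
            outputs.append(x)
        return outputs[-1]
```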
And S4, performing convolution operation at least once on the input of each feature training layer to determine the output of the feature training layer.
Referring to fig. 2, the numbered boxes represent feature training layers; each layer performs at least one convolution operation on its input and takes the result of the convolution as its output. In deep learning, the convolution operation is mainly used to extract feature information such as texture and shape from an image, and different convolution operations extract different kinds of local feature information. It should be understood that the convolution operation in step S4 not only extracts the feature data in the image; because the input already fuses the weighted outputs of preceding layers, the local feature information lost during transmission through multiple feature training layers is compensated to different degrees, which makes training and prediction more effective.
The output data stream of each feature training layer can be represented by the following formulas:
o(L1) = f(i(L1));
o(L2) = f(i(L2));
and, by analogy,
o(Ln) = f(i(Ln));
where f(·) denotes performing the convolution operation at least once.
The feature training layer can perform a convolution operation and then apply nonlinear processing after it. The nonlinear processing overcomes the limitation of linear classification of images, bringing the model prediction closer to actual scenes, and comprises at least one of ReLU function processing, sigmoid function processing and tanh function processing.
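As a brief sketch of the per-layer mapping o(L) = f(i(L)) with an optional nonlinearity (all sizes are assumptions):

```python
import torch
import torch.nn as nn

def f(x: torch.Tensor, conv: nn.Conv2d, nonlinearity: str = "relu") -> torch.Tensor:
    """One feature training layer: at least one convolution, optionally followed
    by a nonlinearity (ReLU, sigmoid or tanh, as listed above)."""
    x = conv(x)
    activations = {"relu": torch.relu, "sigmoid": torch.sigmoid, "tanh": torch.tanh}
    return activations[nonlinearity](x)

conv = nn.Conv2d(8, 8, kernel_size=3, padding=1)
o_L = f(torch.randn(1, 8, 28, 28), conv)  # o(L) = f(i(L))
```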
Further, the convolution operation step should at least comprise two sub-steps of determining a convolution kernel and a sliding window interval, and determining a convolution result.
S401, determining a convolution kernel and a sliding window interval;
convolution kernels are essential key elements for convolution operations, and are typically odd-order, square-sized, such as 1 pixel by 1 pixel, 3 pixels by 3 pixels, 5 pixels by 5 pixels. Further, the convolution kernel performs a convolution operation using a kernel of 3 pixels by 3 pixels size in the learning image to be characterized in step S1. The convolution kernel participates in operation in a matrix form, and the convolution kernel matrix shares a weight. The convolution kernel matrix is involved in training and learning, so the initial value of the convolution kernel matrix can take any value or an empirical value.
Performing the convolution operation requires determining not only the convolution kernel but also the sliding window interval. The sliding window interval serves as the sliding step of the convolution kernel and makes the kernel's sliding over the input discrete. It should be understood that the larger the sliding window interval, the smaller the training computation but the lower the feature learning precision; the smaller the interval, the larger the computation but the higher the precision. Determining the sliding window interval therefore requires balancing the training computation against the learning precision. Further, a step of 1, 2 or 3 is adopted for the sliding window interval.
S402, determining a convolution result based on the convolution kernel and the sliding window interval;
a convolution result, that is, an output result of the input data of the current Feature training layer, may be determined according to the convolution kernel and the sliding window interval determined in step S401, where the output result is a Feature Map (Feature Map). And performing convolution operation on the input data and the sliding window interval which is determined by the convolution kernel and is the step length, wherein the number of the feature maps is related to the number of image channels and the number of the convolution kernels, and if the number of the image channels is 3 or the number of the convolution kernels is 3, the number of the feature maps is also 3.
Example two
In step S1 described in the first embodiment, receiving the feature map of the image to be feature-trained as the input of the first feature training layer can be further refined to "determining the feature map based on at least a preconfigured convolution kernel and sliding window interval, the feature map containing local features of the image to be feature-trained", with the remaining steps unchanged.
S101-1, determining a convolution kernel and a sliding window interval;
the feature map is determined based on at least preconfigured convolution kernel and sliding window interval parameters, which may be preconfigured with empirical or experimental values.
For image feature learning and prediction in a specific scene or field, the convolution kernel is determined according to the shape attributes of the image features to be learned, in order to improve the final prediction accuracy of the model. It will be appreciated that different convolution kernels are selected for differently shaped image features. For example, ship images often need to be recognized and predicted in military applications, and mechanical parts of a particular shape often need to be trained on and predicted on industrial production lines; for such different application scenarios the convolution kernels should differ.
Further, referring to fig. 4, fig. 4(a) shows an ordinary (non-deformable) 3 × 3 convolution kernel and fig. 4(b) a deformable 3 × 3 convolution kernel. In fig. 4(b), each element of the kernel is offset relative to fig. 4(a), so that the offset kernel adapts better to specific image features. For example, if the image features to be learned are mainly elongated insects, the elements of the deformable kernel can be offset into a similarly elongated shape, making the extraction of those features more efficient and effective.
The convolution kernel is deformable: on the basis of the odd-order, square kernel the order remains unchanged, and the elements of the kernel are offset in position, forming the offset matrix parameters. The offset matrix parameters participate in the image feature learning and training process and are updated and optimized.
The following table shows the prediction errors after training on the same 32 × 32 pixel grey-scale images from the CIFAR, ImageNet and SVHN data sets, when the image feature training method uses a 3 × 3 ordinary convolution kernel, a 3 × 3 deformable convolution kernel, a 5 × 5 ordinary convolution kernel and a 5 × 5 deformable convolution kernel, each with a sliding window interval of 2. It can be seen that the 3 × 3 deformable kernel, determined from the shape of the image features to be learned, predicts far more accurately on the three data sets than the 3 × 3 ordinary square kernel, and the 5 × 5 deformable kernel likewise predicts more accurately than the 5 × 5 ordinary square kernel. It should be appreciated that in certain scenarios the prediction accuracy of a deformable convolution kernel is better than that of an ordinary square convolution kernel.
| Convolution kernel (size, sliding window interval) | CIFAR data set | ImageNet data set | SVHN data set |
| --- | --- | --- | --- |
| 3 × 3 ordinary, 2 | 13.63% | 9.33% | 6.20% |
| 3 × 3 deformable, 2 | 7.80% | 5.96% | 3.59% |
| 5 × 5 ordinary, 2 | 11.35% | 7.48% | 3.27% |
| 5 × 5 deformable, 2 | 6.31% | 3.69% | 1.17% |
Further, the deformable convolution is formed by offsetting several elements of an ordinary square convolution kernel, so the offset operation necessarily forms offset matrix data. To further improve the training effect of the image features and the prediction accuracy, the initial values of the offset matrix data participate in image feature learning and can be trained and updated. It should be appreciated that the trained and updated offset matrix data produce better prediction results than the initial values.
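For reference, torchvision ships a deformable convolution operator in this spirit. Below is a minimal sketch under assumed channel sizes, in which the offsets are produced by a small trainable convolution, one common arrangement; the patent itself only requires that the offset matrix parameters be trainable:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    """Sketch of a 3 x 3 deformable convolution whose offset parameters are
    themselves learned, as described above. Channel counts are assumptions."""
    def __init__(self, in_ch=8, out_ch=8, k=3):
        super().__init__()
        # Two offsets (dy, dx) per kernel element -> 2 * k * k offset channels.
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=1)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=1)

    def forward(self, x):
        offsets = self.offset_conv(x)    # trained and updated with the model
        return self.deform_conv(x, offsets)

y = DeformableBlock()(torch.randn(1, 8, 32, 32))  # -> (1, 8, 32, 32)
```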
Example three
Step S1 described in the first embodiment may be further optimized as follows, and the rest of the steps may be kept unchanged.
S101-2, determining a convolution kernel and a sliding window interval;
the result of determining the sliding window interval in the first and second embodiments is step 1, 2 or 3, which does not consider the image feature location factor, because in most scenes, the probability of the feature to be learned in the image being located at the edge of the image is much lower than that of the non-edge region, and therefore, different sliding window intervals should be determined for the edge region and the non-edge region of the image.
The sliding window interval is determined according to a learning strategy comprising: setting a larger convolution kernel sliding window interval in regions close to the image edge, and a smaller convolution kernel sliding window interval in regions away from the image edge; for example, a step of 3 or 4 in the image edge region and a step of 1 or 2 in the non-edge region. Determining different sliding window intervals with this learning strategy significantly improves the efficiency of image feature learning and greatly reduces the training computation, with hardly any loss of feature learning quality.
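A minimal sketch of this edge-aware strategy (the margin, strides and shapes are all assumptions): the whole image, dominated by its border, is convolved with a large stride, while the interior crop is convolved with a small stride.

```python
import torch
import torch.nn.functional as F

def edge_aware_conv(x: torch.Tensor, weight: torch.Tensor, margin: int = 4):
    """Sketch of the learning strategy above: a coarse pass (large stride)
    covers the edge region cheaply, while a fine pass (small stride) covers
    the interior. Margin and strides are assumptions."""
    coarse = F.conv2d(x, weight, stride=3, padding=1)        # edge region: large interval
    interior = x[..., margin:-margin, margin:-margin]
    fine = F.conv2d(interior, weight, stride=1, padding=1)   # non-edge region: small interval
    return coarse, fine

weight = torch.randn(8, 1, 3, 3)  # 8 output channels, 1 input channel
coarse, fine = edge_aware_conv(torch.randn(1, 1, 32, 32), weight)
```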
Example four
The deformable convolution kernel described in Example two is applied together with the learning-strategy-based determination of the sliding window interval described in Example three, to further optimize step S1 of Example one.
S101-3, determining a convolution kernel and a sliding window interval;
the prime convolution kernel adopts a deformable convolution kernel comprising offset matrix parameters, and the offset matrix parameters participate in the image characteristic learning and can be trained and updated; the sliding window interval is determined according to a learning strategy, and a larger convolution kernel sliding window interval is arranged in an area close to the edge of the image; a smaller convolution kernel sliding window interval is provided away from the image edge region, thereby achieving an advantageous effect over the second or third embodiment alone.
The following table shows the prediction errors after training on the same 32 × 32 pixel grey-scale images from the CIFAR, ImageNet and SVHN data sets when the image feature training method uses a 3 × 3 deformable convolution kernel and determines the sliding window interval according to the learning strategy, compared with the errors of Examples one, two and three. The prediction of Example four is clearly better than that of Examples one, two and three.
| Convolution kernel (size, sliding window interval) | CIFAR data set | ImageNet data set | SVHN data set |
| --- | --- | --- | --- |
| 3 × 3 ordinary, 2 | 13.63% | 9.33% | 6.20% |
| 3 × 3 deformable, 2 | 7.80% | 5.96% | 3.59% |
| 3 × 3 ordinary, learning-strategy interval | 11.98% | 9.02% | 5.44% |
| 3 × 3 deformable, learning-strategy interval | 5.97% | 3.55% | 1.12% |
Example five
Fig. 5 is a block diagram of an image feature training apparatus according to an embodiment of the present invention. The apparatus includes: a feature map receiving module 510, a layer number setting module 520, an input determining module 530, and an output determining module 540.
The feature map receiving module 510 is configured to receive a feature map of an image to be feature-trained as the input of the first feature training layer; the layer number setting module 520 is configured to set the number of feature training layers; the input determining module 530 is configured to determine the inputs of the third and subsequent feature training layers at least from the outputs of some preceding feature training layers, the number of subsequent feature training layers being associated with the layer number setting module; and the output determining module 540 is configured to perform at least one convolution operation on the input of each feature training layer to determine the output of that layer.
Further, the feature map received by the feature map receiving module 510 is determined based on at least a preconfigured convolution kernel and sliding window interval, and contains local features of the image to be feature-trained.
Further, the convolution kernel is a deformable convolution kernel that includes offset matrix parameters that participate in image feature learning and can be trained and updated.
Further, the number of layers of the set feature training layer is an integer greater than or equal to 2 and less than or equal to 15, so that the effect of the image feature training is improved.
Further, when the number of feature training layers is set to 5, the image feature training and prediction effect is the best among methods of the same type.
Further, the input determining module 530 includes: a weight setting module 5301, configured to set the weight of each feature training layer according to its degree of influence on the subsequent feature training layers; an effective output determining module 5302, configured to assign the weights of the feature training layers to the outputs of the corresponding non-adjacent preceding feature training layers to determine the effective outputs of those layers to the current feature training layer; and an input adding module 5303, configured to add the effective outputs to the output of the previous feature training layer and update the input of the current feature training layer with the addition result.
Further, the effective outputs are added by an adder.
Example six
Fig. 6 is a structural block diagram of an image feature training model according to an embodiment of the present invention, and fig. 7 is a schematic block diagram of its principle. At least one group of convolution and pooling operations is performed in sequence on the output of the last feature training layer determined by the image feature training method, giving the image feature training model. Referring to fig. 7, assuming the last feature training layer is Ln, the output of Ln is subjected in sequence to one, two or more groups of convolution and pooling operations to construct the image feature training model.
It should be understood that, on the basis of the image feature training method formed by steps S1 to S4 (including all sub-steps of each step), the image feature training model can be formed by connecting the output of the last feature training layer to convolutional and pooling layers in sequence. The image feature training model can be used in supervised image learning, such as image classification and image detection. The attached convolutional layers extract high-order features of the image, and the pooling layers compress and reduce the computation of the subsequent fully-connected and output layers of the neural network, reducing redundancy and overfitting and finally improving the prediction effect.
Further, fig. 7 covers not only the operation of a single group of convolutional and pooling layers but also scenarios with multiple groups; this design learns the image features better, helps to further learn the high-order features of the image, and yields higher prediction accuracy.
It should be understood that "at least one group of convolution and pooling operations" requires that group to be executed in sequence, after which several further convolution operations, several pooling operations, or several convolution and pooling operations may follow. That is, for example, the following are all possible and effective (see the sketch after this list):
performing convolution and pooling operations on o(Ln) in sequence;
performing convolution, pooling, convolution and pooling operations on o(Ln) in sequence;
performing convolution, pooling, convolution and convolution operations on o(Ln) in sequence;
performing convolution, pooling, convolution, pooling and pooling operations on o(Ln) in sequence;
and so on.
Example seven
A computer storage medium stores computer readable instructions which, when executed by a computer, cause the computer to perform the image feature training method. Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks or optical disks.
The above embodiments merely illustrate the technical ideas and features of the present invention, and their purpose is to enable those skilled in the art to understand and implement the invention. The present invention is not limited to these embodiments: all equivalent changes or modifications made according to the spirit of the present invention remain within its scope.

Claims (10)

1. An image feature training method, characterized in that the method comprises:
receiving a feature map of an image to be feature-trained as input of a first feature training layer;
setting the number of layers of the characteristic training layer;
determining the input of the third feature training layer and of each subsequent feature training layer at least from the outputs of part of the non-adjacent preceding feature training layers, wherein the number of subsequent feature training layers is related to the set number of layers of the feature training layers;
performing at least one convolution operation on the input of each of the feature training layers to determine the output of the feature training layer.
2. The image feature training method of claim 1, wherein the feature map is determined based on at least a preconfigured convolution kernel and a sliding window interval, and the feature map contains local features of the image to be feature-trained.
3. The image feature training method according to claim 2, wherein the type of the convolution kernel is determined at least according to the feature shape of the image to be feature-trained.
4. The image feature training method according to claim 2, wherein the convolution kernel is a deformable convolution kernel including offset matrix parameters, which participate in image feature learning and can be trained and updated.
5. The image feature training method of claim 1, wherein the set number of layers of the feature training layers is an integer greater than or equal to 3 and less than or equal to 15, so as to improve the image feature training effect.
6. The image feature training method of claim 1, wherein the determining of the input of the third feature training layer and each subsequent feature training layer from at least the output of a partial non-adjacent preceding feature training layer comprises:
setting weights of the feature training layers according to the influence degree of the feature training layers on the subsequent feature training layers;
correspondingly assigning the weight of each feature training layer to the output of each non-adjacent preceding feature training layer to determine the effective output of each feature training layer to the current feature training layer;
and adding the effective output of each feature training layer to the current feature training layer to the output of the previous feature training layer, and updating the input of the current feature training layer with the addition result.
7. An image feature training apparatus, characterized in that the apparatus comprises:
the characteristic diagram receiving module is used for receiving a characteristic diagram of an image to be subjected to characteristic training as input of a first characteristic training layer;
the layer number setting module is used for setting the layer number of the characteristic training layer;
an input determining module, configured to determine the input of the third feature training layer and of each subsequent feature training layer at least from the outputs of part of the non-adjacent preceding feature training layers, the number of subsequent feature training layers being related to the number of layers set by the layer number setting module;
and an output determining module, configured to perform at least one convolution operation on the input of each feature training layer to determine the output of that feature training layer.
The feature map is determined based on at least a preset convolution kernel and a sliding window interval, and comprises local features of the image to be feature-trained.
The convolution kernel is a deformable convolution kernel that includes offset matrix parameters that participate in image feature learning and can be trained and updated.
The set number of layers of the feature training layers is an integer greater than or equal to 2 and less than or equal to 15, so as to improve the image feature training effect.
8. The image feature training device of claim 7, wherein the input determination module comprises:
the weight setting module is used for setting the weight of each characteristic training layer according to the influence degree of each characteristic training layer on each subsequent characteristic training layer;
an effective output determining module, configured to correspondingly assign weights of the feature training layers to outputs of pre-order non-adjacent feature training layers to determine effective outputs of the feature training layers to a current feature training layer;
and an input adding module, configured to add the effective output of each feature training layer to the current feature training layer to the output of the previous feature training layer, and to update the input of the current feature training layer with the addition result.
9. An image feature training model, characterized in that at least one set of convolution and pooling operations is performed on the output of the last feature training layer determined by the image feature training method according to any one of claims 1 to 6 in sequence to obtain the image feature training model.
10. A computer storage medium storing computer readable instructions which, when executed by a computer, cause the computer to perform the image feature training method of any one of claims 1 to 6.
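As a hedged illustration of the deformable convolution kernel recited in claim 4, the sketch below uses torchvision's DeformConv2d operator; producing the offset matrix with an auxiliary trainable convolution is a common pattern and an assumption here, not a detail taken from the claims:

    import torch
    import torch.nn as nn
    from torchvision.ops import DeformConv2d

    class DeformableBlock(nn.Module):
        """3x3 deformable convolution whose offset matrix is produced by a
        trainable convolution, so the offsets participate in image feature
        learning and are updated by backpropagation."""

        def __init__(self, in_ch, out_ch, k=3):
            super().__init__()
            # Two offsets (dx, dy) per kernel position.
            self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
            self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

        def forward(self, x):
            offsets = self.offset_conv(x)        # trainable offset matrix
            return self.deform_conv(x, offsets)  # sampling follows the offsets

    # e.g. a 16-channel feature map mapped to 32 channels at the same resolution:
    y = DeformableBlock(16, 32)(torch.randn(1, 16, 32, 32))  # shape (1, 32, 32, 32)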
CN202010107584.8A 2020-02-21 2020-02-21 Image feature training method, model, device and computer storage medium Pending CN111340088A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010107584.8A CN111340088A (en) 2020-02-21 2020-02-21 Image feature training method, model, device and computer storage medium

Publications (1)

Publication Number Publication Date
CN111340088A true CN111340088A (en) 2020-06-26

Family

ID=71185436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010107584.8A Pending CN111340088A (en) 2020-02-21 2020-02-21 Image feature training method, model, device and computer storage medium

Country Status (1)

Country Link
CN (1) CN111340088A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112085088A (en) * 2020-09-03 2020-12-15 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN112906701A (en) * 2021-02-08 2021-06-04 重庆兆光科技股份有限公司 Fine-grained image identification method and system based on multi-attention neural network
CN112906701B (en) * 2021-02-08 2023-07-14 重庆兆光科技股份有限公司 Fine-granularity image recognition method and system based on multi-attention neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination