CN113887447A - Training method of object classification model, object classification prediction method and device - Google Patents


Info

Publication number
CN113887447A
Authority
CN
China
Prior art keywords
training
classification
inputting
image
network
Prior art date
Legal status
Pending
Application number
CN202111173528.5A
Other languages
Chinese (zh)
Inventor
王洪昌
鉴海防
鲁华祥
Current Assignee
Institute of Semiconductors of CAS
Original Assignee
Institute of Semiconductors of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Semiconductors of CAS filed Critical Institute of Semiconductors of CAS
Priority to CN202111173528.5A priority Critical patent/CN113887447A/en
Publication of CN113887447A publication Critical patent/CN113887447A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods


Abstract

The embodiment of the disclosure provides a training method of an object classification model, an object classification prediction method and a device, wherein the training method of the object classification model comprises the following steps: acquiring a training sample data set, wherein training samples in the training sample data set comprise training images and label data of the training images, the training images represent multiple postures of an object, the object comprises an animal, and the label data comprises real posture classification information aiming at the postures; inputting the training image into a deep neural network model, and outputting a recognition result, wherein the recognition result comprises predicted posture classification information; calculating a loss function according to the recognition result and the label data to obtain a loss result; and iteratively adjusting network parameters of the deep neural network model according to the loss result to generate a trained object classification model.

Description

Training method of object classification model, object classification prediction method and device
Technical Field
The present disclosure relates to the technical field of artificial intelligence and ecological protection, and more particularly, to a method and an apparatus for training an object classification model, an object classification prediction method, an apparatus for training an object classification model, an object classification prediction apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
In the ecological field, different animals have different sensitivities to their environment. Animals with high environmental sensitivity, such as birds, can therefore be used as indicator species of the ecological environment.
In implementing the disclosed concept, the inventors found that there are at least the following problems in the related art: statistics on the population and postures of birds and other animals still rely on manual counting methods with high labor costs.
Disclosure of Invention
In view of the above, embodiments of the present disclosure provide a training method for an object classification model, an object classification prediction method, a training apparatus for an object classification model, an object classification prediction apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
One aspect of the embodiments of the present disclosure provides a method for training an object classification model, including:
acquiring a training sample data set, wherein training samples in the training sample data set comprise training images and label data of the training images, the training images represent multiple postures of an object, the object comprises an animal, and the label data comprises real posture classification information aiming at the postures;
inputting the training image into a deep neural network model, and outputting a recognition result, wherein the recognition result comprises predicted posture classification information;
calculating a loss function according to the recognition result and the label data to obtain a loss result; and
iteratively adjusting network parameters of the deep neural network model according to the loss result to generate a trained object classification model.
According to an embodiment of the present disclosure, the object includes a plurality of kinds; the tag data further includes real category classification information generated according to the kind of the object; and the recognition result further includes predicted category classification information.
According to an embodiment of the present disclosure, the deep neural network model includes a first feature extraction network, a first object class classification network, and a first object posture classification network;
the first object category classification network and the first object posture classification network each comprise a first convolution layer, a maximum value pooling layer, a second convolution layer, an average value pooling layer and a fully connected layer which are sequentially cascaded.
According to an embodiment of the present disclosure, the inputting the training image into the deep neural network model and outputting the recognition result includes:
inputting the training image into the first feature extraction network, and outputting a first feature map;
inputting the first feature map into the first convolution layer, and outputting a second feature map;
inputting the second feature map into the maximum value pooling layer, and outputting a compressed second feature map;
inputting the compressed second feature map into the second convolution layer, and outputting a third feature map;
inputting the third feature map into the average value pooling layer, and outputting a compressed third feature map;
and inputting the compressed third feature map into the fully connected layer, and outputting the predicted posture classification information or the predicted category classification information.
According to an embodiment of the present disclosure, the label data further includes a real density map generated according to the object, and the recognition result further includes a predicted density map, where the real density map represents a real aggregation degree of the object at a plurality of positions in the training image, and the predicted density map represents a predicted aggregation degree of the object at a plurality of positions in the training image.
According to an embodiment of the present disclosure, the deep neural network model includes a second feature extraction network, a hole (dilated) convolution layer, a density map generation network, and a second object posture classification network;
wherein, the above-mentioned inputting the above-mentioned training image into the deep neural network model, the output recognition result includes:
inputting the training image into the second feature extraction network, and outputting a fourth feature map;
inputting the fourth feature map into the second object posture classification network, and outputting the predicted posture classification information;
inputting the fourth feature map into the hole convolution layer, and outputting a fifth feature map;
and inputting the fifth feature map into the density map generation network, and outputting a predicted density map.
According to an embodiment of the present disclosure, the training method of the object classification model further includes:
acquiring a plurality of initial training images;
under the condition that the pixel size of the initial training image does not meet a preset condition, adjusting the pixel size of the initial training image to obtain the training image;
labeling the training image to obtain the label data; and
and generating the training sample data set according to a plurality of training images and the label data corresponding to the training images.
Another aspect of the embodiments of the present disclosure provides an object classification prediction method, including:
acquiring an image to be recognized, wherein the image to be recognized represents a plurality of postures of an object, and the object comprises an animal; and
inputting the image to be recognized into an object classification model to obtain a recognition result, wherein the recognition result comprises target posture classification information;
the object classification model is obtained by training with the method described above.
According to an embodiment of the present disclosure, the object includes a plurality of kinds, and the animal includes a bird; the recognition result further comprises target category classification information and/or a target density map, wherein the target density map represents the aggregation degree of the objects at different positions in the image to be recognized;
wherein, the method further comprises:
and generating target quantity information of the objects according to the target density map.
Another aspect of the embodiments of the present disclosure provides a training apparatus for an object classification model, including:
a first obtaining module, configured to obtain a training sample data set, where a training sample in the training sample data set includes a training image and tag data of the training image, where the training image represents multiple postures of an object, the object includes an animal, and the tag data includes real posture classification information for the postures;
the output module is used for inputting the training image into a deep neural network model and outputting a recognition result, wherein the recognition result comprises predicted posture classification information;
the calculation module is used for calculating a loss function according to the recognition result and the label data to obtain a loss result; and
and the iterative training module is used for iteratively adjusting the network parameters of the deep neural network model according to the loss result to generate a trained object classification model.
Another aspect of the embodiments of the present disclosure provides an object classification prediction apparatus, including:
the second acquisition module is used for acquiring an image to be recognized, wherein the image to be recognized represents a plurality of postures of an object, and the object comprises an animal; and
the prediction module is used for inputting the image to be recognized into an object classification model to obtain a recognition result, wherein the recognition result comprises target posture classification information;
the object classification model is obtained by training with the method described above.
Another aspect of an embodiment of the present disclosure provides an electronic device including: one or more processors; memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as described above.
Another aspect of embodiments of the present disclosure provides a computer-readable storage medium storing computer-executable instructions for implementing the method as described above when executed.
Another aspect of an embodiment of the present disclosure provides a computer program product comprising computer executable instructions for implementing the method as described above when executed.
According to the embodiment of the disclosure, by training the deep neural network with training images that include the postures of objects such as animals, an object classification model capable of predicting the postures of the objects can be obtained, so that the postures of objects such as animals can be predicted by using the object classification model. This at least partially overcomes the technical problem that statistics on the population postures of animals such as birds rely on manual counting methods with high labor costs, and achieves the technical effect of reducing the cost of collecting posture statistics for objects such as animals.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:
fig. 1 schematically illustrates an exemplary system architecture of a training method or an object classification prediction method applying an object classification model according to an embodiment of the present disclosure;
FIG. 2 schematically shows a flow chart of a method of training an object classification model according to an embodiment of the present disclosure;
FIG. 3 schematically shows a flow diagram for generating a training sample data set according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a method diagram of a method of training an object classification model according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of an object classification prediction method according to an embodiment of the present disclosure;
FIG. 6 schematically shows a block diagram of a training apparatus of an object classification model according to an embodiment of the present disclosure;
fig. 7 schematically shows a block diagram of an object classification prediction apparatus according to an embodiment of the present disclosure; and
fig. 8 schematically illustrates a block diagram of an electronic device implementing a training method or an object classification prediction method of an object classification model according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
Birds are often used as an indicator species of the ecological environment because of their high environmental sensitivity. In addition, when a large amount of bird population count data is available, the locations where bird populations are distributed can be roughly determined, which supports the drawing up of bird monitoring plans. Bird population counting is therefore of great significance to ecological environment protection.
However, bird populations are still counted manually using traditional statistical methods with high labor costs.
Although many researchers have proposed deep neural network-based methods for crowd counting, these methods do not transfer directly to bird population counting. In the crowd counting task, such algorithms mainly extract effective information from people's heads, so a person's posture has little influence on the density information that is finally extracted. The head of a bird, however, is only a small part of its body, and posture has a great influence on the bird's morphological characteristics. In addition, the morphological characteristics of different species vary greatly, so crowd counting algorithms are difficult to adapt to the task of bird population density estimation.
In view of this, the inventors found that a deep neural network model can be trained with a training sample data set covering multiple postures of an object such as a bird to obtain a trained object classification model, so that the posture of an object such as a bird can be predicted by the object classification model, avoiding the waste of manpower and financial resources caused by manual counting.
Embodiments of the present disclosure provide a training method of an object classification model, an object classification prediction method, a training apparatus of an object classification model, an object classification prediction apparatus, an electronic device, a computer-readable storage medium, and a computer program product. The training method of the object classification model comprises the steps of obtaining a training sample data set, wherein training samples in the training sample data set comprise training images and label data of the training images, the training images represent multiple postures of an object, the object comprises an animal, and the label data comprise real posture classification information aiming at the postures; inputting the training image into a deep neural network model, and outputting a recognition result, wherein the recognition result comprises predicted posture classification information; calculating a loss function according to the recognition result and the label data to obtain a loss result; and iteratively adjusting network parameters of the deep neural network model according to the loss result to generate a trained object classification model.
Fig. 1 schematically illustrates an exemplary system architecture 100 to which a training method of an object classification model or an object classification prediction method may be applied, according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired and/or wireless communication links, and so forth.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 over the network 104 to receive or transmit data or the like, the data comprising a set of training sample data or an initial training image. Various client applications may be installed on the terminal devices 101, 102, 103, such as an object classification application, a web browser application, a search class application, an instant messaging tool, a mailbox client, and/or social platform software, etc. (by way of example only).
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (for example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and otherwise process the received data such as the user request, and feed back a processing result (for example, an object classification model or target posture classification information generated from the user data) to the terminal device.
It should be noted that the training method of the object classification model or the object classification prediction method provided by the embodiment of the present disclosure may be generally executed by the server 105. Accordingly, the training device or the object classification predicting device of the object classification model provided by the embodiment of the present disclosure may be generally disposed in the server 105. The training method or the object classification prediction method of the object classification model provided in the embodiment of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and can communicate with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the training device or the object classification prediction device of the object classification model provided in the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Alternatively, the training method or the object classification prediction method of the object classification model provided in the embodiment of the present disclosure may also be executed by the terminal device 101, 102, or 103, or may also be executed by another terminal device different from the terminal device 101, 102, or 103. Accordingly, the training apparatus or the object classification predicting apparatus of the object classification model provided in the embodiment of the present disclosure may also be disposed in the terminal device 101, 102, or 103, or in another terminal device different from the terminal device 101, 102, or 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Fig. 2 schematically shows a flow chart of a method of training an object classification model according to an embodiment of the present disclosure.
As shown in fig. 2, the training method of the object classification model may include operations S210 to S240.
In operation S210, a training sample data set is obtained, where training samples in the training sample data set include training images and tag data of the training images, the training images represent a plurality of poses of an object, the object may include an animal, and the tag data includes real pose classification information for the poses.
In operation S220, the training image is input to the deep neural network model, and a recognition result including predicted gesture classification information is output.
In operation S230, a loss function is calculated according to the recognition result and the tag data, resulting in a loss result.
In operation S240, network parameters of the deep neural network model are iteratively adjusted according to the loss result, generating a trained object classification model.
According to embodiments of the present disclosure, the object may also include a plant, for example, the plant may include a tree or a flower, or the like.
According to embodiments of the present disclosure, animals may include mammals, fish, birds, amphibians, insects, and the like. For ease of description, the following examples are illustrated with birds.
According to embodiments of the present disclosure, the pose of the bird in the training image may include a standing front view, a standing side view, a flight front view, a flight side view, and a flight bottom view.
According to an embodiment of the present disclosure, the tag data may be stored in the form of a table file. The content of the table file may include the real posture classification information and the corresponding position coordinates, and the real posture classification information may include number information that numbers the different postures described above.
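As an illustration only, such an annotation table could be represented as follows; the field names and posture numbering are assumptions for this sketch, since the disclosure does not fix an exact file layout:

```python
# Hypothetical layout of the table-file content described above (one entry per labeled bird).
# Field names and numbering are illustrative assumptions, not taken from the disclosure.
annotation_rows = [
    # x, y: position coordinate of the annotation dot in the training image
    # pose_id: 0 = standing front view, 1 = standing side view, 2 = flight front view,
    #          3 = flight side view, 4 = flight bottom view
    # species_id: number of the bird species (when species labels are also used)
    {"x": 512, "y": 233, "pose_id": 1, "species_id": 4},
    {"x": 87, "y": 640, "pose_id": 3, "species_id": 4},
]
```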
According to the embodiment of the disclosure, the feature extraction part of the deep neural network model may adopt a Convolutional Neural Network (CNN); for example, a VGG-16 network with its last three convolution layers and all fully connected layers removed may be used, with the weight parameters of the VGG-16 network pre-trained.
The deep neural network of the present embodiment is not limited to a convolutional neural network such as a VGG-16 network, and may be another type of neural network. The disclosed embodiments are not so limited.
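A minimal sketch of such a feature extraction front end, assuming PyTorch and torchvision are available: dropping the last three convolution layers and all fully connected layers of VGG-16 leaves the first ten convolution layers, which corresponds to `features[:23]` in torchvision's layer indexing. Note that torchvision's VGG-16 uses 3 × 3 kernels throughout, whereas the structure described later uses 1 × 1 kernels in two places, so this is an approximation rather than the exact patented network:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

def build_frontend() -> nn.Sequential:
    """Feature extraction front end: pretrained VGG-16 with the last three
    convolution layers and all fully connected layers removed (an approximation
    of the network described in this disclosure)."""
    # On torchvision >= 0.13; older versions use vgg16(pretrained=True) instead.
    features = vgg16(weights="IMAGENET1K_V1").features
    # Keep the first 10 convolution layers and 3 max-pooling layers (indices 0..22).
    return nn.Sequential(*list(features.children())[:23])

frontend = build_frontend()
x = torch.randn(1, 3, 768, 1024)       # a 1024 x 768 training image, (N, C, H, W)
first_feature_map = frontend(x)        # -> torch.Size([1, 512, 96, 128]), i.e. 128 x 96 x 512
```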
According to the embodiment of the disclosure, the acquired training image is input into a deep neural network model, the predicted pose classification information is output, a loss function is calculated according to the predicted pose classification information and the label data, and the network parameters of the deep neural network model are iteratively adjusted according to a loss result. In case the loss result converges, a trained object classification model may be obtained.
According to the embodiment of the disclosure, by training the deep neural network with training images that include the postures of objects such as animals, an object classification model capable of predicting the postures of the objects can be obtained, so that the postures of objects such as animals can be predicted by using the object classification model. This at least partially overcomes the technical problem that statistics on the population postures of animals such as birds rely on manual counting methods with high labor costs, and achieves the technical effect of reducing the cost of collecting posture statistics for objects such as animals.
According to an embodiment of the present disclosure, the object includes a plurality of kinds. The tag data further includes real category classification information generated according to the kind of the object. The recognition result further includes predicted category classification information.
According to an embodiment of the present disclosure, the category classification information may include the category of the bird; for example, birds may be grouped into swimming birds, wading birds, roosting birds, songbirds, climbing birds, and the like. Roughly 220 or more bird species are involved, which will not be listed in detail here.
According to an embodiment of the present disclosure, the deep neural network model includes a first feature extraction network, a first object class classification network, and a first object pose classification network.
The first object type classification network and the first object posture classification network respectively comprise a first convolution layer, a maximum value pooling layer, a second convolution layer, an average value pooling layer and a full-connection layer which are sequentially cascaded.
According to an embodiment of the present disclosure, the first object class classification network and the first object pose classification network are respectively connected with an output of the first feature extraction network.
According to an embodiment of the present disclosure, the specific structure of the first feature extraction network may include: two convolution layers with 3 × 3 kernels and 64 filters, followed by a max-pooling layer; two convolution layers with 3 × 3 kernels and 128 filters, followed by a max-pooling layer; three convolution layers with 256 filters, the first two with 3 × 3 kernels and the last with 1 × 1 kernels, followed by a max-pooling layer; and three convolution layers with 512 filters, the first two with 3 × 3 kernels and the last with 1 × 1 kernels.
According to an embodiment of the present disclosure, the first object category classification network and the first object posture classification network serve to recognize, in the training image, the posture of the bird and the category to which it belongs, which is a coarse classification.
According to the embodiment of the disclosure, in order to improve the recognition speed of the object classification model and reduce the number of network layers in it, the specific structure of the two classification networks may include: a first convolution layer with 3 × 3 kernels and 512 filters; one or more max-pooling layers, which rapidly reduce the amount of data; a second convolution layer with 3 × 3 kernels and 512 filters, with strides of, for example, 1 for the first convolution layer and 2 for the second, so that the data are further compressed for rapid dimensionality reduction; and one or more average-pooling layers and at least two fully connected layers, which perform the classification into the standing front view, standing side view, flight front view, flight side view and flight bottom view of the birds.
It should be noted that the specific parameters of each layer in the object classification model are not intended to limit the scope of the present disclosure; they are merely examples given to illustrate the technical solutions of the present application, and may be set according to actual requirements.
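With those caveats, the following is a minimal PyTorch sketch of one such classification branch (posture or category). The layer order follows the cascade recited above (first convolution, max pooling, second convolution, average pooling, fully connected layers); the exact strides, pooling sizes and hidden width are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ClassificationBranch(nn.Module):
    """One classification branch (posture or category) following the cascade
    described above; strides and widths are assumptions for this sketch."""
    def __init__(self, num_classes: int, in_channels: int = 512):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels, 512, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
        )
        self.max_pool = nn.MaxPool2d(kernel_size=2, stride=2)   # quickly shrinks the feature map
        self.conv2 = nn.Sequential(
            nn.Conv2d(512, 512, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.avg_pool = nn.AdaptiveAvgPool2d(1)                  # global average pooling
        self.fc = nn.Sequential(                                  # at least two fully connected layers
            nn.Flatten(),
            nn.Linear(512, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, num_classes),
        )

    def forward(self, first_feature_map: torch.Tensor) -> torch.Tensor:
        x = self.conv1(first_feature_map)   # second feature map
        x = self.max_pool(x)                # compressed second feature map
        x = self.conv2(x)                   # third feature map
        x = self.avg_pool(x)                # compressed third feature map
        return self.fc(x)                   # predicted posture / category logits

pose_branch = ClassificationBranch(num_classes=5)      # the five postures listed above
category_branch = ClassificationBranch(num_classes=13) # e.g. 13 bird categories
logits = pose_branch(torch.randn(1, 512, 96, 128))     # -> torch.Size([1, 5])
```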
According to the embodiment of the disclosure, inputting the training image into the deep neural network model and outputting the recognition result may include the following operations.
The training image is input into the first feature extraction network to output a first feature map. The first feature map is input into the first convolution layer to output a second feature map. The second feature map is input into the maximum value pooling layer to output a compressed second feature map. The compressed second feature map is input into the second convolution layer to output a third feature map. The third feature map is input into the average value pooling layer to output a compressed third feature map. The compressed third feature map is input into the fully connected layer to output the predicted posture classification information or the predicted category classification information.
According to an embodiment of the present disclosure, the size of the first feature map may be 128 × 96 × 512, measured in pixels. The sizes in the following examples are also in pixels and will not be noted separately.
According to an embodiment of the present disclosure, the data size of the compressed second feature map may be one quarter of that of the second feature map, and the data size of the third feature map may be one half of that of the compressed second feature map.
According to an embodiment of the present disclosure, the output of the fully connected layer in the first object category classification network may have a plurality of channels, the number of which corresponds to the number of bird categories; for example, 13 channels correspond to 13 bird categories.
It should be noted that the sizes of 128 × 96 × 512, 128 × 96 × 1 and the like mentioned in the embodiments of the present disclosure are not intended to limit the scope of the present disclosure; they are merely examples given to illustrate the technical solutions of the present application, and the specific values may be set according to requirements.
According to an embodiment of the disclosure, the tag data further comprises a true density map generated from the object, the recognition result further comprises a predicted density map characterizing a true degree of clustering of the object at a plurality of locations in the training image, the predicted density map characterizing a predicted degree of clustering of the object at a plurality of locations in the training image.
According to an embodiment of the present disclosure, the deep neural network model includes a second feature extraction network, a hole convolution layer, a density map generation network, and a second object pose classification network.
Inputting the training image into the deep neural network model and outputting a recognition result, which may include the following operations.
The training image is input into the second feature extraction network to output a fourth feature map. The fourth feature map is input into the second object posture classification network to output the predicted posture classification information. The fourth feature map is also input into the hole convolution layer to output a fifth feature map. The fifth feature map is input into the density map generation network to output a predicted density map.
According to the embodiment of the present disclosure, the network structure and function of the second feature extraction network are the same as those of the first feature extraction network, and the network structure and function of the second object posture classification network are the same as those of the first object posture classification network, which are not described herein again.
According to embodiments of the present disclosure, a density map may be used to represent the number and spatial distribution of birds. The density map generation network refers to an algorithmic structure built with a convolutional neural network for generating such a density map.
According to the embodiment of the disclosure, in order to increase the contextual correlation between pixels in the fourth feature map and to allow the network to extract more density information, the fourth feature map may be processed by a plurality of hole (dilated) convolution layers, preferably six. Each hole convolution layer pads the input during convolution so that the output fifth feature map has the same size as the fourth feature map.
According to the embodiment of the present disclosure, when six hole convolution layers are used, the first three may each have 512 filters and the last three may have 256, 128 and 64 filters respectively, which mainly serves to rapidly reduce the amount of data.
According to the embodiment of the present disclosure, a third convolution layer with 3 × 3 kernels and a single filter may further be placed between the hole convolution layers and the density map generation network, so that the predicted density map output by the density map generation network has a preset size, for example 128 × 96 × 1.
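A minimal PyTorch sketch of this density branch, folding the six hole (dilated) convolution layers and the single-filter 3 × 3 convolution into one module for illustration. The dilation rate of 2 and the ReLU activations are assumptions; the disclosure only specifies the filter counts and that padding keeps the spatial size unchanged:

```python
import torch
import torch.nn as nn

def dilated_block(in_ch: int, out_ch: int, dilation: int = 2) -> nn.Sequential:
    # padding == dilation keeps the spatial size of a 3x3 dilated convolution unchanged
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=dilation, dilation=dilation),
        nn.ReLU(inplace=True),
    )

class DensityBranch(nn.Module):
    """Six hole (dilated) convolution layers followed by a 3x3 convolution with a
    single filter, producing a one-channel predicted density map."""
    def __init__(self, in_channels: int = 512):
        super().__init__()
        self.dilated = nn.Sequential(
            dilated_block(in_channels, 512),
            dilated_block(512, 512),
            dilated_block(512, 512),      # first three layers: 512 filters each
            dilated_block(512, 256),
            dilated_block(256, 128),
            dilated_block(128, 64),       # last three layers: 256, 128, 64 filters
        )
        self.to_density = nn.Conv2d(64, 1, kernel_size=3, padding=1)  # single-channel output

    def forward(self, fourth_feature_map: torch.Tensor) -> torch.Tensor:
        fifth_feature_map = self.dilated(fourth_feature_map)          # same spatial size
        return self.to_density(fifth_feature_map)

density_branch = DensityBranch()
pred_density = density_branch(torch.randn(1, 512, 96, 128))  # -> torch.Size([1, 1, 96, 128]), i.e. 128 x 96 x 1
```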
According to an embodiment of the present disclosure, the density map generation network measures the difference between the estimated density map and the true density map using the Euclidean distance. The loss function of the density map generation network is shown in the following equation.
L(θ) = (1/(2N)) × Σ_{i=1}^{N} ‖F(X_i; θ) − F_i‖²
where θ represents the set of learnable parameters in the density map generation network; N represents the number of training images; X_i represents an input training image; F_i represents the true density map of the training image X_i; F(X_i; θ) represents the predicted density map that the density map generation network, parameterized by θ, generates for the training image X_i; and L(θ) represents the loss value between the predicted density map and the true density map.
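This is the usual pixel-wise Euclidean (squared L2) loss for density estimation; a minimal sketch over a batch of N training images:

```python
import torch

def density_loss(pred_density: torch.Tensor, true_density: torch.Tensor) -> torch.Tensor:
    """L(theta) = 1/(2N) * sum_i ||F(X_i; theta) - F_i||_2^2 for a batch of N density maps."""
    n = pred_density.shape[0]
    per_image = (pred_density - true_density).flatten(start_dim=1).pow(2).sum(dim=1)
    return per_image.sum() / (2 * n)
```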
According to the embodiment of the present disclosure, when the first object category classification network, the first object posture classification network and the density map generation network belong to the same object classification model, each of them is connected to the feature extraction network of that model; the feature extraction network of the object classification model may be, for example, the first feature extraction network or the second feature extraction network.
According to an embodiment of the present disclosure, the first object posture classification network and the first object category classification network output the multiple postures and viewing angles of the birds and the multiple bird categories, respectively. The loss functions of the two classification networks may both be cross-entropy losses, denoted L_p for the posture network and L_s for the category network.
According to the embodiment of the disclosure, by balancing the losses of the two classification networks against that of the density map generation network, the two classification networks and the density map generation network of the object classification model participate in gradient updates simultaneously during training. This provides the object classification model with morphological prior knowledge from both the posture and the category of the birds, so that the model weights are well suited to counting birds. Both classification networks may be plug-and-play modules.
According to an embodiment of the present disclosure, a loss function of an object classification model having a first object class classification network, a first object pose classification network, and a density map generation network is shown in the following formula.
L = 0.9 × L_θ + 0.05 × L_p + 0.05 × L_s
where L_θ represents the loss function of the density map generation network, L_p represents the loss function of the first object posture classification network, L_s represents the loss function of the first object category classification network, and L represents the loss function of the object classification model.
According to an embodiment of the present disclosure, a trained object classification model is generated in case the loss function L of the object classification model satisfies a preset threshold condition. The preset threshold condition may be specifically set according to actual requirements, and may be, for example, less than 2.5.
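Putting the three terms together, a minimal sketch of the combined training loss; the cross-entropy calls and the inlined density term follow the formulas above, and the 2.5 threshold is the example value just given:

```python
import torch
import torch.nn as nn

cross_entropy = nn.CrossEntropyLoss()

def combined_loss(pred_density, true_density,
                  pose_logits, pose_labels,
                  category_logits, category_labels) -> torch.Tensor:
    """L = 0.9 * L_theta + 0.05 * L_p + 0.05 * L_s, as given above."""
    n = pred_density.shape[0]
    l_theta = (pred_density - true_density).flatten(1).pow(2).sum(1).sum() / (2 * n)  # density-map loss
    l_p = cross_entropy(pose_logits, pose_labels)            # posture classification loss
    l_s = cross_entropy(category_logits, category_labels)    # category classification loss
    return 0.9 * l_theta + 0.05 * l_p + 0.05 * l_s

# During training, network parameters are adjusted iteratively until the loss
# satisfies the preset threshold condition, e.g. combined_loss(...).item() < 2.5.
```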
Fig. 3 schematically shows a flow chart of generating a training sample data set according to an embodiment of the present disclosure.
As shown in fig. 3, the training method of the object classification model may further include operations S310 to S340.
In operation S310, a plurality of initial training images are acquired.
In operation S320, when the pixel size of the initial training image does not satisfy a preset condition, the pixel size of the initial training image is adjusted to obtain the training image.
In operation S330, the training image is labeled to obtain the label data.
In operation S340, a training sample data set is generated according to a plurality of training images and label data corresponding to the training images.
According to an embodiment of the present disclosure, the preset condition may include a size of 1024 × 768.
According to the embodiment of the present disclosure, when the pixel size of the initial training image does not satisfy the preset condition, the pixels of the initial training image need to be adjusted; for example, they may be adjusted using a pixel-operation algorithm, which may include, but is not limited to, a resize algorithm.
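A minimal sketch of this resizing step, assuming Pillow is available; the bilinear interpolation is an illustrative choice, since the disclosure only requires some resize-style pixel operation:

```python
from PIL import Image

TARGET_SIZE = (1024, 768)  # (width, height) preset condition from the example above

def prepare_training_image(path: str) -> Image.Image:
    """Resize an initial training image whose pixel size does not meet the preset condition."""
    image = Image.open(path).convert("RGB")
    if image.size != TARGET_SIZE:
        image = image.resize(TARGET_SIZE, Image.BILINEAR)
    return image
```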
According to the embodiment of the present disclosure, when performing the labeling operation, the following operation steps may be adopted.
For example, when a posture annotation for a bird is added manually with a mouse, a mouse-listening event may be registered in the annotation software. When the mouse clicks a position in the training image, the listening event is triggered, the position coordinate of that point is stored, and a drawing component is called to draw an annotation dot centered on the coordinate. The posture of each bird is determined from the postures of the birds in the whole training image, and the postures are numbered. Alternatively, when labeling the species of the birds, the species may be annotated in the training image and given a species number.
In accordance with embodiments of the present disclosure, prior knowledge of the birds' postures and species needs to be incorporated in the annotation process. Prior knowledge here refers to information that is already known and can assist identification.
According to an embodiment of the present disclosure, after the labeling of the training image is completed, tag data for predicting the pose of the bird is generated from a plurality of position coordinates and the pose and/or pose number of the bird corresponding to each position coordinate. Or tag data for predicting the kind of birds may be generated from a plurality of position coordinates and the kind and/or kind number of birds corresponding to each position coordinate. Or tag data for predicting the kind and posture of the bird may be generated from the plurality of position coordinates and the kind and/or kind number of the bird corresponding to each position coordinate, and the posture and/or posture number of the bird.
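For the real density map that the label data may also include (see above), a common way to turn the labeled position coordinates into a ground-truth density map is to place a small Gaussian at every annotated bird position, so that the map integrates to the bird count. The disclosure does not specify the kernel or its width, so the following sketch is an assumption:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_true_density_map(points, height: int, width: int, sigma: float = 4.0) -> np.ndarray:
    """Build a ground-truth density map from labeled (x, y) position coordinates."""
    density = np.zeros((height, width), dtype=np.float32)
    for x, y in points:
        if 0 <= int(y) < height and 0 <= int(x) < width:
            density[int(y), int(x)] += 1.0           # one unit of mass per annotated bird
    return gaussian_filter(density, sigma=sigma)     # density.sum() ~= number of birds
```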
Fig. 4 schematically shows a method schematic of a training method of an object classification model according to an embodiment of the present disclosure.
As shown in fig. 4, when the deep neural network model is trained, training images in the training sample data set are sequentially input to the deep neural network model, and the first feature extraction network can extract the features of birds in the training images, thereby outputting a first feature map.
According to an embodiment of the present disclosure, the first feature map is input to a first object class classification network, a first object posture classification network, or a density map generation network to obtain a recognition result containing predicted class classification information, predicted posture classification information, or a predicted density map.
According to the embodiment of the disclosure, a loss function is calculated according to the recognition result and the label data to obtain a loss result, and under the condition that the loss result does not meet the preset threshold condition, the network parameters of the deep neural network model are iteratively adjusted, so that the trained object classification model can be generated finally.
Fig. 5 schematically shows a flow chart of an object classification prediction method according to an embodiment of the present disclosure.
As shown in fig. 5, the object classification prediction method may include operations S510 to S520.
In operation S510, an image to be recognized is acquired, the image to be recognized representing a plurality of poses of an object, the object including an animal.
In operation S520, the image to be recognized is input into the object classification model, and a recognition result is obtained, where the recognition result includes target posture classification information.
The object classification model is obtained by training with the method described above.
According to the embodiment of the disclosure, when the image to be recognized is obtained, its size needs to be checked; when the size satisfies a preset condition, the image is input into the object classification model for recognition. This preset condition may be the same as or different from the preset condition on the training image size in the training method of the object classification model.
According to the embodiment of the disclosure, the size of the image to be recognized can be adjusted by using the pixel operation algorithm under the condition that the image to be recognized does not meet the preset condition.
According to the embodiment of the disclosure, by training the deep neural network with training images that include the postures of objects such as animals, an object classification model capable of predicting the postures of the objects can be obtained, so that the postures of objects such as animals can be predicted by using the object classification model. This at least partially overcomes the technical problem that statistics on the population postures of animals such as birds rely on manual counting methods with high labor costs, and achieves the technical effect of reducing the cost of collecting posture statistics for objects such as animals.
According to an embodiment of the present disclosure, the object includes a plurality of kinds, and the animal includes a bird; the recognition result further comprises target category classification information and/or a target density map, and the target density map represents the aggregation degree of the objects at different positions in the image to be recognized.
According to an embodiment of the present disclosure, the object classification prediction method may further include the following operations.
And generating target quantity information of the object according to the target density map.
According to the embodiment of the disclosure, when counting the number of birds, an integration operation may be performed on the target density map predicted by the object classification model to obtain the target number information of the birds in the image to be recognized.
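In discrete form this integration is simply a sum over the predicted density map; a minimal sketch, where the model interface (returning only the density map for a preprocessed image tensor) is an assumption:

```python
import torch

@torch.no_grad()
def count_birds(model: torch.nn.Module, image_tensor: torch.Tensor) -> float:
    """Integrate (sum) the predicted target density map to obtain the bird count."""
    pred_density = model(image_tensor)   # image_tensor: shape (1, 3, H, W), already resized/normalized
    return pred_density.sum().item()
```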
According to the embodiment of the disclosure, since the first object posture classification network and the first object category classification network are provided as plug-and-play modules, the two classification networks can be omitted when only the number of birds needs to be counted, which reduces the computation of the object classification model and improves its efficiency.
Fig. 6 schematically shows a block diagram of a training apparatus of an object classification model according to an embodiment of the present disclosure.
As shown in fig. 6, the training apparatus 600 for the object classification model may include a first obtaining module 610, an outputting module 620, a calculating module 630, and an iterative training module 640.
The first obtaining module 610 is configured to obtain a training sample data set, where training samples in the training sample data set include training images and tag data of the training images, the training images represent multiple poses of an object, the object includes an animal, and the tag data includes real pose classification information for the poses.
The output module 620 is configured to input the training image into the deep neural network model, and output a recognition result, where the recognition result includes predicted gesture classification information.
The calculating module 630 is configured to calculate a loss function according to the recognition result and the tag data to obtain a loss result.
The iterative training module 640 is configured to iteratively adjust network parameters of the deep neural network model according to the loss result to generate a trained object classification model.
According to the embodiment of the disclosure, by training the deep neural network with training images that include the postures of objects such as animals, an object classification model capable of predicting the postures of the objects can be obtained, so that the postures of objects such as animals can be predicted by using the object classification model. This at least partially overcomes the technical problem that statistics on the population postures of animals such as birds rely on manual counting methods with high labor costs, and achieves the technical effect of reducing the cost of collecting posture statistics for objects such as animals.
According to an embodiment of the present disclosure, the object includes a plurality of kinds. The tag data further includes real category classification information generated according to the kind of the object. The recognition result further includes predicted category classification information.
According to an embodiment of the present disclosure, the deep neural network model may include a first feature extraction network, a first object class classification network, and a first object pose classification network.
According to an embodiment of the present disclosure, each of the first object category classification network and the first object posture classification network may include a first convolution layer, a maximum value pooling layer, a second convolution layer, an average value pooling layer, and a full connection layer, which are sequentially cascaded.
According to an embodiment of the present disclosure, the output module 620 may include a first input unit, a second input unit, a third input unit, a fourth input unit, a fifth input unit, and a sixth input unit.
The first input unit is used for inputting the training image into the first feature extraction network and outputting a first feature map.
The second input unit is used for inputting the first characteristic diagram into the first convolution layer and outputting a second characteristic diagram.
And the third input unit is used for inputting the second characteristic diagram into the maximum value pooling layer and outputting the compressed second characteristic diagram.
The fourth input unit is used for inputting the compressed second characteristic diagram into the second convolution layer and outputting a third characteristic diagram.
And the fifth input unit is used for inputting the third feature map into the average value pooling layer and outputting the compressed third feature map.
And the sixth input unit is used for inputting the compressed third feature map into the full-link layer and outputting the predicted posture classification information or the predicted category classification information.
According to an embodiment of the present disclosure, the tag data may further include a true density map generated from the object, and the recognition result may further include a predicted density map characterizing a true degree of clustering of the object at a plurality of locations in the training image, the predicted density map characterizing a predicted degree of clustering of the object at a plurality of locations in the training image.
According to an embodiment of the present disclosure, the deep neural network model may include a second feature extraction network, a hole convolution layer, a density map generation network, and a second object pose classification network.
According to an embodiment of the present disclosure, the output module 620 may include a seventh input unit, an eighth input unit, a ninth input unit, and a tenth input unit.
And the seventh input unit is used for inputting the training image into the second feature extraction network and outputting a fourth feature map.
The eighth input unit is configured to input the fourth feature map into the second object posture classification network, and output predicted posture classification information.
The ninth input unit is used for inputting the fourth characteristic diagram into the void convolution layer and outputting the fifth characteristic diagram.
And the tenth input unit is used for inputting the fifth feature map into the density map generation network and outputting the predicted density map.
According to an embodiment of the present disclosure, the training apparatus 600 for an object classification model may further include a third obtaining module, an adjusting module, a labeling module, and a first generating module.
The third acquisition module is used for acquiring a plurality of initial training images.
The adjusting module is used for adjusting the pixel size of the initial training image to obtain the training image when the pixel size of the initial training image does not meet the preset condition.
And the marking module is used for marking the training image to obtain the label data.
The first generation module is used for generating a training sample data set according to a plurality of training images and label data corresponding to the training images.
Fig. 7 schematically shows a block diagram of an object classification prediction apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the object classification predicting apparatus may include a second obtaining module 710 and a predicting module 720.
The second acquisition module 710 is configured to acquire an image to be recognized, the image to be recognized representing a plurality of poses of an object, the object including an animal.
The prediction module 720 is configured to input the image to be recognized into the object classification model to obtain a recognition result, where the recognition result includes target posture classification information.
The object classification model is obtained by training with the method described above.
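As a usage illustration, the sketch below shows how an image to be recognized might be preprocessed and passed through a trained object classification model to obtain target posture classification information. The preprocessing pipeline, input size, and posture class names are hypothetical and are not defined by this disclosure; the model is assumed to return classification logits.

    import torch
    from torchvision import transforms
    from PIL import Image

    POSTURE_NAMES = ["standing", "flying", "swimming", "feeding"]  # hypothetical posture labels

    preprocess = transforms.Compose([
        transforms.Resize((512, 512)),
        transforms.ToTensor(),
    ])

    def predict_posture(model, image_path):
        """Run the trained object classification model on one image to be recognized."""
        image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
        model.eval()
        with torch.no_grad():
            logits = model(image)                           # recognition result
        return POSTURE_NAMES[logits.argmax(dim=1).item()]   # target posture classification information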
According to the embodiments of the present disclosure, by training the deep neural network with training images that include the postures of objects such as animals, an object classification model capable of predicting the postures of the objects can be obtained, so that the postures of objects such as animals can be predicted with the object classification model. Therefore, the technical problem that statistics on the population postures of animals such as birds depend on a manual counting method with high labor cost is at least partially overcome, and the technical effect of reducing the cost of collecting posture statistics for objects such as animals is achieved.
According to an embodiment of the present disclosure, the object may have a plurality of categories, and the animal may include a bird; the recognition result may further include target category classification information and/or a target density map, where the target density map characterizes the degree of aggregation of the objects at different positions in the image to be recognized.
According to an embodiment of the present disclosure, the object classification prediction apparatus 700 may further include a second generation module.
The second generation module is used for generating target quantity information of the object according to the target density map.
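A common convention for density-map counting, stated here only as an assumption, is that each object contributes unit mass to the density map, so the target quantity information can be obtained by integrating (summing) the target density map, as in the short sketch below.

    import torch

    def count_from_density_map(target_density_map: torch.Tensor) -> float:
        """Estimate the number of objects from a predicted target density map."""
        # target_density_map: tensor of shape (1, 1, H, W) from the density map generation network
        return target_density_map.sum().item()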
Any number of the modules and units according to the embodiments of the present disclosure, or at least part of the functionality of any of them, may be implemented in one module. Any one or more of the modules and units according to the embodiments of the present disclosure may be split into a plurality of modules for implementation. Any one or more of the modules and units according to the embodiments of the present disclosure may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, or an Application Specific Integrated Circuit (ASIC), may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or may be implemented in any one of software, hardware, and firmware, or in a suitable combination of the three. Alternatively, one or more of the modules and units according to the embodiments of the present disclosure may be implemented at least partly as computer program modules which, when executed, may perform the corresponding functions.
For example, any number of the first obtaining module 610, the output module 620, the calculating module 630, and the iterative training module 640, or of the second acquisition module 710 and the prediction module 720, may be combined and implemented in one module/unit, or any one of them may be split into a plurality of modules/units. Alternatively, at least part of the functionality of one or more of these modules/units may be combined with at least part of the functionality of other modules/units and implemented in one module/unit. According to an embodiment of the present disclosure, at least one of the first obtaining module 610, the output module 620, the calculating module 630, and the iterative training module 640, or of the second acquisition module 710 and the prediction module 720, may be at least partially implemented as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, or an Application Specific Integrated Circuit (ASIC), may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or may be implemented in any one of software, hardware, and firmware, or in a suitable combination of the three. Alternatively, at least one of these modules may be at least partially implemented as a computer program module that, when executed, may perform a corresponding function.
It should be noted that the training apparatus of the object classification model in the embodiments of the present disclosure corresponds to the training method of the object classification model in the embodiments of the present disclosure, so the description of the training apparatus may refer to the description of the training method and is not repeated here. Likewise, the object classification prediction apparatus in the embodiments of the present disclosure corresponds to the object classification prediction method, so the description of the prediction apparatus may refer to the description of the prediction method and is not repeated here.
Fig. 8 schematically shows a block diagram of an electronic device adapted to implement the above-described methods according to an embodiment of the present disclosure. The electronic device shown in fig. 8 is only an example and should not impose any limitation on the functions or the scope of use of the embodiments of the present disclosure.
As shown in fig. 8, an electronic device 800 according to an embodiment of the present disclosure includes a processor 801 that can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The processor 801 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 801 may also include onboard memory for caching purposes. The processor 801 may include a single processing unit or multiple processing units for performing different actions of the method flows according to embodiments of the present disclosure.
In the RAM 803, various programs and data necessary for the operation of the electronic apparatus 800 are stored. The processor 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. The processor 801 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 802 and/or RAM 803. Note that the programs may also be stored in one or more memories other than the ROM 802 and RAM 803. The processor 801 may also perform various operations of method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 800 may also include an input/output (I/O) interface 805, which is also connected to the bus 804. The electronic device 800 may also include one or more of the following components connected to the I/O interface 805: an input portion 806 including a keyboard, a mouse, and the like; an output portion 807 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage portion 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as necessary, so that a computer program read out therefrom is installed into the storage portion 808 as necessary.
According to embodiments of the present disclosure, method flows according to embodiments of the present disclosure may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program, when executed by the processor 801, performs the above-described functions defined in the system of the embodiments of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to an embodiment of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium. Examples may include, but are not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the preceding. In the present disclosure, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
For example, according to embodiments of the present disclosure, a computer-readable storage medium may include the ROM 802 and/or RAM 803 described above and/or one or more memories other than the ROM 802 and RAM 803.
Embodiments of the present disclosure also include a computer program product comprising a computer program, the computer program containing program code for performing the methods provided by the embodiments of the present disclosure. When the computer program product runs on an electronic device, the program code is configured to cause the electronic device to implement the training method of the object classification model or the object classification prediction method provided by the embodiments of the present disclosure.
The computer program, when executed by the processor 801, performs the above-described functions defined in the system/apparatus of the embodiments of the present disclosure. The systems, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted in the form of a signal on a network medium, distributed, downloaded and installed via communication section 809, and/or installed from removable media 811. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
According to embodiments of the present disclosure, the program code for carrying out the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, these computer programs may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. The programming languages include, but are not limited to, Java, C++, Python, the C language, and the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure may be combined and/or associated in various ways, even if such combinations or associations are not expressly recited in the present disclosure. In particular, the features recited in the various embodiments and/or claims of the present disclosure may be combined and/or associated in various ways without departing from the spirit or teaching of the present disclosure. All such combinations and/or associations fall within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (14)

1. A method of training an object classification model, comprising:
acquiring a training sample data set, wherein training samples in the training sample data set comprise training images and label data of the training images, wherein the training images represent a plurality of postures of an object, the object comprises an animal, and the label data comprises real posture classification information for the postures;
inputting the training image into a deep neural network model, and outputting a recognition result, wherein the recognition result comprises predicted posture classification information;
calculating a loss function according to the identification result and the label data to obtain a loss result; and
iteratively adjusting network parameters of the deep neural network model according to the loss result to generate a trained object classification model.
2. The method of claim 1, wherein the object has a plurality of categories; the label data further comprises real category classification information generated according to the category of the object; and the recognition result further comprises predicted category classification information.
3. The method of claim 2, wherein the deep neural network model comprises a first feature extraction network, a first object category classification network, and a first object posture classification network;
the first object category classification network and the first object posture classification network each comprise a first convolution layer, a maximum value pooling layer, a second convolution layer, an average value pooling layer, and a fully connected layer which are sequentially cascaded.
4. The method of claim 3, wherein the inputting the training image into a deep neural network model and outputting a recognition result comprises:
inputting the training image into the first feature extraction network, and outputting a first feature map;
inputting the first feature map into the first convolution layer, and outputting a second feature map;
inputting the second feature map into the maximum value pooling layer, and outputting a compressed second feature map;
inputting the compressed second feature map into the second convolution layer, and outputting a third feature map;
inputting the third feature map into the average value pooling layer, and outputting a compressed third feature map;
and inputting the compressed third feature map into the fully connected layer, and outputting the predicted posture classification information or the predicted category classification information.
5. The method of claim 1, wherein the label data further comprises a true density map generated from the object, the recognition result further comprising a predicted density map, wherein the true density map characterizes a true degree of clustering of the object at a plurality of locations in the training image, and the predicted density map characterizes a predicted degree of clustering of the object at a plurality of locations in the training image.
6. The method of claim 1 or 5, wherein the deep neural network model comprises a second feature extraction network, a hole convolution layer, a density map generation network, and a second object posture classification network;
wherein, the inputting the training image into the deep neural network model and outputting the recognition result comprises:
inputting the training image into the second feature extraction network, and outputting a fourth feature map;
inputting the fourth feature map into the second object posture classification network, and outputting the predicted posture classification information;
inputting the fourth feature map into the hole convolution layer, and outputting a fifth feature map;
and inputting the fifth feature map into the density map generation network, and outputting a predicted density map.
7. The method of claim 1, further comprising:
acquiring a plurality of initial training images;
under the condition that the pixel size of the initial training image does not meet a preset condition, adjusting the pixel size of the initial training image to obtain the training image;
labeling the training image to obtain the label data; and
and generating the training sample data set according to the plurality of training images and the label data corresponding to the training images.
8. An object classification prediction method, comprising:
acquiring an image to be recognized, wherein the image to be recognized represents a plurality of postures of an object, and the object comprises an animal; and
inputting the image to be recognized into an object classification model to obtain a recognition result, wherein the recognition result comprises target posture classification information;
wherein the object classification model is trained using the method of any one of claims 1 to 7.
9. The method of claim 8, wherein the object has a plurality of species, and the animal includes a bird; the recognition result further comprises target category classification information and/or a target density map, and the target density map characterizes the degree of aggregation of the objects at different positions in the image to be recognized;
wherein the method further comprises:
and generating target quantity information of the object according to the target density map.
10. An apparatus for training an object classification model, comprising:
a first obtaining module, configured to acquire a training sample data set, where a training sample in the training sample data set includes a training image and label data of the training image, where the training image represents a plurality of postures of an object, the object includes an animal, and the label data includes real posture classification information for the postures;
the output module is used for inputting the training image into a deep neural network model and outputting a recognition result, wherein the recognition result comprises predicted posture classification information;
the calculation module is used for calculating a loss function according to the identification result and the label data to obtain a loss result; and
and the iterative training module is used for iteratively adjusting the network parameters of the deep neural network model according to the loss result to generate a trained object classification model.
11. An object classification prediction apparatus comprising:
the second acquisition module is used for acquiring an image to be recognized, wherein the image to be recognized represents a plurality of postures of an object, and the object comprises an animal; and
the prediction module is used for inputting the image to be recognized into an object classification model to obtain a recognition result, wherein the recognition result comprises target posture classification information;
wherein the object classification model is trained using the method of any one of claims 1 to 7.
12. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7 or 8-9.
13. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to carry out the method of any one of claims 1 to 7 or 8 to 9.
14. A computer program product comprising a computer program which, when executed by a processor, is adapted to carry out the method of any one of claims 1 to 7 or 8 to 9.
CN202111173528.5A 2021-10-08 2021-10-08 Training method of object classification model, object classification prediction method and device Pending CN113887447A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111173528.5A CN113887447A (en) 2021-10-08 2021-10-08 Training method of object classification model, object classification prediction method and device


Publications (1)

Publication Number Publication Date
CN113887447A (en) 2022-01-04

Family

ID=79005578

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111173528.5A Pending CN113887447A (en) 2021-10-08 2021-10-08 Training method of object classification model, object classification prediction method and device

Country Status (1)

Country Link
CN (1) CN113887447A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165636A (en) * 2018-09-28 2019-01-08 南京邮电大学 A kind of sparse recognition methods of Rare Birds based on component-level multiple features fusion
CN110414432A (en) * 2019-07-29 2019-11-05 腾讯科技(深圳)有限公司 Training method, object identifying method and the corresponding device of Object identifying model
WO2021184530A1 (en) * 2020-03-18 2021-09-23 清华大学 Reinforcement learning-based label-free six-dimensional item attitude prediction method and device
CN111597937A (en) * 2020-05-06 2020-08-28 北京海益同展信息科技有限公司 Fish gesture recognition method, device, equipment and storage medium
CN112132026A (en) * 2020-09-22 2020-12-25 平安国际智慧城市科技股份有限公司 Animal identification method and device
CN113011402A (en) * 2021-04-30 2021-06-22 中国科学院自动化研究所 System and method for estimating postures of primates based on convolutional neural network
CN113159300A (en) * 2021-05-15 2021-07-23 南京逸智网络空间技术创新研究院有限公司 Image detection neural network model, training method thereof and image detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HONGCHANG WANG et al.: "Bird-Count: a multi-modality benchmark and system for bird population counting in the wild", MULTIMEDIA TOOLS AND APPLICATIONS, no. 82, 29 April 2023 (2023-04-29)
CHEN Yingyi; GONG Chuanyang; LIU Yeqi; FANG Xiaomin: "Fish recognition method based on FTVGG16 convolutional neural network" (in Chinese), Transactions of the Chinese Society for Agricultural Machinery, no. 05, 28 February 2019 (2019-02-28)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114022960A (en) * 2022-01-05 2022-02-08 阿里巴巴达摩院(杭州)科技有限公司 Model training and behavior recognition method and device, electronic equipment and storage medium
CN114022960B (en) * 2022-01-05 2022-06-14 阿里巴巴达摩院(杭州)科技有限公司 Model training and behavior recognition method and device, electronic equipment and storage medium
CN115529159A (en) * 2022-08-16 2022-12-27 中国电信股份有限公司 Encrypted flow detection model training method, device, equipment and storage medium
CN115529159B (en) * 2022-08-16 2024-03-08 中国电信股份有限公司 Training method, device, equipment and storage medium of encryption traffic detection model
CN117556068A (en) * 2024-01-12 2024-02-13 中国科学技术大学 Training method of target index model, information retrieval method and device
CN117556068B (en) * 2024-01-12 2024-05-17 中国科学技术大学 Training method of target index model, information retrieval method and device

Similar Documents

Publication Publication Date Title
WO2021169723A1 (en) Image recognition method and apparatus, electronic device, and storage medium
WO2022042002A1 (en) Training method for semi-supervised learning model, image processing method, and device
CN111797893B (en) Neural network training method, image classification system and related equipment
WO2020238293A1 (en) Image classification method, and neural network training method and apparatus
CN108304775B (en) Remote sensing image recognition method and device, storage medium and electronic equipment
US11823443B2 (en) Segmenting objects by refining shape priors
CN113887447A (en) Training method of object classification model, object classification prediction method and device
US11768876B2 (en) Method and device for visual question answering, computer apparatus and medium
CN108230346B (en) Method and device for segmenting semantic features of image and electronic equipment
WO2019232772A1 (en) Systems and methods for content identification
CN111950596A (en) Training method for neural network and related equipment
US11163989B2 (en) Action localization in images and videos using relational features
CN108229658B (en) Method and device for realizing object detector based on limited samples
WO2022143366A1 (en) Image processing method and apparatus, electronic device, medium, and computer program product
CN112418327A (en) Training method and device of image classification model, electronic equipment and storage medium
CN111738403A (en) Neural network optimization method and related equipment
WO2024001806A1 (en) Data valuation method based on federated learning and related device therefor
CN113743426A (en) Training method, device, equipment and computer readable storage medium
CN115063585A (en) Unsupervised semantic segmentation model training method and related device
CN115272794A (en) Model training method, computer device, and storage medium
CN114723989A (en) Multitask learning method and device and electronic equipment
CN113569081A (en) Image recognition method, device, equipment and storage medium
CN112801107A (en) Image segmentation method and electronic equipment
CN114913339B (en) Training method and device for feature map extraction model
CN116151323A (en) Model generation method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination