CN111191781A - Method of training neural network, object recognition method and apparatus, and medium


Info

Publication number
CN111191781A
Authority
CN
China
Prior art keywords
class
training
neural network
samples
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811349293.9A
Other languages
Chinese (zh)
Inventor
黄耀海
陶训强
彭健腾
邓伟洪
胡佳妮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Canon Inc
Original Assignee
Beijing University of Posts and Telecommunications
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications, Canon Inc filed Critical Beijing University of Posts and Telecommunications
Priority to CN201811349293.9A
Publication of CN111191781A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting


Abstract

The present disclosure relates to a method of training a neural network, an object recognition method and apparatus, and a medium. The method of training a neural network comprises: extracting features from a first training set using an initial model of the neural network; adjusting a parameter associated with the margin between a first class and at least one other class according to a distribution of at least one attribute of the classes in the first training set; calculating a loss from the adjusted parameter associated with the margin; and updating the initial model using the calculated loss and a back-propagation algorithm to obtain a trained model.

Description

Method of training neural network, object recognition method and apparatus, and medium
Technical Field
The present disclosure relates generally to object recognition, and in particular to the field of training neural networks for identifying objects.
Background
The Softmax loss function is widely used in object recognition, as it guarantees the separability of features. However, when intra-class variation may be larger than inter-class differences (e.g., when face recognition is performed over millions of classes), the Softmax loss function is not sufficiently effective at producing discriminative features.
Recently, the mainstream approach has been to add a margin to the Softmax loss function, that is, to replace the Softmax loss function with a margin loss function so as to increase the difficulty of learning, forcing the model to continuously learn more discriminative features, thereby increasing the inter-class distance and decreasing the intra-class distance.
The article "ArcFace: Additive Angular Margin Loss for Deep Face Recognition", published by Deng J, Guo J, and Zafeiriou S on arXiv in 2018, proposes a geometrically interpretable margin loss function, called ArcFace (cos(θ + m)). This margin loss function directly maximizes the angular decision boundary (arc) based on L2-normalized weights and features.
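As an illustrative sketch (not part of the patent text; the function name and default values are assumptions), the ArcFace modification of the Softmax logits can be expressed in a few lines of numpy: the angular margin m is added to the ground-truth class angle before scaling by s.

```python
import numpy as np

def arcface_logits(features, weights, labels, s=64.0, m=0.5):
    """Sketch of the ArcFace logit modification: add an angular margin m
    to each sample's target-class angle before scaling by s."""
    # L2-normalize features (N, d) and class weights (d, C)
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos_theta = np.clip(f @ w, -1.0, 1.0)   # (N, C) cosine similarities
    theta = np.arccos(cos_theta)
    rows = np.arange(len(labels))
    theta[rows, labels] += m                # margin on the ground-truth class only
    return s * np.cos(theta)
```

Because cos is decreasing on [0, π], the margin strictly lowers the target-class logit, which is what forces the network to learn more discriminative features.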
Disclosure of Invention
The margin loss functions in the prior art are able to distinguish features effectively. However, in the prior art the margin is fixed, which makes it difficult to handle the long-tail problem (some classes have enough samples for training, while for most classes only a few samples are available) and the low-shot problem (e.g., each class contains a single sample); such classes are also referred to as minority classes. This is because the training set is likely to contain minority classes with few training samples, so the feature space of such a class is narrow.
For a majority class with sufficient samples, the feature distribution approximates the true distribution in the feature space, which makes the class easy to identify. For a minority class, the observed feature distribution is likely to be much narrower than the true distribution in the feature space. Therefore, if a small margin is used for a minority class, erroneous recognition is easily caused.
FIGS. 1B and 1C illustrate the use of a smaller and a larger margin when training (classifying) and testing (identifying) a minority class. As shown in FIGS. 1B and 1C, there are two classes in the feature space, with weight vectors w1 and w2 respectively, where class w2 is a minority class. Filled circles 123 represent training samples of class w1, open circles 122 represent training samples of class w2, and filled star 121 and open star 124 represent test samples. As shown in FIG. 1B, when the margin m is small, test sample 124, which actually belongs to the minority class w2, may be erroneously identified at test time as belonging to the adjacent majority class w1. When the margin m is increased, as shown in FIG. 1C, test sample 124 is correctly identified as belonging to class w2, because class w2 is sufficiently separated from class w1.
As can be seen from FIGS. 1B and 1C, for the long-tail and low-shot problems, a larger margin is needed in the loss function used to train the neural network model.
In addition, some special classes are hard to distinguish (e.g., classes with small inter-class distances, such as face images of twins), while some classes of higher importance (e.g., face images of people who frequently use a given system) impose higher requirements on recognition accuracy. A fixed margin is not sufficient to achieve effective class discrimination and high-accuracy object recognition in these cases.
Therefore, there is a need for an adaptive training method that adjusts the margin according to the distribution of attributes of the classes in the training set when calculating the loss used to train the neural network.
According to an aspect of the present disclosure, there is provided a method of training a neural network, the method comprising: extracting features from a first training set using an initial model of the neural network; adjusting a parameter associated with the margin between a first class and at least one other class according to a distribution of at least one attribute of the classes in the first training set; calculating a loss from the adjusted parameter associated with the margin; and updating the initial model using the calculated loss and a back-propagation algorithm to obtain a trained model.
According to another aspect of the present disclosure, there is provided a method of training a plurality of neural network models, the method comprising: obtaining a plurality of different training sets from a base training set; for each of a plurality of different training sets, training is performed using the methods described in accordance with any one or any combination of the embodiments of the present disclosure to arrive at a plurality of different trained models.
According to another aspect of the present disclosure, there is provided an object recognition method comprising performing object recognition using a trained model obtained according to the method of training a neural network described in any one of the embodiments of the present disclosure.
According to yet another aspect of the present disclosure, there is provided an apparatus for training a neural network, the apparatus comprising a processor and a memory, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method according to any one or any combination of the embodiments of the present disclosure.
According to still another aspect of the present disclosure, there is provided an object recognition apparatus including: an apparatus for training a neural network as described in embodiments of the present disclosure; a feature extraction device configured to extract features from an input object using a trained model; and similarity calculation means configured to calculate a similarity between the extracted feature and a feature of the real object.
According to yet another aspect of the present disclosure, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method described according to any one or any combination of the embodiments of the present disclosure.
Drawings
The above and other objects and advantages of embodiments of the present disclosure are further described below in conjunction with the specific embodiments, and with reference to the accompanying drawings. In the drawings, the same or corresponding technical features or components will be denoted by the same or corresponding reference numerals.
Fig. 1A shows a flow diagram of a method of training a neural network according to a first embodiment of the present disclosure.
FIG. 1B shows a schematic diagram of using a smaller margin when training and testing a minority class.
FIG. 1C shows a schematic diagram of using a larger margin when training and testing a minority class.
Fig. 2 shows a block diagram of one example of a training method according to a first embodiment of the present disclosure.
Fig. 3 shows a flow chart of dynamically adjusting the margin using the per-class sample-number distribution according to a second embodiment of the present disclosure.
Fig. 4 shows a specific example of the second embodiment.
Fig. 5 shows a schematic diagram of one example of a function according to a third embodiment of the present disclosure.
Fig. 6A shows a schematic diagram of a deep Q learning network according to a fourth embodiment of the present disclosure.
Fig. 6B shows a flowchart of an example of a training method that adjusts the margin using a DQN according to a fourth embodiment of the present disclosure.
Fig. 6C shows a flowchart of an example of a method of collecting samples using a pre-trained neural network according to a fourth embodiment of the present disclosure.
Fig. 6D shows a flow chart of margin adjustment using a DQN according to a fourth embodiment of the disclosure.
Fig. 6E shows a flowchart of an example of a method of calculating a loss using the adjusted margin according to a fourth embodiment of the present disclosure.
Fig. 7A shows an example of a flow chart of a method of feature augmentation using neighbor distribution according to a fifth embodiment of the present disclosure.
Fig. 7B shows a schematic diagram of a method of feature augmentation using neighbor distribution according to a fifth embodiment of the present disclosure.
Fig. 8A shows a flowchart of a method of feature augmentation using gaussian distribution according to a sixth embodiment of the present disclosure.
Fig. 8B shows a schematic diagram of a method of feature augmentation using gaussian distribution according to a sixth embodiment of the present disclosure.
Fig. 9A shows a flowchart of one example of a method of training a neural network according to a seventh embodiment of the present disclosure.
Fig. 9B shows a flowchart of another example of a method of training a neural network according to a seventh embodiment of the present disclosure.
Fig. 9C shows a flowchart of another example of a method of training a neural network according to a seventh embodiment of the present disclosure.
Fig. 10A shows a flowchart of an example of a method of calculating a loss from the noise types and margins in a training set according to an eighth embodiment of the present disclosure.
Fig. 10B shows a schematic diagram of an example of a combined training method according to a seventh embodiment and an eighth embodiment of the present disclosure.
FIG. 11 shows a schematic diagram of classes in a feature space.
Fig. 12 shows a schematic diagram of the distribution of the inner products of the weights of the last fully-connected layer and the inner products of random vectors, in a neural network trained with ArcFace, according to the ninth embodiment of the present disclosure.
Fig. 13 illustrates an example flow diagram for training a neural network model according to an orthogonal loss function in accordance with a ninth embodiment of the present disclosure.
Fig. 14 shows a flow chart of an example of a multi-model training method according to a tenth embodiment of the present disclosure.
Fig. 15 shows a flow chart of an example of a training method according to an eleventh embodiment of the present disclosure.
Fig. 16 shows a flowchart of an example of an object recognition method according to a twelfth embodiment of the present disclosure.
Fig. 17 shows a block diagram of an apparatus for training a neural network according to a thirteenth embodiment of the present disclosure.
Fig. 18 shows a block diagram of an object recognition apparatus according to a fourteenth embodiment of the present disclosure.
Fig. 19 shows a block diagram of a hardware configuration of a computer system capable of implementing a fifteenth embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an embodiment have been described in the specification. It should be appreciated, however, that in the development of any such actual embodiment, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with device-related and business-related constraints, which may vary from one implementation to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure.
Here, it should also be noted that, in order to avoid obscuring the present disclosure with unnecessary detail, only process steps and/or system structures germane to at least the scheme according to the present disclosure are shown in the drawings, and other details not germane to the present disclosure are omitted.
Next, description is made in the following order.
First embodiment
A neural network is a mathematical-model-based simulation and abstraction of the neuronal networks of the human brain. A neural network model comprises a plurality of neurons (nodes), which can be divided into an input layer, hidden layers, and an output layer. Nodes of adjacent layers can be connected to each other, and each connection is typically represented by a directed edge carrying a weight value. The output of the neural network can thus be expressed in terms of the connection relationships between nodes and the weight values.
In a neural network, the error between the actual output and the target output is back-propagated using a back-propagation algorithm, so that the weight values in the model are continuously updated and the output error becomes smaller and smaller, thereby achieving effective learning. Neural networks can be applied in various fields such as clustering and object recognition.
The structure of the neural network may be, for example, VGG16, ResNet, SENet, or the like.
A flow chart of a method of training a neural network according to the first embodiment of the present disclosure is first described with reference to Fig. 1A. In the method, an initial model of the neural network is trained using a first training set, and a parameter associated with the margin is adaptively adjusted according to at least one attribute of each class of the training set, in order to calculate the loss and obtain a trained neural network model.
In step S101, features are extracted from a first training set using an initial model of a neural network.
Here, the first training set may be a set of data carrying the object to be trained. Preferably, the first training set may be, for example, a set of pictures, such as portrait pictures, car pictures, animal pictures, and the like. Alternatively or additionally, the first training set may be a set of sounds, such as a set of songs, a set of dialects, a set of animal vocals, and so on. Preferably, the first training set is a set of face pictures.
The object may be, for example, a human face, a human body, a physical object, a sound, or an attribute such as age or gender.
The features are abstract representations of the raw data in the first training set, and the features may be represented as vectors, for example.
The feature space is the space in which the features exist. Preferably, the feature space may be represented by a hypersphere formed by L2-normalizing each feature vector.
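The hypersphere representation mentioned above amounts to L2-normalizing every feature vector; a minimal sketch (the function name is illustrative, not from the patent):

```python
import numpy as np

def to_hypersphere(x):
    """Project feature vectors onto the unit hypersphere by L2 normalization,
    giving the hypersphere feature space described above."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)
```

After this projection, the cosine of the angle between two features is simply their inner product, which is what margin loss functions such as ArcFace operate on.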
In step S102, a parameter associated with the margin between a first class and at least one other class is adjusted according to a distribution of at least one attribute of the respective classes in the first training set.
A class is a collection of all samples of the same object or some objects with the same or similar properties. Preferably, the class may be a picture of the same person in the set of pictures of the face. Alternatively or additionally, the class may be a picture of some people in the first training set.
Preferably, the first class is any class in the first training set. Alternatively or additionally, the first class is any class in a subset of the first training set.
Preferably, the attribute of the class may be a sample number of the class. Alternatively or additionally, attributes of a class may include the difficulty of the class, or the importance of the class, and so forth.
The feature space of a class refers to a subspace defined by all samples of the class in the feature space.
The margin refers to the angular separation between the feature spaces of two classes in the feature space. For example, preferably, the parameter associated with the margin may be the margin itself. Alternatively or additionally, the parameter associated with the margin may be the class-center similarity between classes.
In step S103, the loss is calculated from the adjusted parameters associated with the spacing.
In the neural network, the loss is obtained by comparing the output of the last layer with a target output (GT, ground truth). The loss may be expressed as a function, for example. The loss function may be a logarithmic loss function, a quadratic loss function, an exponential loss function, a Hinge loss function, etc., according to various needs. Preferably, the loss can be calculated using a margin loss function. The margin loss function may be SphereFace, CosFace, or ArcFace, among others. Alternatively or additionally, the loss may be calculated using the class-center similarity.
In step S104, it is determined whether the loss satisfies a predetermined condition, and if not, the process proceeds to step S105. If so, the process ends.
In step S105, the neural network model is updated using the calculated loss and back propagation algorithms to obtain a trained model.
Fig. 2 shows a block diagram of one example of a training method according to a first embodiment of the present disclosure.
As shown in fig. 2, on the one hand, an input image 202 is obtained from a first training set 201 through preprocessing, and features are extracted through a neural network 204 to obtain a feature map 205. On the other hand, the margin is adaptively adjusted according to the distribution of at least one attribute of each class of the training set to obtain an adjusted margin 203. The loss 206 is then calculated, for example, using the margin loss function ArcFace. Thereafter, the parameters of the neural network are updated using a back-propagation algorithm 207 based on the calculated loss, resulting in a trained neural network model.
It can be seen that the present disclosure adaptively adjusts the margin according to the distribution of at least one attribute of the classes in a training set, thereby avoiding the disadvantage of the fixed margin in the prior art and improving the accuracy of the trained neural network; the margin adjustment during training is particularly suitable for unbalanced training data with long tails and few shots. Therefore, object recognition using the neural network can be performed more accurately.
Second embodiment
Fig. 3 shows a flow chart of dynamically adjusting the margin using the per-class sample-number distribution according to a second embodiment of the present disclosure. In this embodiment, the attribute may be the number of samples of a class, the parameter associated with the margin may be the margin itself, and the value of the margin is determined directly from the number of samples of the class. The method comprises the following steps:
S301: for the first training set, count the number of samples of each class.
S302: divide all classes into N parts according to the number of samples of each class.
S303: define a margin for each part, where a larger margin value is assigned to classes with few samples and a smaller margin value to classes with many samples; that is, the margin is inversely related to the number of samples.
Fig. 4 shows a specific example of this embodiment. In this example, the horizontal axis represents the number of samples, the vertical axis represents the margin m, and all classes are divided into 3 parts:
1) when the number of samples of the class is greater than or equal to 100, let m = 0.3;
2) when the number of samples of the class is less than 100 and greater than or equal to 50, let m = 0.4;
3) when the number of samples of the class is less than 50, let m = 0.5. This enables adaptive adjustment of the margin according to the number of samples.
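The piecewise rule above can be sketched as a small Python helper (the function name is illustrative, not from the patent; the thresholds and margin values follow the Fig. 4 example):

```python
def margin_for_class(num_samples):
    """Piecewise margin from the second embodiment: fewer samples -> larger margin.
    Thresholds (100, 50) and margins (0.3, 0.4, 0.5) follow the Fig. 4 example."""
    if num_samples >= 100:
        return 0.3
    elif num_samples >= 50:
        return 0.4
    else:
        return 0.5
```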
Subsequent steps S103 to S105 in Fig. 1A may be performed similarly after the margin is adjusted. In this embodiment, the loss is preferably calculated from the adjusted margin m using a margin loss function. In this embodiment and the subsequent ones, mainly the differences from the previous embodiments are described, and descriptions overlapping with previous embodiments are not repeated.
This embodiment therefore takes the distribution of per-class sample numbers into account when training the neural network, uses different margins for different classes, and improves the correct recognition rate of the trained model.
Table 1 compares the correct recognition rate of the model obtained by the training method of the above example of the second embodiment with that of the prior-art method. The models obtained by the method of the second embodiment and by the prior-art method were tested on the FaceScrub and FGNet test sets; 10^6 is the size of the gallery database, and Rank1 refers to the first candidate. It can be seen that the correct recognition rate of the disclosed method is superior to the prior-art method on both test sets.
Algorithm (LResNet50E-IR)                  Rank1@10^6 (FaceScrub)   Rank1@10^6 (FGNet)
Prior art: ArcFace (m = 0.5)               77.1%                    58.26%
Second embodiment (m = 0.3, 0.4, 0.5)      79.35%                   60.10%
TABLE 1. Comparison of the results of the second embodiment and the prior art
Third embodiment
In this embodiment, the attribute may be the number of samples of a class, the parameter associated with the margin may be the margin itself, and a function is defined to dynamically adjust the margin, with a negative correlation between the margin and the number of samples of the class.
For a class i, its margin m(i) can be defined as follows:
m(i) = A - B * f(Ni / C), (1)
where Ni is the number of samples of class i, and A, B, and C are constants that can be obtained by experiment or experience; f is a function.
Fig. 5 shows an example of this embodiment. In this example, A = 0.5, B = 0.1, C is the maximum (max) of the sample numbers of all classes, min is the minimum of the sample numbers of all classes, and the function f is chosen to be the arcsin function. In Fig. 5, the abscissa represents the number of samples within a class and the ordinate represents the margin value; it can be seen that as the number of samples increases, the margin decreases accordingly.
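A minimal sketch of Eq. (1) with the Fig. 5 choices (f = arcsin, C = maximum class size; the function name and defaults are illustrative):

```python
import math

def adaptive_margin(num_samples, max_samples, A=0.5, B=0.1):
    """Margin function from Eq. (1): m(i) = A - B * f(Ni / C),
    with f = arcsin and C = max class size, as in the Fig. 5 example."""
    return A - B * math.asin(num_samples / max_samples)
```

As the class size grows from 1 to C, the margin decreases smoothly from about A toward A - B * pi/2.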
Subsequent steps S103 to S105 in Fig. 1A may be performed similarly after the margin is adjusted. Compared with the second embodiment, the method of this embodiment can adjust the margin m more flexibly, thereby obtaining more accurate training and recognition results.
Fourth embodiment
In this embodiment, the attribute may be the number of samples of a class, the parameter associated with the margin may be the margin itself, and the margin is automatically learned and adjusted using a deep Q-learning network (DQN) method.
In a DQN, Q(s_i, a_i) is defined as the cumulative return obtained by taking an action a_i in the current state s_i. r_i denotes the immediate return obtained by taking action a_i in the current state s_i.
In training, the Q function may be iteratively updated using the following formula:
Q(s_i, a_i) = r_i + γ max_{a'_i} Q(s'_i, a'_i), (2)
where s'_i and a'_i represent the next state and action, and the learning parameter γ is a constant satisfying 0 ≤ γ ≤ 1. The index i refers to the y_i-th class. When γ approaches 0, mainly the immediate return is considered; when γ approaches 1, future returns are also taken into account. A specific DQN algorithm can be found in: "Human-level control through deep reinforcement learning", published in Nature, 2015, by Mnih V, Kavukcuoglu K, Silver D, et al.
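The Q iteration of Eq. (2) can be illustrated with a simplified tabular stand-in for the deep network (the function name, learning rate, and action labels are assumptions for illustration):

```python
def q_update(Q, s, a, r, s_next, actions, gamma=0.9, alpha=0.5):
    """One tabular Q-learning step moving Q(s, a) toward the target
    r + gamma * max_a' Q(s', a'), a simplified stand-in for the DQN
    iteration of Eq. (2). Q is a dict keyed by (state, action)."""
    old = Q.get((s, a), 0.0)
    target = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = old + alpha * (target - old)
    return Q[(s, a)]
```

In the actual DQN, the table is replaced by a neural network trained by regressing its output toward the same target.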
In this embodiment, the DQN can be used to determine the direction of change of the margin from the current margin, the number of samples of the class, the intra-class variance, and the inter-class distance. One non-limiting example of this embodiment is described below.
In this embodiment, the input of the DQN can be expressed, for example, as {(s_i, a_i, r_i, s'_i)}. The state, action, and immediate return of the DQN can be determined, for example, as follows.
State
It is contemplated that the margin required by each class is affected by the number of samples in the class and by the intra-class variance. Thus, the state s_i may include parameters related to the current margin, the number of samples of the class, and the intra-class variance, where all classes are divided into groups according to their sample number and intra-class variance. A value g_{y_i} in the group set G represents the group number of the y_i-th class, n_{y_i} denotes the number of samples in the y_i-th class, and v_{y_i} denotes the intra-class variance of the y_i-th class, which can be formulated as follows:
v_{y_i} = (1 / n_{y_i}) * sum_j || x_j^{y_i} - c_{y_i} ||^2, (3)
wherein
c_{y_i} = (1 / n_{y_i}) * sum_j x_j^{y_i}. (4)
Here, x_j^{y_i} is the feature of the j-th sample in the y_i-th class, extracted by the trained neural network and L2-normalized. The trained neural network is obtained by pre-training with a fixed-margin loss function.
For a group G_i, the intra-class variances are calculated using the current neural network and averaged to obtain a mean value V_i, which can be formulated as follows:
V_i = (1 / N_i) * sum_{y in G_i} v_y, (5)
wherein N_i denotes the number of classes in group G_i, and for each class y in the group,
v_y = (1 / n_y) * sum_j || x_j - c_y ||^2, (6)
c_y = (1 / n_y) * sum_j x_j. (7)
Here, x_j are features extracted by the current neural network and L2-normalized. To make the state space discrete, the margin m_i(t) is taken from a discrete set M, and a function f(V_i) in a discrete set F is defined to quantize V_i. Thus, s_i can be expressed as:
s_i = (m_i(t), f(V_i), g_{y_i}).
Action
In this embodiment, in each state s_i there may be three actions a_i:
(1) keep m unchanged;
(2) increase m by a fixed step size; and
(3) decrease m by a fixed step size.
All actions and the return associated with each action are considered during training, in order to arrive at better decisions through training.
Preferably, the fixed step size may be 0.25, but the value of the fixed step size is not limited thereto.
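The three actions can be sketched as follows (the function name, action labels, and clamping range are illustrative assumptions; the 0.25 step follows the preferred value above):

```python
def apply_action(margin, action, step=0.25, m_min=0.0, m_max=1.0):
    """Apply one of the three DQN actions to the current margin.
    The clamping range [m_min, m_max] is an assumption for illustration."""
    if action == 'increase':
        margin += step
    elif action == 'decrease':
        margin -= step
    # action == 'keep' leaves the margin unchanged
    return min(max(margin, m_min), m_max)
```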
Immediate return
In this embodiment, in order to make each class in the training set better distinguishable, the immediate return r_i is, for example, a function that is negatively correlated with the intra-class variance and positively correlated with the inter-class distance, so that the immediate return is larger when the inter-class distance is larger and the intra-class variance is smaller.
To this end, an R function may first be defined:
R_i = α D_i - β V_i, (8)
The R function is a measure of the intra-class and inter-class distributions of the y_i-th class in state s_i, where α and β are empirically determined values and D_i is the inter-class distance, which can, for example, be formulated as:
D_i = || W_{y_i} - W_near ||, (9)
wherein W_{y_i} denotes the L2-normalized weight vector of the last layer of the neural network associated with the y_i-th class, and W_near denotes the normalized weight vector associated with the class nearest to the y_i-th class.
Thus, the immediate return function r_i can be defined as:
r_i = R'_i - R_i, (10)
wherein R_i and R'_i are the values of the R function in states s_i and s'_i, respectively.
Note that riAnd is not limited to the above form. For example, riMay be positively correlated only with the inter-class distance, or may be negatively correlated only with the intra-class variance, or may be correlated with other variables, or may take any other form as desired.
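Equations (8) and (10) can be sketched directly; the alpha and beta defaults and the (D_i, V_i) state tuples are illustrative assumptions:

```python
# Sketch of R_i = alpha*D_i - beta*V_i (eq. (8)) and r_i = R'_i - R_i (eq. (10)).
def r_function(inter_class_distance, intra_class_variance, alpha=1.0, beta=1.0):
    """R function: rewards large inter-class distance, penalizes intra-class variance."""
    return alpha * inter_class_distance - beta * intra_class_variance

def immediate_reward(state, next_state, alpha=1.0, beta=1.0):
    """Immediate reward: improvement of the R function from s_i to s'_i.

    Each state is an assumed (inter_class_distance, intra_class_variance) pair;
    a larger inter-class distance and a smaller intra-class variance in the
    next state yield a larger reward.
    """
    return r_function(*next_state, alpha, beta) - r_function(*state, alpha, beta)
```

For example, moving from (D, V) = (1.0, 0.5) to (1.5, 0.2) gives a positive reward of 0.8 with the default coefficients.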
Fig. 6A shows a schematic diagram of the training method according to the present embodiment.
As shown in fig. 6A, a neural network 605 is pre-trained according to a fixed spacing 601 using a training set 604. The intra-class variance 603 is calculated using the pre-trained neural network 605, and the classes are grouped according to the number of intra-class images and the intra-class variance 603. The samples to be input into the DQN are then obtained 606 from the spacing, the number of samples, the intra-class variance 603 and the inter-class distance 602: the current state 607, the current action 608, the immediate reward 609, and the next state 610.
In the DQN, training 611 is performed according to the above inputs; the obtained outputs 612-614 are the values of the actual cumulative return obtained when taking each of the three actions (i.e., increasing m, decreasing m, keeping m unchanged) in the current state 607.
In each iteration, the DQN is trained 611 by making the actual cumulative return as close as possible to the target cumulative return (e.g., by calculating a loss using the difference between the actual and target cumulative returns and back-propagating). The adaptive spacing loss network is then trained 615 according to the adaptive spacing loss function.
Note that, in the present embodiment, the kinds of input and output of DQN may be increased or decreased as necessary, and are not limited to the specific forms given above.
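The "actual versus target cumulative return" training step can be illustrated with a minimal sketch of the standard Bellman target; the discount factor gamma and the squared-error loss are assumptions, and the patent's equation (2) is not reproduced here:

```python
# Minimal sketch of the DQN update target: the loss is the squared difference
# between the predicted cumulative return Q(s, a) and the standard Bellman
# target r + gamma * max_a' Q(s', a'); gamma is an assumed discount factor.
def td_target(reward, next_q_values, gamma=0.9):
    """Target cumulative return for one transition."""
    return reward + gamma * max(next_q_values)

def td_loss(q_pred, reward, next_q_values, gamma=0.9):
    """Squared difference between actual and target cumulative return."""
    diff = q_pred - td_target(reward, next_q_values, gamma)
    return diff * diff
```

Back-propagating this loss through the DQN drives the actual cumulative return toward the target, as described above.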
Preferably, after the spacing is adjusted, the spacing loss function can be calculated, for example, using the following formula:
[equation image]

wherein x_i represents the feature of the i-th sample, and x_i belongs to class y_i. The number of samples in a batch is N, and the number of classes is n.
The function P can be expressed by the following formula:

[equation image]

wherein W_j represents the j-th column of the weights W in the last fully-connected layer of the neural network model.
Preferably, the function P* may take the form of, for example, the CosineFace or ArcFace function. In particular, for CosineFace, the function P* can be expressed by the following formula:

P*_i = exp(s(cos θ_{y_i} − m)) / (exp(s(cos θ_{y_i} − m)) + Σ_{j≠y_i} exp(s cos θ_j))

For ArcFace, the function P* can be expressed by the following formula:

P*_i = exp(s cos(θ_{y_i} + m)) / (exp(s cos(θ_{y_i} + m)) + Σ_{j≠y_i} exp(s cos θ_j))

where θ_j is the angle between the feature x_i and the weight W_j, and s is the scale factor. In the embodiment of the present disclosure, m in the above two formulas is the adaptively adjusted spacing m_i(t) described above.
Note that the spacing loss function may take various forms and is not limited to those shown above.
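As a numerical sketch of the standard CosineFace and ArcFace forms of P* (this follows the published formulations; the scale s, margin m, and cosine inputs are illustrative, not the patent's exact equations):

```python
import math

def cosface_prob(cos_thetas, y, s, m):
    """CosineFace P*: subtract the margin m from the target-class cosine."""
    logits = [s * (c - m) if j == y else s * c for j, c in enumerate(cos_thetas)]
    mx = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - mx) for v in logits]
    return exps[y] / sum(exps)

def arcface_prob(cos_thetas, y, s, m):
    """ArcFace P*: add the angular margin m to the target-class angle."""
    logits = []
    for j, c in enumerate(cos_thetas):
        if j == y:
            theta = math.acos(max(-1.0, min(1.0, c)))  # clamp to acos domain
            logits.append(s * math.cos(theta + m))
        else:
            logits.append(s * c)
    mx = max(logits)
    exps = [math.exp(v - mx) for v in logits]
    return exps[y] / sum(exps)
```

In both forms, a larger margin m lowers the target-class probability during training, which forces the network to learn more separated features.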
Fig. 6B shows a flowchart of an example of the training method of adjusting the spacing by using DQN according to the present embodiment.
As shown in fig. 6B, in step S621, samples are collected using a pre-trained neural network;
in step S622, the DQN parameters are trained using the collected samples as input;
in step S623, the adaptive spacing loss network is trained according to the adaptive spacing loss function.
Fig. 6C shows a flowchart of an example of a method of collecting samples using a pre-trained neural network according to the present embodiment.
As shown in fig. 6C, in step S631, preprocessing is performed. The preprocessing may include: pre-training the neural network using a fixed spacing loss function, calculating the intra-class variances using the pre-trained neural network, and grouping the classes according to the number of intra-class images and the intra-class variances.
In step S632, samples are collected. Collecting samples may include, for example: for each group g of classes, calculating the current state s_i from the last neural network; then performing each action a_i in the action space to modify the spacing in group g; then, for each action, training the last neural network again using the modified spacing; then calculating, from this training, the next state s′_i to which group g transitions; then calculating the immediate reward r_i using the last neural network and the current neural network; and finally, recording the current state s_i, the current action a_i, the immediate reward r_i, and the next state s′_i as a sample.
Fig. 6D shows a flow chart of training DQN parameters according to the present embodiment.
As shown in FIG. 6D, in step S641, (s_i, r_i) is propagated forward;
in step S642, the DQN output (i.e., the actual cumulative return) is obtained;
in step S643, for each action a′_i in the action space, (s′_i, a′_i) is propagated forward;
in step S644, the target cumulative return (the target output) is calculated; the target cumulative return may be calculated, for example, using equation (2);
in step S645, the difference between the actual cumulative return and the target cumulative return is optimized;
in step S646, the parameters of the DQN are updated according to the result of the optimization.
Fig. 6E shows a flowchart of an example of a method of training an adaptive spacing loss network according to an adaptive spacing loss function according to the present embodiment.
As shown in fig. 6E, in step S651, a state-action table is prepared. This can be done by inputting each state in the state space into the trained DQN model to obtain the corresponding output action. Preferably, the action with the largest value among the outputs 612, 613 and 614 is selected as the output action.
In step S652, the adaptive spacing loss network is trained. The operations may include: performing normal deep CNN training; at the end of each round of training, calculating the current state for each group of classes; and then querying the state-action table and performing the associated actions to modify the spacing of each group.
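Step S651's state-action table can be sketched as a simple lookup built from a trained DQN; the `dqn` callable and the action ordering (increase, decrease, keep, matching outputs 612-614) are assumptions:

```python
# Sketch of building the state-action table (step S651): each state in the
# state space is fed to the trained DQN, and the action with the largest
# estimated cumulative return is recorded as that state's output action.
def build_state_action_table(states, dqn):
    """dqn(state) is assumed to return the three return estimates
    in the order (increase, decrease, keep)."""
    actions = ("increase", "decrease", "keep")
    return {s: actions[max(range(3), key=lambda i: dqn(s)[i])] for s in states}
```

During training (step S652), each group's current state is simply looked up in this table and the associated action modifies that group's spacing.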
By adjusting the spacing using DQN, the spacing value can be changed at each iteration, and the inter-class and intra-class distributions, etc., are taken into account; therefore, higher accuracy can be obtained compared with other methods.
Fifth embodiment
In this embodiment, the attribute may be the number of samples of the class, and the parameter associated with the spacing may be the spacing itself; for classes with smaller numbers of samples, the spacing may be adjusted by increasing the number of samples of the class through feature augmentation.
Preferably, feature augmentation is performed using a neighbor distribution. Fig. 7A shows an example of a flowchart of a method of feature augmentation using a neighbor distribution. As shown in fig. 7A, the steps of the method are as follows:
in step S701, a large class closest to the first class is found, the number of samples of which is higher than a predetermined threshold.
In step S702, the center of the feature space of the large class is found, and the residuals between the features of all samples of the large class and the center are obtained.
In step S703, the statistics of the residual distribution are obtained.
In step S704, the residual distribution within the predetermined range is retained.
In step S705, the obtained residual distribution is mapped to the class with the smaller number of samples, and the augmented feature is obtained.
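Steps S701-S705 can be sketched in pure Python; the helper name, the Euclidean truncation radius, and the use of list-based features are assumptions:

```python
# Sketch of neighbor-distribution feature augmentation (steps S701-S705):
# map the residual distribution of a nearby large class onto a small class,
# keeping only residuals within the predetermined range r.
def augment_with_neighbor(small_class_feats, large_class_feats, r):
    dim = len(large_class_feats[0])
    # Center of the large class's feature space (S702).
    center = [sum(f[d] for f in large_class_feats) / len(large_class_feats)
              for d in range(dim)]
    # Residuals between each large-class sample and the center (S702-S703).
    residuals = [[f[d] - center[d] for d in range(dim)] for f in large_class_feats]
    # Retain only residuals within the predetermined range r (S704).
    kept = [res for res in residuals if sum(x * x for x in res) ** 0.5 <= r]
    # Map the kept residuals onto every small-class feature (S705).
    return [[f[d] + res[d] for d in range(dim)]
            for f in small_class_feats for res in kept]
```

Each original small-class feature thus spawns one augmented feature per retained residual of the neighboring large class.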
It can be seen that, in this embodiment, feature augmentation is performed for a class with a smaller number of samples by utilizing the feature distribution of another class with a larger number of samples. After feature augmentation, the spacing of the class with the smaller number of samples can be calculated, for example, according to the following formula:

[equation image]

wherein S is the radius of the feature-space hypersphere and r_i is a parameter indicating the range. Since the spacing m needs to be positive, m′ is set to a fixed positive number, and r_i < S·m′. Thereby, the adjusted spacing can be obtained.
Fig. 7B shows a schematic diagram of the method. As shown in FIG. 7B, the dark filled circles 722 represent samples in the small class for which feature augmentation is desired, and the light filled circles 721 represent samples in a large class near the small class. By mapping the sample distribution of the large class onto the small class and truncating the distribution within the range r_i, samples 723 of the small class with augmented features (shown using dashed lines) can be obtained.
According to the fifth embodiment, for a class in which the number of samples is small (for example, a class containing only one sample), feature augmentation with a set range is used to appropriately expand the number of samples in the class. Compared with the case where feature augmentation is not performed, m is appropriately reduced, thereby improving the convergence speed of the training process.
Sixth embodiment
This embodiment is another embodiment of feature augmentation, which does not use the feature distribution of another class with a larger number of samples. Preferably, feature augmentation is performed using a Gaussian distribution. Fig. 8A shows a flowchart of a method of feature augmentation using a Gaussian distribution. As shown in fig. 8A, the steps of the method are as follows:
in step S801, the mean and variance of the gaussian distribution are set.
In step S802, the feature center of the gaussian distribution is obtained using all the features of the class having the smaller number of samples.
In step S803, a new feature within a predetermined range is generated.
After the feature augmentation, the spacing may be adjusted using a method similar to that of the fifth embodiment.
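Steps S801-S803 can be sketched similarly; the truncation rule, seed, and parameter values are illustrative assumptions:

```python
import random

# Sketch of Gaussian feature augmentation (steps S801-S803): generate new
# features around the small class's feature center from a Gaussian with the
# set variance, truncated to a predetermined range r.
def augment_with_gaussian(small_class_feats, sigma, r, n_new, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility
    dim = len(small_class_feats[0])
    # Feature center of the small class (S802).
    center = [sum(f[d] for f in small_class_feats) / len(small_class_feats)
              for d in range(dim)]
    out = []
    while len(out) < n_new:
        offset = [rng.gauss(0.0, sigma) for _ in range(dim)]
        # Keep only offsets within the predetermined range r (S803).
        if sum(x * x for x in offset) ** 0.5 <= r:
            out.append([center[d] + offset[d] for d in range(dim)])
    return out
```

Unlike the fifth embodiment, no neighboring large class is needed; the distribution is synthesized from the small class's own center.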
Fig. 8B shows a schematic diagram of the method. As shown in fig. 8B, similarly to the neighbor distribution, by mapping the Gaussian distribution onto the small class and truncating the distribution within the range r, samples 821 of the small class with augmented features can be obtained.
Note that the method of feature augmentation is not limited to the two methods described above; any suitable feature augmentation method may be employed as needed. For example, for the fifth embodiment, a similarity distribution may be used instead of a neighbor distribution, or an arbitrary class with a larger number of samples may be selected; for the sixth embodiment, other statistical distributions may be used instead of a Gaussian distribution. Also, the method of adjusting the spacing after feature augmentation is not limited to the illustrated formula; any suitable formula may be employed as needed.
Seventh embodiment
In this embodiment, the attribute may be the difficulty of the class, and the difficulty distribution of the classes is used to adjust a parameter associated with the spacing. For example, the parameter associated with the spacing may be the class center similarity. Furthermore, in the present embodiment, an additional loss function is used instead of the spacing loss function.
Preferably, the difficulty distribution of the classes may be determined manually, for example by directly specifying the difficulty of each class. Alternatively or additionally, the difficulty distribution of the classes may be determined based on the class center distances or class center similarities between the classes; for example, the difficulty value of a class whose center is closer to that of another class (e.g., twins, or similar face images, etc.) is larger.
Preferably, for classes with difficulty above a predetermined threshold, the penalty is calculated based on class center similarity.
Fig. 9A shows a flowchart of an example of training a model according to the class center similarity according to the present embodiment.
As shown in fig. 9A, in step S901, n samples are taken for each class.
In step S902, the similarity between every two classes is calculated using the selected n samples. Preferably, the similarity may be expressed by the class center similarity or the distance between the feature centers of the classes, but is not limited thereto.
Preferably, the class center similarity S_{i,j} between class i and class j can be expressed by the following formula:

[equation image]

wherein C_i and C_j represent the centers of the normalized features of class i and class j, and can be represented using the following formula:

[equation image]

wherein x_t is the feature of the t-th sample randomly selected from class i, which has n_i samples in total, from which a fixed number n of samples are randomly selected. Thus, even if some noise exists in the training set, C_i and C_j remain relatively robust.
It can be seen that, for the class center similarity S_{i,j}, the greater the similarity between classes i and j, the smaller the value of S_{i,j}.
In step S903, the m classes corresponding to the highest similarities are selected as the classes with higher difficulty. Preferably, in this step, the similarities are sorted, for example, in descending order. Then, the classes corresponding to the first (highest) similarity in the sequence are placed into the set of classes with higher difficulty, then those corresponding to the second highest similarity, and so on, until there are m classes in the set of more difficult classes.
In step S904, the model is trained using the more difficult class and the central dispersion loss function.
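The selection procedure of steps S901-S903 can be sketched as follows; cosine similarity of normalized class centers is used as an assumed stand-in for the patent's class center similarity (whose exact formula is given only as an image), and all names are illustrative:

```python
# Sketch of steps S901-S903: compute a normalized center per class from the
# sampled features, then collect the m classes involved in the most similar
# (closest-center) pairs as the "hard" classes.
def class_center(feats):
    dim = len(feats[0])
    c = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
    norm = sum(x * x for x in c) ** 0.5 or 1.0
    return [x / norm for x in c]

def hardest_classes(class_feats, m):
    """class_feats maps class id -> list of feature vectors."""
    centers = {k: class_center(v) for k, v in class_feats.items()}
    keys = sorted(centers)
    pairs = [(sum(a * b for a, b in zip(centers[i], centers[j])), i, j)
             for ki, i in enumerate(keys) for j in keys[ki + 1:]]
    hard = []
    for _, i, j in sorted(pairs, reverse=True):  # most similar pairs first
        for k in (i, j):
            if k not in hard:
                hard.append(k)
        if len(hard) >= m:
            break
    return hard[:m]
```

The classes returned here are the ones subsequently trained with the center dispersion loss in step S904.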
Preferably, the center dispersion loss function can be expressed as a function of the class center similarity:

[equation image]

wherein S_k denotes the k-th largest class center similarity.
Alternatively, the center dispersion loss function may be expressed as a function of the class center distance, or as any other function capable of performing the above-described function.
It can be seen that the total cost (i.e., the loss) is the average of the similarities calculated for all pairwise combinations of the m classes.
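The "average over pairwise combinations" reading of the center dispersion loss can be sketched as follows (an assumed stand-in using cosine similarity of normalized centers, since the patent's formula is given only as an image):

```python
# Sketch of the center dispersion loss: the mean pairwise similarity over the
# m hard-class centers. Minimizing it pushes the class centers apart.
def center_dispersion_loss(centers):
    n = len(centers)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    sims = [sum(a * b for a, b in zip(centers[i], centers[j])) for i, j in pairs]
    return sum(sims) / len(sims)
```

With mutually orthogonal centers the loss is 0; coincident centers push it toward 1, so gradient descent drives hard classes apart.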
Therefore, the training method according to this embodiment can effectively extract classes with high difficulty (such as quite similar face images) and distinguish between them, thereby improving training accuracy.
Note that although the center dispersion loss function is used in this embodiment to train classes with high difficulty, its application is not limited thereto; it may be used in training any classes as needed, for example, classes with a small number of samples.
Fig. 9B shows a flowchart of another example of a method of training a neural network according to a seventh embodiment of the present disclosure.
As shown in fig. 9B, the steps of the method are as follows:
in step S911, the first training set is divided into two parts.
Preferably, the first part consists of less difficult classes and the second part consists of more difficult classes (i.e. the harder classes are extracted from the first training set for further processing).
Preferably, the difficulty of the class can be measured by calculating the class center similarity.
Note that the manner of dividing the first training set is not limited to the method shown above. For example, alternatively or additionally, classes in the first training set having a class center similarity greater than a predetermined threshold may be classified as the second part, and classes having a class center similarity less than or equal to the threshold may be classified as the first part.
Note that although it is described in the present embodiment that the first training set is divided by the difficulty of the class, the division criterion of the first training set is not limited thereto, but may be divided by any criterion as necessary. For example, the first training set may be divided into a first part and a second part according to a distribution of the number of samples within the class.
After the division of the first training set is completed in step S911, the process proceeds to step S912.
In step S912, for the first part, because its difficulty is lower, the model of the neural network may be trained using a training algorithm according to any other embodiment of the present disclosure or any known training algorithm. The process then proceeds to step S913.
Since the classes of the second part are difficult, a method capable of effectively distinguishing between the classes is needed. Therefore, in step S913, for the second part, the model of the neural network is trained using the central dispersion loss function, with the trained model obtained in step S912 as the initial model.
Fig. 9C shows a flowchart of another example of a method of training a neural network according to a seventh embodiment of the present disclosure.
As shown in fig. 9C, there is a first training set and a second training set, where each class in the second training set is a harder class. Preferably, the classes in the second training set can be extracted from some base training set using the method described in fig. 9A, but is not limited thereto. Alternatively, the classes in the second training set may be classes in the base training set having a difficulty above a predetermined threshold. Further, any suitable method may be used to obtain the second training set as desired.
Preferably, the base training set may be the first training set. Alternatively, the base training set may partially coincide with the first training set (the intersection is not empty). Alternatively, the base training set may be completely different from the first training set (the intersection is empty). For example, the second training set may be composed of images of people that are completely different from the people in the first training set.
The method comprises the following steps:
in step S921, for the first training set, training may be performed using a training algorithm according to any other embodiment of the present disclosure or any known training algorithm, resulting in a trained model.
In step S922, for the second training set, since the difficulty of class is large, the trained model obtained in step S921 may be used as an initial model to train by calculating a loss using the central dispersion loss function.
Eighth embodiment
In this embodiment, how to remove the influence of noise in the training set, and thereby improve the accuracy of training, is discussed. Noise is often present in the training set and can affect the accuracy of training. For this reason, a more robust function needs to be used to calculate the loss.
In the present embodiment, in addition to the parameters associated with the spacing in the foregoing embodiments, the loss is calculated from the type of noise in the training set.
There are three main types of noise in the training set:
The first type: label flips. In this case, an image in the training set belongs to one class but has been incorrectly labeled as another class.
The second type: an outlier. In this case, the image does not actually belong to any class in the training set. Only because the image is very similar to a class in the training set, it is predicted by the model as the similar class at the time of training. For example, the training set should originally include only one of the twins, while the image of the other of the twins is incorrectly labeled into that class of the training set.
The third type: complete dirty data. In this case, the image is erroneously labeled as a class in the training set, and in fact, the image does not belong to any class in the training set, nor even can it be any sample in the object recognition. For example, the noise in the image is too large due to the image being too blurred, the image resolution being too low, etc., and thus cannot be used as a sample of a certain class at all. Therefore, the mis-marked non-sample image is considered as noise. The third type of noise differs from the second type of noise in that the third type of noise is completely dirty data and cannot be included in any class in the training set during the training process.
It can be found experimentally that, when the model is trained well, the predicted probability P of a class in the training set can be used to mask the third type of noise, i.e., "completely dirty data". This is because, when the model has been trained sufficiently, the prediction probability P of the labeled class of a third-type noise sample is very small, whereas this property does not occur in other types of noisy data or in clean data. Thus, the third type of noise can be eliminated by gradually removing the data exhibiting this effect during training.
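The masking of third-type ("completely dirty") samples by their predicted probability can be sketched as follows; the threshold value and function name are assumptions:

```python
# Sketch of masking completely dirty data: once the model is well trained,
# samples whose predicted probability for their labeled class stays very
# small are dropped from the training set (threshold is illustrative).
def filter_dirty_samples(samples, probs, threshold=1e-3):
    """Keep sample i only if the model assigns its labeled class a
    probability above the threshold."""
    return [s for s, p in zip(samples, probs) if p > threshold]
```

Applying this filter gradually during training removes the third type of noise, while first- and second-type noise (whose probabilities do not collapse this way) is handled by the convex label combination described below.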
In one embodiment, the first type of noise, the second type of noise, and the correctly labeled training data may be distinguished only by the prediction probability P.
In another embodiment, the above two types of noise may be reduced by the following method.
It can be found experimentally that the labels of the first and second types of noise may be very inconsistent with the model's predictions. Furthermore, as training progresses, the confidence of the model's predictions increases; in this way, the impact of the first and second types of noise in the training set can be reduced.
Therefore, in consideration of the above factors, the interference of the above three types of noise can be reduced by dynamically adjusting the learning criterion as described below.
At the same time, the learning criterion assumes that, if the third type of noise is not contained in the training set, the "true" label is a convex combination of the original label with probability α and the current predicted class with probability (1 − α).
Then, the noise suppression loss function for reducing the influence of noise can be expressed using the following formula:

[equation image]

where N is the number of training samples in a batch, and α(P) and β(P) control the degree of combination, with

[equation image]
the parameters α and t are set in stages during the training, i.e. α is set to 1 and t is set to 0 at the beginning of the training, the model itself can distinguish noise when it is trained relatively well, so α is reduced slightly, e.g. to 0.9, and t is set to a small positive number, e.g. 0.001.
Here, P_{y_i} is the prediction probability of the "true" class (corresponding to the initial label), and P_{ŷ_i} is the prediction probability of the current predicted class. Specifically, the loss can be calculated using a noise-suppression Softmax loss function:

[equation image]

wherein x_i is the feature of the i-th sample, y_i is the "true" training label, and ŷ_i is the current predicted label, which can be expressed by the following formula:

[equation image]
during the training process, the feature dimension d may be set to 512, for example.
Figure BDA0001864498940000226
Representing weights in the last fully-connected layer of a neural network
Figure BDA0001864498940000227
To (1) aj columns, n is the number of classes in the training set,
Figure BDA0001864498940000228
is the bias term.
Preferably, the noise suppression loss function can also be calculated using, for example, the ArcFace function:

[equation image]

[equation image]

wherein ||x_i|| is scaled to the hypersphere radius s, and m is the spacing.
Thus, the loss can be calculated from the noise type and the spacing in the training set.
Fig. 10A shows a flowchart of an example of a method of calculating a loss from the noise type and the spacing in the training set according to the present embodiment.
In step S1001, features are extracted from a first training set using an initial model of a neural network.
In step S1002, the type of noise in the training set is analyzed.
In step S1003, it is determined whether or not the noise in the samples of the training set is the third type of noise. If so, the sample is discarded in step S1004. If not, the process proceeds to step S1005.
In step S1005, the noise suppression loss function is calculated from a combination of the prediction probability of the "true" label and the prediction probability of the current predicted label, using the adjusted spacing.
In step S1006, the initial model is updated according to the calculated loss using a back propagation algorithm, thereby obtaining a trained model.
Note that this embodiment may be combined with any embodiment of the present disclosure; that is, the adjusted spacing obtained by any embodiment of the present disclosure may be substituted for the spacing m in this embodiment, and the loss may then be calculated by the noise suppression loss function, thereby achieving a noise suppression effect.
Preferably, the model obtained by the training method of this embodiment may be used as the initial model in step S922 or step S913 of the seventh embodiment to train the classes whose difficulty is higher than the threshold. Since noise interference is eliminated in the initial stage, such a training method can further suppress the noise in the training set and obtain an accurate trained model.
Example of combination of the seventh embodiment and the eighth embodiment
Fig. 10B shows a schematic diagram of an example of a method of training a neural network of the combination of the seventh embodiment and the eighth embodiment of the present disclosure.
As shown in fig. 10B, classes in the training set are divided 1013 into a first part (head data) 1011 and a second part (tail data) 1012. In this example, the head data 1011 is a portion where the number of intra-class samples is large, and the tail data 1012 is a portion where the number of intra-class samples is small. But is not limited thereto.
The head data 1011 is first used for training, resulting in a base model; the resulting base model is then used as two identical initial models 1016 and 1017, with the head data 1011 being input 1014 into the initial model 1016 and the tail data 1012 being input 1015 into the initial model 1017. The model 1016 with the head data 1011 is trained using the noise suppression loss function 1018 described in the eighth embodiment; the model 1017 with the tail data 1012 is trained using the central dispersion loss function 1019 described in the seventh embodiment, with the hard classes extracted using the method described in the seventh embodiment with reference to fig. 9A.
Preferably, in training the model using the tail data, the tail data is gradually input 1015 into the model 1017 in an iterative manner.
Preferably, the weights are shared 1020 between the two models in order to make the models stable and both models able to achieve optimal performance.
By using the method described in the above example, a model that can suppress noise and effectively distinguish difficult classes at the same time can be obtained, and the model training efficiency and the recognition accuracy are greatly improved.
Ninth embodiment
In this embodiment, the attribute may be the importance of the class, and in this embodiment, the importance distribution of the class is used to adjust a parameter associated with the spacing, which may be, for example, a relationship between feature spaces.
Preferably, the importance distribution of the classes can be determined manually. Alternatively, the importance distribution of a class may be determined based on the query frequency of the class; for example, a person belonging to a certain class frequently enters and exits a certain location and is therefore queried more frequently, so that class is given higher importance.
It is known that, in a p-dimensional space, there are p vectors that are mutually orthogonal. When two vectors are orthogonal to each other, the degree of discrimination between them is best. Although not all classes can be made orthogonal, because there are thousands of classes in the training set, the classes with importance above the threshold may be trained using an orthogonal method.
Preferably, the loss may be further calculated by making the feature space of a class whose importance is higher than a predetermined threshold as orthogonal as possible to the feature space of any other class. As shown in fig. 11, for a class 1101 whose importance is higher than the threshold and other classes 1102, 1103 in the feature space, θ is defined as the angle between the weights of class 1101 and class 1102. Class 1101 can be better distinguished from class 1102 when θ approaches π/2. The same holds for classes 1101 and 1103.
The loss may be further calculated, for example, by an orthogonal loss function.
Preferably, the orthogonal loss function makes the weight vectors (which represent the centers of the classes) of the last fully-connected layer in the neural network as orthogonal as possible by using the first and second moments. Preferably, the weights can also be made as orthogonal as possible by using higher-order moments. In the following example, only the first and second moments are considered.
The first and second moments of the inner products between the weights of the last fully-connected layer in the neural network can be calculated, for example, by the following formulas:

[equation image]

[equation image]

wherein n is the number of classes, and w_i is the i-th column of the weight W of the last fully-connected layer in the neural network.
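The two moments can be sketched as the pairwise mean and mean of squares of the weight inner products (assumed forms, since the patent's formulas are given only as images):

```python
# Sketch of the first and second moments of the pairwise inner products
# between the columns w_i of the last fully-connected layer's weights.
def weight_moments(weights):
    n = len(weights)
    inner = [sum(a * b for a, b in zip(weights[i], weights[j]))
             for i in range(n) for j in range(i + 1, n)]
    m1 = sum(inner) / len(inner)                 # first moment: mean inner product
    m2 = sum(v * v for v in inner) / len(inner)  # second moment: mean squared inner product
    return m1, m2
```

For a perfectly orthogonal weight set both moments are zero, which is exactly the regime the orthogonal loss drives the network toward.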
T. Cai, J. Fan and T. Jiang, in the 2013 JMLR article "Distributions of Angles in Random Packing on Spheres", have indicated that random vectors in a high-dimensional space are almost always nearly orthogonal.
Fig. 12 shows, by way of example, the distribution of the inner products of the weights of the last fully-connected layer in a neural network trained by ArcFace and the inner products of random vectors. In this example, the dimension of the vectors may be, for example, 512, and the number of vectors is 20000. As can be seen from fig. 12, the discrimination between the weight vectors in the ArcFace-trained neural network is inferior to that of random vectors, and there is room for improvement.
Now, the first moment, serving as a first loss, is minimized to constrain the axis of symmetry of the curve in fig. 12 to be around 0. However, this alone is not sufficient to distinguish the weights w. For this purpose, the second moment is taken into account as a further loss to make the curve steeper. By combining the two, the neural network can be constrained so that the weights are as orthogonal as possible and as close as possible to the distribution of high-dimensional random vectors.
Specifically, the orthogonal loss function can be expressed using the following formula:

[equation image]

where α and β are balance coefficients.
Preferably, the orthogonal loss function can be used in combination with other loss functions to train the neural network model. The combined loss function can be expressed using the following formula:

[equation image]

where both α and β are positive numbers, and the remaining term is another loss function, such as a Softmax loss function or an angular/cosine spacing loss function.
Preferably, the orthogonal loss function can be optimized using the standard stochastic gradient descent method, since the problem is convex. The corresponding gradient is as follows:
Figure BDA0001864498940000261
Preferably, the back propagation algorithm can be implemented, for example, using the following formulas:

$$\frac{\partial L}{\partial w_i} = \frac{\partial L_{other}}{\partial w_i} + \frac{\partial L_{orth}}{\partial w_i},$$

$$w_i \leftarrow w_i - \eta \frac{\partial L}{\partial w_i},$$

where $\eta$ is the learning rate.
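The gradient of the orthogonal regularizer can be verified numerically. The sketch below is an assumption-laden illustration: it treats the rows of the weight matrix as free parameters (the Jacobian of any normalization step is omitted) and checks the analytic gradient against a central finite difference.

```python
import numpy as np

def orth_loss_and_grad(W, alpha=1.0, beta=1.0):
    """Return the orthogonal loss and its gradient w.r.t. the rows of W.

    Loss = scale * sum_{i != j} (alpha * <w_i, w_j> + beta * <w_i, w_j>^2),
    with scale = 1 / (C * (C - 1)); rows of W are treated as free parameters.
    """
    C = W.shape[0]
    G = W @ W.T
    mask = ~np.eye(C, dtype=bool)
    scale = 1.0 / (C * (C - 1))
    off = G[mask]
    loss = scale * (alpha * off.sum() + beta * (off ** 2).sum())
    # d loss / d w_i = 2 * scale * sum_{j != i} (alpha + 2*beta*<w_i, w_j>) * w_j
    grad = 2.0 * scale * (((alpha + 2.0 * beta * G) * mask) @ W)
    return loss, grad

# Finite-difference check of one coordinate of the gradient.
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 8))
loss, grad = orth_loss_and_grad(W)
eps = 1e-6
Wp = W.copy(); Wp[2, 3] += eps
Wm = W.copy(); Wm[2, 3] -= eps
num = (orth_loss_and_grad(Wp)[0] - orth_loss_and_grad(Wm)[0]) / (2 * eps)
print(abs(num - grad[2, 3]) < 1e-6)  # True
```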
FIG. 13 illustrates, by way of example, a flow chart of a method of training a neural network using an orthogonal loss function in accordance with the present embodiments.
As shown in fig. 13, in step S1301, the first loss $L_{m_1}$ is calculated. In step S1302, the second loss $L_{m_2}$ is calculated. In step S1303, the joint loss $L = L_{other} + \alpha L_{m_1} + \beta L_{m_2}$ is calculated. In step S1304, the back propagation error $\partial L / \partial w$ is calculated. In step S1305, the neural network parameters (e.g., the weights $w$) are updated.
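The five steps above can be sketched as a small NumPy loop that optimizes randomly initialized class weights with the orthogonal regularizer alone — a stand-in for full network training, with the learning rate, problem sizes and re-normalization step chosen for illustration rather than taken from the patent.

```python
import numpy as np

def moments(W):
    """S1301/S1302: first and second moments of pairwise inner products."""
    off = (W @ W.T)[~np.eye(len(W), dtype=bool)]
    return off.mean(), (off ** 2).mean()

rng = np.random.default_rng(1)
W = rng.standard_normal((20, 64))
W /= np.linalg.norm(W, axis=1, keepdims=True)

alpha, beta, lr = 1.0, 1.0, 0.5
m1, m2 = moments(W)
loss0 = alpha * m1 + beta * m2                       # S1303: joint loss
for _ in range(200):
    C = len(W)
    G = W @ W.T
    mask = ~np.eye(C, dtype=bool)
    # S1304: back-propagation error of the orthogonal loss
    grad = 2.0 / (C * (C - 1)) * (((alpha + 2.0 * beta * G) * mask) @ W)
    W -= lr * grad                                   # S1305: update the weights
    W /= np.linalg.norm(W, axis=1, keepdims=True)    # keep weights on the unit sphere
m1, m2 = moments(W)
loss1 = alpha * m1 + beta * m2
print(loss1 < loss0)  # True: the weights have become more nearly orthogonal
```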
According to this embodiment, the classes having importance higher than the predetermined threshold can be made as orthogonal as possible to the other classes in the training set, thereby improving training accuracy and recognition accuracy.
Tenth embodiment
In the present embodiment, a method of obtaining a plurality of different trained models by using a training method according to various embodiments of the present disclosure is described.
FIG. 14 shows a flow chart of an example of a multi-model training method according to the present embodiment.
As shown in fig. 14, the steps of the method are as follows:
in step S1401, a plurality of different training sets are obtained from a base training set.
In step S1402, for each of the plurality of different training sets, training is performed using a training method according to any one of the various embodiments of the present disclosure to obtain a plurality of different trained models.
Preferably, each of the plurality of different training sets is obtained by performing at least one of the following operations on a target region of the subject in the base training set: cropping and filtering. For example, the base training set may be a training set of face images, from which a training set of eye images, a training set of nose images, and a training set of mouth images may be derived by cropping, and so on.
According to this embodiment, because each cropped training set concentrates on a particular region of the object, details are discriminated more accurately, which improves the accuracy of the trained model. Moreover, since a plurality of models are trained independently and then used together for object recognition, the error incurred by relying on a single model is effectively reduced, and recognition accuracy is improved.
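A minimal sketch of how region-specific training sets might be derived from one base set of aligned face images by cropping fixed boxes. The image size and box coordinates are hypothetical, chosen only to illustrate step S1401.

```python
import numpy as np

# Hypothetical crop boxes (top, bottom, left, right) for 112x112 aligned faces.
REGIONS = {
    "eyes":  (28, 52, 16, 96),
    "nose":  (40, 80, 36, 76),
    "mouth": (72, 100, 28, 84),
}

def derive_training_sets(images):
    """Step S1401 sketch: images of shape (n, 112, 112, 3) -> one cropped
    training set per facial region."""
    return {name: images[:, t:b, l:r, :]
            for name, (t, b, l, r) in REGIONS.items()}

faces = np.zeros((4, 112, 112, 3), dtype=np.uint8)  # placeholder base set
sets = derive_training_sets(faces)
print(sets["eyes"].shape)  # (4, 24, 80, 3)
```

Each derived set would then be trained independently in step S1402 to obtain one model per region.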
Eleventh embodiment
In the present embodiment, one example of a combination of training methods according to an embodiment of the present disclosure is described.
Fig. 15 shows a flow chart of an example of a method according to an eleventh embodiment of the present disclosure.
As shown in fig. 15, in step S1501, features are extracted using an initial model of a neural network.
In step S1502, a more difficult class is obtained.
Preferably, the more difficult class can be obtained using the method described with reference to fig. 9A in the seventh embodiment.
In step S1503, the noise suppression loss function is calculated using the method in the eighth embodiment.
Preferably, the noise suppression loss function may be calculated using the margin obtained by the training method according to any one or any combination of the embodiments of the present disclosure.
In step S1504, the initial model is updated according to the calculated loss using a back propagation algorithm.
It should be noted that this combination is merely an example, which is intended to give the person skilled in the art an exemplary teaching of a combined implementation, so that other combined implementations are possible on the basis thereof. Thus, the manner of combining the embodiments of the present disclosure is not limited to the manner shown in fig. 15, of course, but any two or more of the embodiments of the present disclosure may be combined as needed.
Twelfth embodiment
In the present embodiment, an object recognition method will be described. The object recognition method includes extracting features from an input object using a trained model obtained using a training method according to any one or any combination of embodiments of the present disclosure to perform object recognition.
Preferably, the object recognition is performed by calculating a similarity between the extracted features and features of the real object.
Fig. 16 is a flowchart showing an example of an object recognition method according to the twelfth embodiment.
As shown in fig. 16, in step S1601, an image is input.
In step S1602, the image is preprocessed. The pre-processing may include, for example, face detection, face alignment, and the like.
In step S1603, feature extraction is performed using a neural network model. Preferably, the neural network model may be a trained neural network model obtained by a method according to any embodiment or any combination of embodiments of the present disclosure.
In step S1604, the input image is recognized by calculating the similarity between the features extracted from the object in the input image and the features of the real object.
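The similarity-based recognition in step S1604 can be sketched as a cosine-similarity search over a gallery of enrolled features. The dict-based gallery and identity names are illustrative assumptions; in practice the features would come from the trained network of step S1603.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(query_feature, gallery):
    """Return the enrolled identity whose feature best matches the query."""
    scores = {name: cosine_similarity(query_feature, feat)
              for name, feat in gallery.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

gallery = {
    "alice": np.array([1.0, 0.0, 0.0]),
    "bob":   np.array([0.0, 1.0, 0.0]),
}
name, score = identify(np.array([0.9, 0.1, 0.0]), gallery)
print(name)  # alice
```

A threshold on the best score would typically decide whether the query matches any enrolled identity at all.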
According to this embodiment, because of the improved accuracy of the neural network model according to the present disclosure, the accuracy of object recognition is improved accordingly.
It should be noted that the steps of the methods according to the various embodiments of the present disclosure are not necessarily performed in the order shown, but may be performed in parallel or in other orders.
Thirteenth embodiment
As shown in fig. 17, in the present embodiment, an apparatus 1700 for training a neural network is described. The apparatus 1700 includes: feature extraction means 1701 for extracting features from a first training set using an initial model of the neural network; margin parameter adjustment means 1702 for adjusting a margin-associated parameter between a first class and at least one other class in accordance with a distribution of at least one attribute of each class in the first training set; loss calculation means 1703 for calculating a loss in accordance with the adjusted margin-associated parameter; and model update means 1704 for updating the initial model using the calculated loss and a back propagation algorithm to obtain a trained model.
Fourteenth embodiment
As shown in fig. 18, in the present embodiment, an object recognition apparatus 1800 is described. The object recognition apparatus includes: an apparatus 1801 for training a neural network according to a thirteenth embodiment of the present disclosure; a feature extraction means 1802 configured to extract features from an input object using a trained model; and a similarity calculation means 1803 configured to calculate a similarity between the extracted features and the features of the real object.
The devices described in the thirteenth and fourteenth embodiments above are exemplary and/or preferred devices for implementing the methods described in this disclosure, and can achieve effects similar to those of the corresponding methods. These means may be hardware elements, such as field programmable gate arrays, digital signal processors, application specific integrated circuits or computers, and/or software means, such as computer readable programs. The apparatus for performing the various steps has not been described in detail above; however, as long as there is a step of performing a certain process, there may be corresponding means (implemented in hardware and/or software) for implementing the same process. All technical solutions defined by combinations of the described steps and the means corresponding to those steps are included in the disclosure of the present application, as long as the technical solutions they constitute are complete and applicable.
Further, the above-described apparatus constituted by various means may be incorporated as a functional module into a hardware device such as a computer. In addition to these functional modules, the computer may of course have other hardware or software components.
Fifteenth embodiment
In the present embodiment, a computer apparatus that implements the methods and apparatus of the present disclosure is described. Fig. 19 is a block diagram showing the hardware configuration of a computer system capable of implementing an embodiment of the present disclosure.
As shown in fig. 19, the computer system includes a processing unit (CPU) 1901, a read only memory 1902, a random access memory 1903, and an input/output interface 1905 connected via a system bus 1904, as well as an input unit 1906, an output unit 1907, a storage unit 1908, a communication unit 1909, and a drive 1910 connected via the input/output interface 1905. The program may be recorded in advance in the ROM (read only memory) 1902 or the storage unit 1908 as a recording medium built into the computer. Alternatively, the program may be stored (recorded) in a removable medium 1911. Here, the removable medium 1911 includes, for example, a flexible disk, a CD-ROM (compact disc read only memory), an MO (magneto-optical) disk, a DVD (digital versatile disc), a magnetic disk, a semiconductor memory, and the like.
The input unit 1906 is used for inputting a user request and is configured with a keyboard, a mouse, a touch screen, a microphone, a camera, and the like. In addition, the output unit 1907 is configured with an LCD (liquid crystal display), a speaker, and the like.
The communication unit 1909 may be, for example, a wireless communication unit comprising at least one transceiver module and a positioning module. The transceiver module is used to send requests to the remote server and receive responses from the remote server. The positioning module is, for example, a GPS module for acquiring a position.
The storage unit 1908 or the ROM 1902 stores images, audio, and the like. The RAM 1903 may store temporary state information and intermediate calculation results.
Further, in addition to the configuration in which the program is installed from the above-mentioned removable medium 1911 to the computer system through the drive 1910, the program may be downloaded to the computer system through a communication network or a broadcast network to be installed in the built-in storage unit 1908. In other words, the program may be transmitted from a download point to the computer system by a satellite for digital satellite broadcasting, for example, in a wireless manner, or may be transmitted to the computer system by a wired manner through a network such as a LAN (local area network) or the internet.
If a command is input to the computer system via the input/output interface 1905 by a user operation or the like on the input unit 1906, the CPU 1901 executes a program stored in the ROM 1902 in accordance with the command. Alternatively, the CPU 1901 loads a program stored in the storage unit 1908 into the RAM 1903 and executes it.
Accordingly, the CPU 1901 performs processing according to the above-mentioned flowcharts, or the processing performed by the configurations shown in the above-mentioned block diagrams. Then, if necessary, the CPU 1901 outputs the result of the processing from the output unit 1907, transmits it from the communication unit 1909, or records it in the storage unit 1908, for example, through the input/output interface 1905.
In addition, the program may be executed by a computer (processor). In addition, the program may be processed by a plurality of computers in a distributed manner. In addition, the program may be transferred to a remote computer for execution.
The computer system shown in FIG. 19 is illustrative only and is in no way intended to be limiting of the present disclosure, its application, or uses. The computer system shown in fig. 19 may be implemented in any embodiment, as a stand-alone computer, or as a processing system in a device, from which one or more unnecessary components may be removed and to which one or more additional components may be added.
In one example, a computer system is implemented as an apparatus for training a neural network.
In yet another example, a computer system is implemented as an apparatus for identifying an object. The apparatus comprises a processor and a memory, the memory storing a computer program which, when executed by the processor, is capable of causing the apparatus to perform a method according to an embodiment of the disclosure.
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination thereof. The order of the method steps described above is merely illustrative, and the method steps of the present disclosure are not limited to the order specifically described above unless explicitly stated otherwise. Furthermore, in some embodiments, the present disclosure may also be implemented as a computer program stored in a computer-readable storage medium. The computer program, when executed by the processor 1901, is capable of causing the processor 1901 to perform a method in accordance with any embodiment or any combination of embodiments of the present disclosure. Thus, the present disclosure also covers a computer-readable storage medium storing a computer program for implementing the method according to the present disclosure.
Although some specific embodiments of the present disclosure have been described in detail by way of examples, it should be understood by those skilled in the art that the foregoing examples are illustrative only and are not limiting upon the scope of the disclosure. It will be appreciated by those skilled in the art that the above-described embodiments may be modified without departing from the scope and spirit of the disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (23)

1. A method of training a neural network, comprising:
extracting features from a first training set using an initial model of a neural network;
adjusting a parameter associated with a margin between a first class and at least one other class according to a distribution of at least one attribute of the classes in the first training set;
calculating a loss from the adjusted parameter associated with the margin; and
updating the initial model using the calculated loss and a back propagation algorithm to obtain a trained model.
2. The method of claim 1, wherein the at least one attribute is a number of samples of a class in the first training set.
3. The method according to any of claims 1-2, wherein the parameter associated with the margin is the margin itself, and the method comprises calculating the loss using a margin loss function.
4. The method of claim 3, wherein the adjusting comprises: determining the margin directly from the number of samples of the first class such that the margin is inversely related to the number of samples.
5. The method of claim 3, wherein the adjusting comprises: setting the margin as a function of the number of samples of the first class such that the margin is inversely related to the number of samples.
6. The method of claim 3, wherein the adjusting comprises: determining a trend of change of the margin from the current margin, the number of samples of the class, the intra-class variance and the inter-class distance, using a Q-learning function.
7. The method of claim 3, wherein the adjusting comprises: adjusting the margin by increasing the number of samples of the first class through feature augmentation.
8. The method of claim 7, wherein feature augmentation is performed using a neighbor distribution.
9. The method of claim 7, wherein feature augmentation is performed using a Gaussian distribution.
10. The method of claim 1, wherein the attribute is a difficulty of a class in the first training set.
11. The method of claim 10, wherein the difficulty level can be determined manually or based on a class center distance or a class center similarity between the first class and at least one other class.
12. The method according to any of claims 10-11, wherein the parameter associated with the margin is a class-center similarity, and the method comprises, for classes whose difficulty is above a predetermined threshold, calculating the loss from the class-center similarity.
13. The method of claim 1, wherein the attribute is an importance of a class in the first training set.
14. The method of claim 13, wherein the importance of a class can be determined manually or by the frequency with which the class is queried.
15. The method of claim 14, wherein the margin-related parameter is a relationship between the feature spaces of the classes, and the method comprises calculating the loss by making the feature space of a class whose importance is above a predetermined threshold as orthogonal as possible to the feature spaces of all other classes.
16. The method of claim 1, wherein the calculating a loss comprises: calculating the loss using a noise suppression loss function based on the noise type and the adjusted margin.
17. A method of training a neural network, comprising:
obtaining a plurality of different training sets from a base training set;
training using the method according to any of claims 1-16 for each of the plurality of different training sets to obtain a plurality of different trained models.
18. The method of claim 17, wherein each of the plurality of different training sets is derived from the base training set by at least one of the following operations: cropping and filtering.
19. An object recognition method, characterized by comprising:
performing object recognition using a trained model obtained according to the method of any one of claims 1 to 18.
20. An apparatus for training a neural network, comprising a processor and a memory, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 18.
21. An apparatus for training a neural network, comprising means configured to perform the steps of the method according to any one of claims 1 to 18.
22. An object recognition apparatus, characterized in that the object recognition apparatus comprises:
a device for training a neural network as claimed in claim 21;
a feature extraction device configured to extract features from an input object using a trained model; and
a similarity calculation means configured to calculate a similarity between the extracted feature and a feature of the real object.
23. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 19.
CN201811349293.9A 2018-11-14 2018-11-14 Method of training neural network, object recognition method and apparatus, and medium Pending CN111191781A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811349293.9A CN111191781A (en) 2018-11-14 2018-11-14 Method of training neural network, object recognition method and apparatus, and medium


Publications (1)

Publication Number Publication Date
CN111191781A true CN111191781A (en) 2020-05-22

Family

ID=70706977

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811349293.9A Pending CN111191781A (en) 2018-11-14 2018-11-14 Method of training neural network, object recognition method and apparatus, and medium

Country Status (1)

Country Link
CN (1) CN111191781A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111598190A (en) * 2020-07-21 2020-08-28 腾讯科技(深圳)有限公司 Training method of image target recognition model, image recognition method and device
CN112434576A (en) * 2020-11-12 2021-03-02 合肥的卢深视科技有限公司 Face recognition method and system based on depth camera
CN113989519A (en) * 2021-12-28 2022-01-28 中科视语(北京)科技有限公司 Long-tail target detection method and system



Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200522