CN114004838B - Target class identification method, training method and readable storage medium - Google Patents


Info

Publication number
CN114004838B
Authority
CN
China
Prior art keywords
feature
different granularities
target
image
Prior art date
Legal status
Active
Application number
CN202210000748.6A
Other languages
Chinese (zh)
Other versions
CN114004838A (en)
Inventor
艾国
凌明
杨作兴
房汝明
向志宏
Current Assignee
Shenzhen MicroBT Electronics Technology Co Ltd
Original Assignee
Shenzhen MicroBT Electronics Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen MicroBT Electronics Technology Co Ltd filed Critical Shenzhen MicroBT Electronics Technology Co Ltd
Priority to CN202210000748.6A
Publication of CN114004838A
Application granted
Publication of CN114004838B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a target class identification method, a training method and a readable storage medium. The method comprises the following steps: performing feature extraction on an image to be recognized to obtain a first feature vector; performing feature degradation on the first feature vector at different granularities, and performing spatial-level attention coefficient calculation on the feature vectors obtained after feature degradation at each granularity; for each feature value in each channel at each granularity, multiplying the feature value by its attention coefficient to obtain a second feature vector; performing target classification calculation with the second feature vectors at the different granularities to obtain, for each granularity, the probability that the image to be recognized contains each target class; and, for each target class, weighting the probabilities obtained at the different granularities to obtain the final probability that the image to be recognized contains that target class. The embodiment of the invention thereby further refines the granularity of target class identification.

Description

Target class identification method, training method and readable storage medium
Technical Field
Embodiments of the present invention relate to the field of image processing technologies, and in particular, to a target class identification method, a neural network training method for target class identification, a readable storage medium, and a computer program product.
Background
In many scenarios, targets need to be classified for various purposes, and a general image classification method has difficulty distinguishing different target categories with similar morphology and texture. Underground pipeline construction is one example: China's urbanization has developed rapidly in recent years, and with hundreds of millions of people flowing into cities, the pressure borne by underground pipelines has further increased. Underground pipeline construction is a very important basic task in urban construction and affects the stability of normal city operation, so the pipe network system must be overhauled in time to guarantee the stability of urban infrastructure.
At present, most methods for detecting underground pipeline defects have a robot shoot video data as it goes down the well, determine the defect types by manually screening the massive information collected, and finally generate a report. For example, an existing approach uses a single image as input to classify different pipe defects. However, this method has the following disadvantages:
First, it ignores a problem unique to pipeline defects: most key defect types are widely distributed but actually occupy a small proportion of the area, such as tree roots, cracks and staggers, and the method does not distinguish these defect types well.
Second, it classifies defects based on global features extracted from a single image, so the model does not easily capture the information that makes a defect discriminable and is easily affected by noise such as the background.
Third, when two defect types are highly similar in shape and texture, such as corrosion and scaling, the model provided by the method cannot distinguish them by relying only on overall, general semantic information.
Due to these limitations, the method is only applied to seven types of pipe defects: deformation, corrosion, scaling, stagger, deposition, leakage, and cracking. According to the industry standard Technical Regulation for Detection and Evaluation of Urban Drainage Pipelines released by China's Ministry of Housing and Urban-Rural Development, there are 17 defect types for underground drainage pipelines: blind joints, deformations, misconnections, wall debris, penetrations, corrosion, scum, scale, undulations, roots, disjointing, shedding, obstructions, faults, deposits, leaks, and cracks. Obviously, the method cannot effectively and accurately distinguish the complete set of 17 defect types. In addition, pipeline data originally comes from video data; if full-granularity defect classification (i.e., the 17 defect categories) of pipeline video data can be realized, the overhaul of urban underground pipe networks can be greatly advanced.
Disclosure of Invention
The embodiment of the invention provides a target class identification method, a neural network training method for target class identification, a readable storage medium and a computer program product, so as to refine the class granularity of target class identification and improve the identification precision of target class identification.
The technical scheme of the embodiment of the invention is realized as follows:
a method of object class identification, the method comprising:
extracting features of an image to be recognized to obtain a first feature vector, wherein the dimension of the first feature vector is C1 × H1 × W1, C1 is the preset number of channels, H1 is the preset feature length of each channel, and W1 is the preset feature width of each channel;
respectively performing feature degradation on the first feature vector under different granularities, and respectively performing attention coefficient calculation of spatial levels on feature vectors obtained after feature degradation under different granularities to obtain an attention coefficient of each feature value of each channel under different granularities;
for each characteristic value in each channel under each granularity, multiplying the characteristic value by the attention coefficient of the characteristic value to obtain a spatial enhancement characteristic value corresponding to the characteristic value, wherein all the spatial enhancement characteristic values of all the channels under each granularity form a second characteristic vector under the granularity;
respectively adopting the second feature vectors under different granularities to carry out target classification calculation to obtain the probability that the image to be recognized contains each target class under different granularities;
and for each target category, carrying out weighted calculation on the probability of the target category contained in the image to be recognized under different granularities to obtain the final probability of the target category contained in the image to be recognized.
The step of performing target classification calculation by respectively adopting the second feature vectors under different granularities comprises the following steps:
and respectively carrying out channel-based global average pooling on the second feature vectors under different granularities to obtain a global average feature value of each channel under different granularities, and respectively adopting the global average feature values of all the channels under different granularities to carry out target classification calculation under different granularities, thereby obtaining the probability that the image to be recognized contains each target category under different granularities.
The performing feature degradation on the first feature vector under different granularities respectively includes:
let the current granularity be m², then all feature values in each channel of the first feature vector are divided into m² sub-regions, each sub-region containing (H1/m) × (W1/m) feature values; for the m² feature values at the same position within the m² sub-regions, the maximum of the m² feature values is taken, so that (H1/m) × (W1/m) maxima are obtained in total; the (H1/m) × (W1/m) maxima form the feature values of the channel after feature degradation, and the feature values of all channels after feature degradation form the feature vector after feature degradation at the current granularity m², where m is an integer not less than 1.
The image to be identified is a pipeline image, and the target category is a pipeline defect category.
The pipe defect categories include: blind joints, deformations, misconnections, wall debris, penetrations, corrosion, scum, scale, undulations, roots, disjointing, shedding, obstructions, faults, deposits, leaks, and/or cracks.
A method of training a neural network for target class recognition, the method comprising:
acquiring images in a multi-frame target category identification scene as training images;
sequentially inputting the training images into a backbone network of a neural network for feature extraction to obtain a first feature vector;
respectively performing feature degradation on the first feature vectors under different granularities, and respectively inputting the feature vectors obtained after the feature degradation under different granularities into an attention module of a neural network at a spatial level corresponding to the granularity for performing attention coefficient calculation at the spatial level to obtain an attention coefficient of each feature value of each channel under different granularities;
for each characteristic value in each channel under each granularity, multiplying the characteristic value by the attention coefficient of the characteristic value to obtain a spatial enhancement characteristic value corresponding to the characteristic value, wherein all the spatial enhancement characteristic values of all the channels under each granularity form a second characteristic vector under the granularity;
respectively inputting the second feature vectors under different granularities into the full-connection layer of the neural network with the corresponding granularity for carrying out target classification calculation to obtain the probability that the image to be recognized contains each target class under different granularities;
for each target category, carrying out weighted calculation on the probability of the target category contained in the image to be recognized under different granularities to obtain the final prediction probability of the target category contained in the image to be recognized;
calculating the final prediction probability of each target category in the training image and the real probability of each target category in the training image to obtain a prediction loss value, and using the prediction loss value for updating the neural network parameters;
when the neural network converges, the neural network at that time is taken as the neural network to be finally used.
The step of inputting the second feature vectors under different granularities into the fully-connected layer of the neural network with the corresponding granularity for target classification calculation includes:
and respectively carrying out channel-based global average pooling on the second feature vectors under different granularities to obtain a global average feature value of each channel under different granularities, and respectively inputting the global average feature values of all the channels under each granularity into a full-connection layer of the neural network under the corresponding granularity to carry out target classification calculation under the corresponding granularity, thereby obtaining the probability that the image to be recognized under different granularities contains each target class.
The performing feature degradation on the first feature vector under different granularities respectively includes:
let the current granularity be m², then all feature values in each channel of the first feature vector are divided into m² sub-regions, each sub-region containing (H1/m) × (W1/m) feature values; for the m² feature values at the same position within the m² sub-regions, the maximum of the m² feature values is taken, so that (H1/m) × (W1/m) maxima are obtained in total; the (H1/m) × (W1/m) maxima form the feature values of the channel after feature degradation, and the feature values of all channels after feature degradation form the feature vector after feature degradation at the current granularity m², where m is an integer not less than 1.
After the obtaining of the image in the multi-frame target class recognition scene as the training image and before the sequentially inputting of the training image into the backbone network of the neural network for feature extraction, the method further includes:
the method comprises the steps of collecting video streams of targets to be identified, dividing the collected video streams into a preset third number of video segments, identifying target categories contained in each video segment, sampling a fourth number of frame images from each video segment as training images, and marking the probability of each target category contained in each training image on each training image, wherein for any training image, if the video segment to which the training image belongs contains one target category, the probability of marking the target category on the training image is 100%, and if the video segment to which the training image does not contain one target category, the probability of marking the target category on the training image is 0%.
A non-transitory computer readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the steps of the target class identification method as defined in any one of the above, or the steps of the training method for a neural network for target class identification as defined in any one of the above.
In the embodiment of the invention, feature degradation at different granularities is performed on the first feature vector extracted from the image to be identified, spatial-level attention coefficient calculation is performed on the feature vectors obtained after feature degradation at each granularity to obtain a spatially enhanced feature vector at each granularity, target classification calculation is performed with the spatially enhanced feature vectors at the different granularities to obtain the probability that the image to be identified contains each target category at each granularity, and then, for each target class, the probabilities obtained at the different granularities are weighted to obtain the final probability that the image to be identified contains that target class. Fine-grained classification of target classes is thereby realized, and target categories whose discriminative features are not obvious, such as those that are widely distributed but occupy a small proportion of the area, can be effectively distinguished; in addition, because multi-granularity features are used, target classes that are highly confusable with one another can be effectively distinguished. Therefore, the method further refines the category range of target category identification and improves its identification precision.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive labor.
Fig. 1 is a flowchart of a target class identification method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a training method of a neural network for target class recognition according to an embodiment of the present invention;
FIG. 3 is a flowchart of a training method for a neural network for target class recognition according to another embodiment of the present invention;
fig. 4 is a schematic structural diagram of an object class identification apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a training apparatus for a neural network for target class recognition according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not explicitly listed or inherent to such process, method, article, or apparatus.
The technical solution of the present invention will be described in detail with specific examples. Several of the following embodiments may be combined with each other and some details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 1 is a flowchart of a target class identification method according to an embodiment of the present invention, which includes the following specific steps:
step 101: and performing feature extraction on the image to be identified to obtain a first feature vector, wherein the dimension of the first feature vector is C1H 1W 1, C1 is the preset number of channels, H1 is the preset feature length of each channel, and W1 is the preset feature width of each channel.
Step 102: and respectively performing characteristic degradation on the first characteristic vector under different granularities, and respectively performing attention coefficient calculation of spatial levels on the characteristic vectors obtained after the characteristic degradation under the different granularities to obtain the attention coefficient of each characteristic value of each channel under the different granularities.
Here, feature degradation at different granularities means that the H1 × W1 feature map is divided, from coarse to fine, into for example 3 × 3, 6 × 6 and 9 × 9 sub-regions. The sizes of each sub-region for the three granularities are (H1/3) × (W1/3), (H1/6) × (W1/6) and (H1/9) × (W1/9), respectively, and the corresponding feature shapes are C1 × 9 × (H1/3) × (W1/3), C1 × 36 × (H1/6) × (W1/6) and C1 × 81 × (H1/9) × (W1/9). A maximum operation is then performed over the second dimension of each granularity feature, yielding features of shape C1 × (H1/3) × (W1/3), C1 × (H1/6) × (W1/6) and C1 × (H1/9) × (W1/9). This operation summarizes the most distinctive features from different spatial regions; on one hand it removes interference from background features, and on the other hand it mines the information most meaningful for classification.
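As an illustrative sketch of this operation (a sketch only, assuming H1 and W1 are divisible by m; the function name is not from the patent), the degradation at granularity m² can be written in PyTorch as a reshape into m² blocks followed by a maximum over the block dimension:

```python
import torch

def feature_degradation(x: torch.Tensor, m: int) -> torch.Tensor:
    """Divide each channel's H1 x W1 map into an m x m grid of sub-regions
    of size (H1/m) x (W1/m), then take the element-wise maximum across the
    m*m sub-regions (a max over the block dimension).
    x: (B, C1, H1, W1) with H1, W1 divisible by m; returns (B, C1, H1/m, W1/m)."""
    b, c, h, w = x.shape
    blocks = (x.view(b, c, m, h // m, m, w // m)   # split H and W into m blocks each
                .permute(0, 1, 2, 4, 3, 5)          # (B, C1, m, m, H1/m, W1/m)
                .reshape(b, c, m * m, h // m, w // m))
    return blocks.max(dim=2).values
```

For m = 3, 6 and 9 this yields exactly the C1 × (H1/3) × (W1/3), C1 × (H1/6) × (W1/6) and C1 × (H1/9) × (W1/9) features described above.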
Step 103: and for each characteristic value in each channel under each granularity, multiplying the characteristic value by the attention coefficient of the characteristic value to obtain a spatial enhancement characteristic value corresponding to the characteristic value, wherein all the spatial enhancement characteristic values of all the channels under each granularity form a second characteristic vector under the granularity.
The attention coefficient takes values in the range [0, 1]. The greater the value of the attention coefficient, the more important the corresponding feature value; conversely, the smaller the value, the less important the corresponding feature value.
Step 104: and respectively adopting the second feature vectors under different granularities to carry out target classification calculation so as to obtain the probability that the image to be recognized contains each target class under different granularities.
Step 105: and for each target category, carrying out weighted calculation on the probability of the target category contained in the image to be recognized under different granularities to obtain the final probability of the target category contained in the image to be recognized.
A frame of image may contain a plurality of object classes, and only the probability that the image to be recognized contains each object class is given here for the user's reference. The probability of each target class ranges over [0%, 100%].
In the above embodiment, feature degradation at different granularities is performed on the first feature vector extracted from the image to be recognized, spatial-level attention coefficient calculation is performed on the feature vectors obtained after feature degradation at each granularity to obtain a spatially enhanced feature vector at each granularity, target classification calculation is performed with the spatially enhanced feature vectors at the different granularities to obtain the probability that the image to be identified contains each target category at each granularity, and then, for each target class, the probabilities obtained at the different granularities are weighted to obtain the final probability that the image to be recognized contains that target class. Fine-grained classification of target classes is thereby realized, and target categories whose discriminative features are not obvious, such as those that are widely distributed but occupy a small proportion of the area, can be effectively distinguished; in addition, because multi-granularity features are used, target classes that are highly confusable with one another can be effectively distinguished. Therefore, the method refines the category range of target category identification and improves its identification precision.
In an optional embodiment, step 101 specifically includes:
step 1011: extracting channel features of an image to be identified to obtain a channel feature vector, wherein the dimension of the channel feature vector is C1H 1W 1, C1 is the preset number of channels, H1 is the preset feature length of each channel, and W1 is the preset feature width of each channel.
Step 1012: and performing channel-level attention coefficient calculation on the channel feature vectors to obtain the attention coefficient of each channel.
Step 1013: and for each channel in the channel feature vector, multiplying each feature value in the channel by the attention coefficient of the channel to obtain each channel enhancement feature value of the channel, wherein all the channel enhancement feature values of all the channels form a first feature vector.
In the embodiment, the attention coefficient of each channel in the channel feature vector is calculated, and each feature value in each channel in the channel feature vector is multiplied by the attention coefficient of the channel to obtain the first feature vector, so that the target feature in the image to be recognized is enhanced, the non-target feature in the image to be recognized is weakened, and the accuracy of target class recognition is finally improved.
In an optional embodiment, in step 1011, performing channel feature extraction on the image to be recognized includes: inputting an image to be identified into a backbone network of a neural network for channel feature extraction;
in step 1012, the calculating of the attention coefficient at the channel level for the channel feature vector includes: and inputting the channel feature vector into a channel-level attention module of the neural network to calculate the channel-level attention coefficient.
After the calculation of the attention module at the channel level, useful channel characteristics are endowed with a larger attention coefficient, and useless channel characteristics are endowed with a smaller attention coefficient, so that the useful channel characteristics are strengthened, and the useless channel characteristics are weakened.
In the above embodiment, the extraction of the channel feature of the image to be recognized is realized through the backbone network in the neural network, and the calculation of the attention coefficient of the channel level of the channel feature vector is realized through the attention module of the neural network.
In an optional embodiment, in step 101, performing feature extraction on the image to be recognized includes: inputting an image to be identified into a backbone network of a neural network for feature extraction;
in step 102, the calculation of attention coefficients of spatial levels is performed on feature vectors obtained after feature degradation under different granularities, and the calculation includes: respectively inputting the feature vectors obtained under different granularities after feature degradation into attention modules of the neural network at the spatial level of the corresponding granularity for spatial level attention coefficient calculation;
in step 104, performing target classification calculation by respectively using the second feature vectors at different granularities, including: and respectively inputting the second feature vectors under different granularities into the full connection layer of the neural network with the corresponding granularity to perform target classification calculation.
In the embodiment, the feature extraction of the image to be recognized is realized through the backbone network of the neural network, the calculation of the attention coefficient of the spatial level of the feature vector after the feature degradation is realized through the attention modules of the spatial levels of the neural network under different granularities, and the target class recognition under different granularities is realized through the full connection layers of the neural network under different granularities.
In an optional embodiment, in step 104, performing the target classification calculation by respectively using the second feature vectors at different granularities includes: and respectively carrying out channel-based global average pooling on the second feature vectors under different granularities to obtain a global average feature value of each channel under different granularities, and respectively adopting the global average feature values of all the channels under different granularities to carry out target classification calculation under the granularity, thereby obtaining the probability that the to-be-recognized images under different granularities contain all target classes.
In an optional embodiment, in step 102, performing feature degradation on the first feature vector under different granularities respectively includes: let the current granularity be m², then all feature values in each channel of the first feature vector are divided into m² sub-regions, each sub-region containing (H1/m) × (W1/m) feature values; for the m² feature values at the same position within the m² sub-regions, the maximum of the m² feature values is taken, so that (H1/m) × (W1/m) maxima are obtained in total; the (H1/m) × (W1/m) maxima form the feature values of the channel after feature degradation, and the feature values of all channels after feature degradation form the feature vector after feature degradation at the current granularity m², where m is an integer not less than 1.
In the above embodiment, by collecting features at different granularities, discriminative feature representations from different positions are aggregated, and information such as background noise is reduced to participate in the final classification decision. One of the two benefits is that the target category feature representation with wide distribution and small actual area ratio can be kept to the maximum extent, and meanwhile, background information which is meaningless for target category distinguishing is filtered out; and the other is to aggregate local representation information of different areas, which has important significance for distinguishing the categories with high confusion among the categories.
In an optional embodiment, before step 101, further comprising: the method comprises the steps of collecting video streams of a target to be identified, dividing the collected video streams into preset first number of video segments, and sampling second number of frame images from each video segment to serve as images to be identified.
Fig. 2 is a flowchart of a training method of a neural network for target class identification according to an embodiment of the present invention, which includes the following specific steps:
step 201: and acquiring images in a multi-frame target class identification scene as training images.
Step 202: and sequentially inputting the training images into a backbone network of the neural network for feature extraction to obtain a first feature vector.
Step 203: and respectively performing characteristic degradation on the first characteristic vectors under different granularities, and respectively inputting the characteristic vectors obtained after the characteristic degradation under the different granularities into attention modules of the neural network at the space level corresponding to the granularity for performing attention coefficient calculation at the space level to obtain the attention coefficient of each characteristic value of each channel under the different granularities.
Step 204: and for each characteristic value in each channel under each granularity, multiplying the characteristic value by the attention coefficient of the characteristic value to obtain a spatial enhancement characteristic value corresponding to the characteristic value, wherein all the spatial enhancement characteristic values of all the channels under each granularity form a second characteristic vector under the granularity.
Step 205: and respectively inputting the second feature vectors under different granularities into the full-connection layer of the neural network with the corresponding granularity for carrying out target classification calculation to obtain the probability that the image to be recognized contains each target class under different granularities.
Step 206: and for each target category, carrying out weighted calculation on the probability of the target category contained in the image to be recognized under different granularities to obtain the final prediction probability of the target category contained in the image to be recognized.
Step 207: and calculating the final prediction probability of each target class in the training image and the real probability of each target class in the training image through a cross entropy function to obtain a prediction loss value, and using the prediction loss value for updating the neural network parameters through back propagation gradients.
Step 208: when the neural network converges, the neural network at that time is taken as the neural network to be finally used.
In an optional embodiment, in the step 205, the step of inputting the second feature vectors at different granularities into the fully-connected layer of the neural network at the corresponding granularity for performing the target classification calculation includes: and respectively carrying out channel-based global average pooling on the second feature vectors under different granularities to obtain a global average feature value of each channel under different granularities, and respectively inputting the global average feature values of all the channels under each granularity into a full-connection layer of the neural network under the corresponding granularity to carry out target classification calculation under the corresponding granularity, thereby obtaining the probability that the image to be recognized under different granularities contains each target class.
In an optional embodiment, in step 203, the performing feature degradation on the first feature vector under different granularities respectively includes: let the current granularity be m², then all feature values in each channel of the first feature vector are divided into m² sub-regions, each sub-region containing (H1/m) × (W1/m) feature values; for the m² feature values at the same position within the m² sub-regions, the maximum of the m² feature values is taken, so that (H1/m) × (W1/m) maxima are obtained in total; the (H1/m) × (W1/m) maxima form the feature values of the channel after feature degradation, and the feature values of all channels after feature degradation form the feature vector after feature degradation at the current granularity m², where m is an integer not less than 1.
In an optional embodiment, between the steps 201 and 202, the method further includes: the method comprises the steps of collecting video streams of targets to be recognized, dividing the collected video streams into a preset third number of video segments, recognizing target types contained in each video segment, sampling a fourth number of frame images from each video segment as training images, and marking the probability of each target type contained in each training image on each training image, wherein for any training image, if the video segment to which the training image belongs contains one target type, the probability of marking the target type on the training image is 100%, and if the video segment to which the training image belongs does not contain one target type, the probability of marking the target type on the training image is 0%.
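As a minimal sketch of this labelling rule (the function and field names are illustrative assumptions, not part of the method), the per-frame training labels can be derived from per-segment annotations as follows:

```python
import random
from typing import Dict, List

def build_training_frames(num_frames: int, num_segments: int,
                          segment_categories: List[List[str]],
                          frames_per_segment: int) -> List[Dict]:
    """Split a video of num_frames frames into num_segments equal segments,
    sample frames_per_segment frame indices from each segment, and let each
    sampled frame inherit the target categories annotated for its segment
    (probability 100% if the segment contains the category, 0% otherwise)."""
    samples = []
    seg_len = num_frames // num_segments
    for s in range(num_segments):
        start = s * seg_len
        for idx in random.sample(range(start, start + seg_len), frames_per_segment):
            samples.append({"frame_index": idx,
                            "categories": list(segment_categories[s])})
    return samples
```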
The image to be identified in the embodiment of the invention can be a pipeline image, and correspondingly, the target class is a pipeline defect class.
In an alternative embodiment, the pipe defect categories include: blind joints, deformations, misconnections, wall debris, penetrations, corrosion, scum, scale, undulations, roots, disjointing, shedding, obstructions, faults, deposits, leaks, or/and cracks.
Fig. 3 is a flowchart of a training method of a neural network for target class recognition according to another embodiment of the present invention, which includes the following specific steps:
step 301: receiving a video stream collected aiming at the position of a target, segmenting the collected video stream according to a preset segmentation length, and labeling the target category appearing in each video segment.
Step 302: respectively randomly sampling a frame of image from each video segment, and labeling a target category appearing in each frame of sampled image, wherein the target category appearing in each frame of sampled image is the target category appearing in the video segment to which the frame of sampled image belongs.
Step 303: randomly selecting an enhancement processing method from preset enhancement processing methods for each frame of sampling image, and performing enhancement processing on the frame of sampling image by adopting the selected enhancement processing method; selecting images with a preset proportion from the enhanced sampling images as training images, putting the training images into a training set, taking the rest images as verification images, and putting the verification images into a verification set.
The enhancement processing methods include: random horizontal flipping, brightness jitter, random small-angle rotation, and the like.
The value of the preset proportion can be set as required, for example, 80%.
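As an illustrative sketch (the specific parameter values are assumptions; only the types of enhancement come from the text above), one enhancement method can be picked at random per sampled image using torchvision transforms:

```python
import random
from torchvision import transforms

# Illustrative augmentation pool matching the enhancement methods named above;
# the parameter values (flip probability, jitter strength, rotation angle) are assumptions.
AUGMENTATIONS = [
    transforms.RandomHorizontalFlip(p=1.0),
    transforms.ColorJitter(brightness=0.2),
    transforms.RandomRotation(degrees=10),
]

def augment(image):
    """Randomly select one enhancement method and apply it to the sampled image."""
    return random.choice(AUGMENTATIONS)(image)
```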
Step 304: and sequentially inputting each frame of training image in the training set to a backbone network of the neural network in a first vector form so as to extract channel characteristics of the training image, and outputting a second vector by the backbone network.
Such as: if the number of channels of the training image is C0 (e.g., if the training image is an RGB image, C0= 3), and the resolution is H0 × W0 (i.e., the height of the training image is H0 and the width is W0), the training image may be represented as a first vector of C0 × H0 × W0.
Assuming that the number of channels output by the backbone network is C1 and the number of down-sampling times is n, the dimensionality of the second vector output by the backbone network is: c1 × H1 × W1, wherein H1= H0/n, W1= W0/n. Wherein, C1 generally takes 256, 512 or 1024.
The backbone network may employ a ResNet50 network.
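As one possible, non-binding choice consistent with the dimensions above (assuming a recent torchvision), a ResNet50 truncated after its layer3 stage yields C1 = 1024 channels at a downsampling factor of n = 16:

```python
import torch
from torch import nn
from torchvision import models

# Illustrative backbone: torchvision ResNet50 truncated after layer3,
# which gives C1 = 1024 output channels with n = 16 downsampling.
resnet = models.resnet50(weights=None)
backbone = nn.Sequential(*list(resnet.children())[:-3])  # up to and including layer3

x = torch.randn(1, 3, 512, 512)      # first vector: C0 = 3, H0 = W0 = 512 (example values)
second_vector = backbone(x)          # second vector: shape (1, 1024, 32, 32)
```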
Step 305: the second vector is output to a channel-level attention (channel-wise attention) module of the neural network to calculate an attention coefficient for each channel, the attention coefficients of all channels constituting a third vector.
The dimension of the third vector is C1, that is, each of the C1 channels corresponds to one attention coefficient, and the attention coefficient takes values in the range [0, 1].
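The text fixes only the input and output shapes of this module; as one common realization (an assumption, in the style of squeeze-and-excitation), the third vector can be computed as:

```python
import torch
from torch import nn

class ChannelWiseAttention(nn.Module):
    """One possible realization of the channel-level attention module:
    global average pooling followed by a small bottleneck MLP and a sigmoid,
    producing one coefficient in [0, 1] per channel (the third vector).
    The internal structure is an assumption; only the shapes come from the text."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        coeff = self.fc(x.mean(dim=(2, 3)))   # (B, C1), values in [0, 1]
        return coeff

# Step 306 (fourth vector): fourth = second * coeff.view(batch, C1, 1, 1)
```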
Step 306: and for each channel in the second vector, multiplying each characteristic value in the channel by the attention coefficient of the channel in the third vector to obtain each channel enhancement characteristic value in the channel, wherein all the channel enhancement characteristic values of all the channels form a fourth vector.
That is, for the H1 × W1 feature values in any channel c1 of the C1 channels of the second vector, each of the H1 × W1 feature values is multiplied by the attention coefficient of channel c1 in the third vector, where 1 ≤ c1 ≤ C1.
The dimension of the fourth vector is the same as the second vector, and is still: c1 × H1 × W1.
Step 307: inputting the fourth vector into a coarse-grained spatial-level attention (spatial-wise attention) module of the neural network to obtain an attention coefficient for each feature value in the fourth vector, the attention coefficients of all feature values in the fourth vector forming a fifth vector.
Each eigenvalue in the fourth vector is the channel enhancement eigenvalue in step 306.
The dimension of the fifth vector is the same as the fourth vector, and is still: C1 × H1 × W1. The attention coefficient in this step has a value range of [0, 1].
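The internal structure of the spatial-level attention module is likewise not fixed by the text; one simple realization (an assumption) that produces a coefficient in [0, 1] for every feature value, so that the fifth, ninth and thirteenth vectors keep the shape of their inputs, is:

```python
import torch
from torch import nn

class SpatialWiseAttention(nn.Module):
    """One possible realization of a spatial-level attention module: a 3x3
    convolution followed by a sigmoid, giving one coefficient in [0, 1] for
    every feature value, so the output has the same C1 x H x W shape as its
    input. The internal structure is an assumption."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.conv(x))   # per-element coefficients in [0, 1]

# Step 308 (sixth vector): sixth = fourth * SpatialWiseAttention(C1)(fourth)
```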
Step 308: and multiplying each feature value in the fourth vector by the attention coefficient of the feature value in the fifth vector to obtain a spatial enhancement feature value corresponding to each feature value in the fourth vector, wherein the spatial enhancement feature values corresponding to all the feature values in the fourth vector form a sixth vector.
The sixth vector also has dimensions C1 × H1 × W1.
Step 309: and performing global average pooling on the feature values in each channel in the sixth vector to obtain a global average feature value of each channel, wherein the global average feature values of all the channels form a seventh vector.
Performing global average pooling on the feature values in each channel in the sixth vector means averaging the H1 × W1 feature values in each channel c1 (1 ≤ c1 ≤ C1) of the C1 channels of the sixth vector to obtain the global average feature value of channel c1. The dimension of the seventh vector is C1.
Step 310: and inputting the seventh vector into a full connection layer of the neural network under the coarse granularity to obtain the probability that the training image under the coarse granularity contains each target category.
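As an illustrative sketch of steps 309 and 310 (the use of a sigmoid and num_classes = 17 are assumptions; the text only specifies global average pooling followed by a fully connected layer per granularity):

```python
import torch
from torch import nn

# Illustrative coarse-granularity classification head: channel-wise global
# average pooling of the sixth vector gives the seventh vector of dimension C1,
# and a fully connected layer maps it to per-class probabilities. num_classes = 17
# corresponds to the full set of pipe defect categories mentioned above; a sigmoid
# is assumed because one image may contain several target categories.
C1, num_classes = 1024, 17
coarse_fc = nn.Linear(C1, num_classes)

def coarse_head(sixth_vector: torch.Tensor) -> torch.Tensor:
    seventh_vector = sixth_vector.mean(dim=(2, 3))     # (B, C1)
    return torch.sigmoid(coarse_fc(seventh_vector))    # (B, num_classes)
```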
Step 311: for each channel in the fourth vector, all feature values in the channel are uniformly divided into p² (p > 1) sub-regions, each sub-region corresponding to a sub-feature vector of dimension (H1/p) × (W1/p); for the p² feature values at the same position within the p² sub-regions, the maximum feature value is taken, and all the maxima obtained for all channels finally form an eighth vector.
The dimension of the fourth vector is C1 × H1 × W1. After all feature values in each channel are uniformly divided into p² sub-regions, the sub-feature vector corresponding to each sub-region has dimension (H1/p) × (W1/p); for the p² feature values at the same position within the p² sub-regions, the maximum of the p² feature values is taken, so that (H1/p) × (W1/p) maxima are obtained in total. The C1 channels therefore yield C1 × (H1/p) × (W1/p) maxima, and the dimension of the eighth vector is C1 × (H1/p) × (W1/p).
Step 312: and inputting the eighth vector to a spatial-level attention module of medium granularity of the neural network to obtain an attention coefficient of each characteristic value in the eighth vector, wherein the attention coefficients of all the characteristic values in the eighth vector form a ninth vector.
The dimension of the ninth vector is the same as the eighth vector, and is still: C1 × (H1/p) × (W1/p). The attention coefficient in this step has a value range of [0, 1].
Step 313: and multiplying each feature value in the eighth vector by the attention coefficient of the feature value in the ninth vector to obtain a spatial enhancement feature value corresponding to each feature value in the eighth vector, wherein the spatial enhancement feature values corresponding to all the feature values in the eighth vector form a tenth vector.
The dimension of the tenth vector is also C1 × (H1/p) × (W1/p).
Step 314: and performing global average pooling on the feature values in each channel in the tenth vector to obtain a global average feature value of each channel, wherein the global average feature values of all the channels form an eleventh vector.
Performing global average pooling on the feature values in each channel in the tenth vector means averaging the (H1/p) × (W1/p) feature values in each channel c1 (1 ≤ c1 ≤ C1) of the C1 channels of the tenth vector to obtain the global average feature value of channel c1. The dimension of the eleventh vector is C1.
Step 315: and inputting the eleventh vector into a full-connection layer of the neural network under the medium granularity to obtain the probability that the training image under the medium granularity contains each target category.
Step 316: for each channel in the fourth vector, all feature values in the channel are uniformly divided into q² (q > p > 1) sub-regions, each sub-region corresponding to a sub-feature vector of dimension (H1/q) × (W1/q); for the q² feature values at the same position within the q² sub-regions, the maximum feature value is taken, and all the maxima obtained for all channels finally form a twelfth vector.
The dimension of the fourth vector is C1 × H1 × W1. After all feature values in each channel are uniformly divided into q² sub-regions, the sub-feature vector corresponding to each sub-region has dimension (H1/q) × (W1/q); for the q² feature values at the same position within the q² sub-regions, the maximum of the q² feature values is taken, so that (H1/q) × (W1/q) maxima are obtained in total. The C1 channels therefore yield C1 × (H1/q) × (W1/q) maxima, and the dimension of the twelfth vector is C1 × (H1/q) × (W1/q).
Step 317: and inputting the twelfth vector into a fine-grained spatial-level attention module of the neural network to obtain an attention coefficient of each characteristic value in the twelfth vector, wherein the attention coefficients of all the characteristic values in the twelfth vector form a thirteenth vector.
The dimension of the thirteenth vector is the same as the twelfth vector, still: C1 × (H1/q) × (W1/q). The attention coefficient in this step has a value range of [0, 1].
Step 318: and multiplying each feature value in the twelfth vector by the attention coefficient of the feature value in the thirteenth vector to obtain a spatial enhancement feature value corresponding to each feature value in the twelfth vector, wherein the spatial enhancement feature values corresponding to all the feature values in the twelfth vector form a fourteenth vector.
The fourteenth vector is also of dimension C1 × (H1/q) × (W1/q).
Step 319: and performing global average pooling on the feature values in each channel in the fourteenth vector to obtain a global average feature value of each channel, wherein the global average feature values of all the channels form a fifteenth vector.
Performing global average pooling on the feature values in each channel in the fourteenth vector means averaging the (H1/q) × (W1/q) feature values in each channel c1 (1 ≤ c1 ≤ C1) of the C1 channels of the fourteenth vector to obtain the global average feature value of channel c1. The dimension of the fifteenth vector is C1.
Step 320: and inputting the fifteenth vector into a full-connection layer of the neural network under the fine granularity to obtain the probability that the training image under the fine granularity contains each target class.
Step 321: according to the probabilities of each target class in the coarse-grained, medium-grained and fine-grained training images obtained in steps 310, 315 and 320, respectively, the probabilities of the target classes in the coarse-grained, medium-grained and fine-grained training images are weighted and summed for each target class to obtain the final prediction probability of the target classes in the training images.
And if the weights corresponding to the probabilities of the same target class in the training images under the coarse granularity, the medium granularity and the fine granularity are respectively alpha, beta and gamma, then alpha + beta + gamma = 1.
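A minimal sketch of this fusion (the particular weight values are assumptions; only the constraint alpha + beta + gamma = 1 comes from the text):

```python
import torch

# Illustrative fusion of the three granularity outputs (step 321);
# the weight values below are example assumptions satisfying alpha + beta + gamma = 1.
alpha, beta, gamma = 0.3, 0.3, 0.4

def fuse(p_coarse: torch.Tensor, p_medium: torch.Tensor,
         p_fine: torch.Tensor) -> torch.Tensor:
    """Weighted sum of the per-class probabilities from the three granularities."""
    return alpha * p_coarse + beta * p_medium + gamma * p_fine
```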
Step 322: and calculating a prediction loss value through a cross entropy function (cross entropy) on the final prediction probability containing each target class in the training image and the real probability containing each target class in the training image, and using the prediction loss value for updating the neural network parameters in a back propagation gradient.
If the training image contains a certain target class, the true probability that the training image contains the target class is 100%, otherwise, the true probability is 0%.
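Since every target category carries an independent true probability of 0% or 100%, the cross entropy in step 322 can be computed per class; the following sketch assumes a multi-label binary cross-entropy formulation (an interpretation consistent with the labelling rule above, not mandated by the text):

```python
import torch
import torch.nn.functional as F

def prediction_loss(final_probs: torch.Tensor, true_probs: torch.Tensor) -> torch.Tensor:
    """Per-class cross entropy between the fused prediction probabilities and
    the 0%/100% true probabilities (multi-label binary cross-entropy, assumed)."""
    return F.binary_cross_entropy(final_probs, true_probs)

# loss = prediction_loss(fuse(p_coarse, p_medium, p_fine), targets)
# loss.backward()  # back-propagate gradients to update the neural network parameters
```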
When the neural network converges, the neural network at that time is taken as the neural network to be finally used.
Wherein, steps 307 to 310, steps 311 to 315, and steps 316 to 320 are performed in parallel.
Fig. 4 is a schematic structural diagram of an object class identification apparatus according to an embodiment of the present invention, where the apparatus mainly includes: a feature extraction module 41, a spatial attention module 42 under multiple granularities, a spatial enhancement module 43 under multiple granularities, a target classification module 44 under multiple granularities, and a weighting calculation module 45, wherein:
the feature extraction module 41 is configured to perform feature extraction on the image to be identified to obtain a first feature vector, where a dimension of the first feature vector is C1 × H1 × W1, where C1 is a preset number of channels, H1 is a preset feature length of each channel, and W1 is a preset feature width of each channel.
The feature degradation and spatial attention module 42 under multiple granularities is configured to perform feature degradation under different granularities on the first feature vector, and perform spatial-level attention coefficient calculation on feature vectors obtained after feature degradation under different granularities, so as to obtain an attention coefficient of each feature value of each channel under different granularities.
And a spatial enhancement module 43 under multiple granularities, configured to, for each feature value in each channel under each granularity, multiply the feature value by the attention coefficient of the feature value to obtain a spatial enhancement feature value corresponding to the feature value, where all spatial enhancement feature values of all channels under each granularity constitute a second feature vector under the granularity.
And the target classification module 44 under multiple granularities is used for performing target classification calculation by respectively adopting the second feature vectors under different granularities to obtain the probability that the image to be recognized contains each target class under different granularities.
And the weighting calculation module 45 is used for performing weighting calculation on the probability that the target category is contained in the image to be recognized under different granularities for each target category to obtain the final probability that the target category is contained in the image to be recognized.
Fig. 5 is a schematic structural diagram of a training apparatus for a neural network for target class identification according to an embodiment of the present invention, where the apparatus mainly includes: a training image obtaining module 51, a feature extraction module 52, a multi-granularity feature degradation and spatial attention module 53, a multi-granularity spatial enhancement module 54, and a multi-granularity target classification module 55, wherein:
and a training image obtaining module 51, configured to obtain images in the multi-frame target class identification scene as training images.
And the feature extraction module 52 is configured to sequentially input the training images acquired by the training image acquisition module 51 into a backbone network of the neural network for feature extraction, so as to obtain a first feature vector.
The feature degradation and spatial attention module 53 under multiple granularities is configured to perform feature degradation under different granularities on the first feature vectors obtained by the feature extraction module 52, and input the feature vectors after feature degradation under different granularities into the attention module of the neural network at the spatial level corresponding to the granularity for performing spatial level attention coefficient calculation, so as to obtain an attention coefficient of each feature value of each channel under different granularities.
And a spatial enhancement module 54 under multiple granularities, configured to, according to the feature degradation result of the feature degradation and spatial attention module 53 under multiple granularities and the obtained attention coefficient, for each feature value in each channel under each granularity, multiply the feature value by the attention coefficient of the feature value to obtain a spatial enhancement feature value corresponding to the feature value, where all spatial enhancement feature values of all channels under each granularity constitute a second feature vector under the granularity.
The multi-granularity target classification module 55 is configured to input the second feature vectors obtained by the multi-granularity spatial enhancement module 54 under different granularities into the full connection layer of the neural network with the corresponding granularity for performing target classification calculation, so as to obtain probabilities that the to-be-recognized images under different granularities contain each target class; for each target category, carrying out weighted calculation on the probability of the target category contained in the image to be recognized under different granularities to obtain the final prediction probability of the target category contained in the image to be recognized; calculating the final prediction probability of each target category contained in the training image and the real probability of each target category contained in the training image through a cross entropy function to obtain a prediction loss value, and using the prediction loss value for updating the neural network parameters through back propagation gradients; when the neural network converges, the neural network at that time is taken as the neural network to be finally used.
Embodiments of the present invention further provide a computer program product comprising a computer program or instructions which, when executed by a processor, implement the steps of the target class identification method or of the training method for a neural network for target class identification described in any of the above embodiments.
Embodiments of the present invention also provide a computer-readable storage medium storing instructions that, when executed by a processor, perform the steps of the target class identification method or of the training method for a neural network for target class identification described above. In practical applications, the computer-readable medium may be included in any of the devices/apparatuses/systems of the above embodiments, or may exist separately without being assembled into such a device/apparatus/system.
According to embodiments disclosed herein, the computer-readable storage medium may be a non-volatile computer-readable storage medium, including, for example and without limitation: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the embodiments disclosed herein, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
An embodiment of the present invention further provides an electronic device. Fig. 6 shows a schematic structural diagram of the electronic device according to an embodiment of the present invention. Specifically:
the electronic device may include a processor 61 having one or more processing cores, a memory 62 comprising one or more computer-readable storage media, and a computer program stored in the memory and executable on the processor. When the program in the memory 62 is executed, the above-described target class identification method or the training method for a neural network for target class identification may be implemented.
In practical applications, the electronic device may further include a power supply 63, an input/output unit 64, and the like. Those skilled in the art will appreciate that the configuration shown in fig. 6 does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or arrange the components differently. Wherein:
the processor 61 is the control center of the electronic device; it connects the various parts of the entire electronic device through various interfaces and lines, and performs the various functions of the electronic device and processes data by running or executing the software programs and/or modules stored in the memory 62 and calling the data stored in the memory 62, thereby monitoring the electronic device as a whole.
The memory 62 may be used to store software programs and modules, i.e., the computer-readable storage media described above. The processor 61 executes various functional applications and data processing by running the software programs and modules stored in the memory 62. The memory 62 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required for at least one function, and the like, and the data storage area may store data created during use of the electronic device, and the like. Further, the memory 62 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 62 may also include a memory controller to provide the processor 61 with access to the memory 62.
The electronic device further comprises a power supply 63 for supplying power to the various components. The power supply 63 may be logically connected to the processor 61 via a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system. The power supply 63 may also include one or more of a DC or AC power source, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The electronic device may also include an input/output unit 64, which may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control. The input/output unit 64 may also be used to display information input by or provided to the user, as well as various graphical user interfaces, which may be composed of graphics, text, icons, video, and any combination thereof.
The flowchart and block diagrams in the figures of the present application illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments disclosed herein. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure may be combined and/or coupled in various ways, even if such combinations or sub-combinations are not explicitly recited in the present application; all such combinations fall within the scope of the present disclosure without departing from the spirit and teachings of the present application.
The principles and embodiments of the present invention are explained herein using specific examples, which are provided only to help understand the method and its core idea, and are not intended to limit the present application. It will be appreciated by those skilled in the art that changes may be made to these embodiments without departing from the principles, spirit, and scope of the invention, and all such modifications, equivalents, and improvements are intended to be protected by the claims.

Claims (8)

1. An object class identification method, characterized in that the method comprises:
extracting features of an image to be recognized to obtain a first feature vector, wherein the dimension of the first feature vector is C1×H1×W1, C1 is the preset number of channels, H1 is the preset feature length of each channel, and W1 is the preset feature width of each channel;
respectively performing feature degradation on the first feature vector under different granularities, and respectively performing attention coefficient calculation of spatial levels on feature vectors obtained after feature degradation under different granularities to obtain an attention coefficient of each feature value of each channel under different granularities;
for each characteristic value in each channel under each granularity, multiplying the characteristic value by the attention coefficient of the characteristic value to obtain a spatial enhancement characteristic value corresponding to the characteristic value, wherein all the spatial enhancement characteristic values of all the channels under each granularity form a second characteristic vector under each granularity;
respectively adopting the second feature vectors under different granularities to carry out target classification calculation to obtain the probability that the image to be recognized contains each target class under different granularities;
for each target category, carrying out weighted calculation on the probability of the target category contained in the image to be recognized under different granularities to obtain the final probability of the target category contained in the image to be recognized;
the performing feature degradation on the first feature vector under different granularities respectively includes:
let the current granularity be m²; all feature values in each channel of the first feature vector are then divided into m² sub-regions, each sub-region containing (H1/m)×(W1/m) feature values; for the m² feature values located at the same position in the m² sub-regions, the maximum of the m² feature values is taken, so that (H1/m)×(W1/m) maximum values are obtained; the (H1/m)×(W1/m) maximum values constitute the feature values of the channel after feature degradation, and the feature values of all channels after feature degradation constitute the feature vector after feature degradation at the current granularity m², where m is an integer not less than 1.
2. The method according to claim 1, wherein the performing the target classification calculation by using the second feature vectors at different granularities respectively comprises:
and respectively carrying out channel-based global average pooling on the second feature vectors under different granularities to obtain a global average feature value of each channel under different granularities, and respectively adopting the global average feature values of all the channels under different granularities to carry out target classification calculation under different granularities so as to obtain the probability that the image to be recognized contains each target class under different granularities.
3. The method of claim 1, wherein the image to be identified is a pipe image and the object class is a pipe defect class.
4. The method of claim 3, wherein the pipe defect categories include: blind joints, deformations, misconnections, wall debris, penetrations, corrosion, scum, scale, undulations, roots, disjointing, shedding, obstructions, faults, deposits, leaks, or/and cracks.
5. A method of training a neural network for target class recognition, the method comprising:
acquiring multiple frames of images in the target category identification scene as training images;
sequentially inputting the training images into a backbone network of a neural network for feature extraction to obtain a first feature vector;
respectively performing feature degradation on the first feature vectors under different granularities, and respectively inputting the feature vectors obtained after the feature degradation under different granularities into an attention module of a neural network at a spatial level corresponding to the granularity for performing attention coefficient calculation at the spatial level to obtain an attention coefficient of each feature value of each channel under different granularities;
for each characteristic value in each channel under each granularity, multiplying the characteristic value by the attention coefficient of the characteristic value to obtain a spatial enhancement characteristic value corresponding to the characteristic value, wherein all the spatial enhancement characteristic values of all the channels under each granularity form a second characteristic vector under each granularity;
respectively inputting the second feature vectors under different granularities into the full-connection layer of the neural network with the corresponding granularity for carrying out target classification calculation to obtain the probability that the image to be recognized contains each target class under different granularities;
for each target category, carrying out weighted calculation on the probability of the target category contained in the image to be recognized under different granularities to obtain the final prediction probability of the target category contained in the image to be recognized;
calculating the final prediction probability of each target category in the training image and the real probability of each target category in the training image to obtain a prediction loss value, and using the prediction loss value for updating the neural network parameters;
when the neural network converges, taking the neural network at the moment as a neural network for final use;
the performing feature degradation on the first feature vector under different granularities respectively includes:
let the current granularity be m²; all feature values in each channel of the first feature vector are then divided into m² sub-regions, each sub-region containing (H1/m)×(W1/m) feature values; for the m² feature values located at the same position in the m² sub-regions, the maximum of the m² feature values is taken, so that (H1/m)×(W1/m) maximum values are obtained; the (H1/m)×(W1/m) maximum values constitute the feature values of the channel after feature degradation, and the feature values of all channels after feature degradation constitute the feature vector after feature degradation at the current granularity m², where m is an integer not less than 1.
6. The method of claim 5, wherein the respectively inputting the second feature vectors at different granularities into the fully-connected layer of the neural network at the corresponding granularity for performing the target classification calculation comprises:
and respectively carrying out channel-based global average pooling on the second feature vectors under different granularities to obtain a global average feature value of each channel under different granularities, and respectively inputting the global average feature values of all the channels under each granularity into a full-connection layer of the neural network under the corresponding granularity to carry out target classification calculation under the corresponding granularity, thereby obtaining the probability that the image to be recognized under different granularities contains each target class.
7. The method of claim 5, wherein, after acquiring the multiple frames of images in the target category identification scene as training images and before sequentially inputting the training images into the backbone network of the neural network for feature extraction, the method further comprises:
the method comprises the steps of collecting video streams of targets to be identified, dividing the collected video streams into a preset third number of video segments, identifying target categories contained in each video segment, sampling a fourth number of frame images from each video segment as training images, and marking the probability of each target category contained in each training image on each training image, wherein for any training image, if the video segment to which the training image belongs contains one target category, the probability of marking the target category on the training image is 100%, and if the video segment to which the training image does not contain one target category, the probability of marking the target category on the training image is 0%.
8. A non-transitory computer readable storage medium storing instructions which, when executed by a processor, cause the processor to perform the steps of the target class recognition method of any one of claims 1 to 4 or the steps of the training method for a neural network for target class recognition of any one of claims 5 to 7.
CN202210000748.6A 2022-01-04 2022-01-04 Target class identification method, training method and readable storage medium Active CN114004838B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210000748.6A CN114004838B (en) 2022-01-04 2022-01-04 Target class identification method, training method and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210000748.6A CN114004838B (en) 2022-01-04 2022-01-04 Target class identification method, training method and readable storage medium

Publications (2)

Publication Number Publication Date
CN114004838A CN114004838A (en) 2022-02-01
CN114004838B true CN114004838B (en) 2022-04-12

Family

ID=79932585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210000748.6A Active CN114004838B (en) 2022-01-04 2022-01-04 Target class identification method, training method and readable storage medium

Country Status (1)

Country Link
CN (1) CN114004838B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972849A (en) * 2022-05-10 2022-08-30 清华大学 Glioma type identification method, model training method, device and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111899224A (en) * 2020-06-30 2020-11-06 烟台市计量所 Nuclear power pipeline defect detection system based on deep learning attention mechanism
CN113221913A (en) * 2021-04-13 2021-08-06 北京工商大学 Agriculture and forestry disease and pest fine-grained identification method and device based on Gaussian probability decision-level fusion
CN113378883A (en) * 2021-05-12 2021-09-10 山东科技大学 Fine-grained vehicle classification method based on channel grouping attention model
CN113780256A (en) * 2021-11-12 2021-12-10 科大讯飞(苏州)科技有限公司 Image target detection method combining thickness classification and related device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106960219B (en) * 2017-03-10 2021-04-16 百度在线网络技术(北京)有限公司 Picture identification method and device, computer equipment and computer readable medium
CN110544217B (en) * 2019-08-30 2021-07-20 深圳市商汤科技有限公司 Image processing method and device, electronic equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111899224A (en) * 2020-06-30 2020-11-06 烟台市计量所 Nuclear power pipeline defect detection system based on deep learning attention mechanism
CN113221913A (en) * 2021-04-13 2021-08-06 北京工商大学 Agriculture and forestry disease and pest fine-grained identification method and device based on Gaussian probability decision-level fusion
CN113378883A (en) * 2021-05-12 2021-09-10 山东科技大学 Fine-grained vehicle classification method based on channel grouping attention model
CN113780256A (en) * 2021-11-12 2021-12-10 科大讯飞(苏州)科技有限公司 Image target detection method combining thickness classification and related device

Also Published As

Publication number Publication date
CN114004838A (en) 2022-02-01

Similar Documents

Publication Publication Date Title
Dang et al. DefectTR: End-to-end defect detection for sewage networks using a transformer
Wang et al. Towards an automated condition assessment framework of underground sewer pipes based on closed-circuit television (CCTV) images
Du et al. Amnet: Deep atrous multiscale stereo disparity estimation networks
CN111028217A (en) Image crack segmentation method based on full convolution neural network
Attali et al. Modeling noise for a better simplification of skeletons
CN111104903A (en) Depth perception traffic scene multi-target detection method and system
CN114004838B (en) Target class identification method, training method and readable storage medium
CN111079825B (en) Automatic cell nucleus detection method for medical image
Periyasamy et al. How to get the most out of u-net for glacier calving front segmentation
Beleznai et al. Human detection in groups using a fast mean shift procedure
CN112801029B (en) Attention mechanism-based multitask learning method
İmamoğlu et al. Saliency detection by forward and backward cues in deep-CNN
CN116912184B (en) Weak supervision depth restoration image tampering positioning method and system based on tampering area separation and area constraint loss
CN117372931A (en) Method and equipment for detecting swimming behavior of personnel in outdoor water area
CN112818809A (en) Method, device and storage medium for detecting image information
Porter et al. Interactive image quantification tools in nuclear material forensics
CN113609948B (en) Method, device and equipment for detecting video time sequence action
CN114004963B (en) Target class identification method and device and readable storage medium
CN111353577A (en) Optimization method and device of multi-task-based cascade combination model and terminal equipment
CN114298992A (en) Video frame duplication removing method and device, electronic equipment and storage medium
CN113409246A (en) Method and system for counting and positioning reinforcing steel bar heads
CN117523501B (en) Control method and system for pipeline inspection robot
CN112837326A (en) Remnant detection method, device and equipment
CN117197475B (en) Target detection method for large-range multi-interference-source scene
Bariko et al. Efficient implementation of Background Subtraction GMM method on Digital Signal Processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant