CN114612414A - Image processing method, model training method, device, equipment and storage medium - Google Patents

Image processing method, model training method, device, equipment and storage medium

Info

Publication number
CN114612414A
Authority
CN
China
Prior art keywords
block
map
feature map
feature
block feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210216643.4A
Other languages
Chinese (zh)
Inventor
黄钟毅
高斌斌
刘俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210216643.4A priority Critical patent/CN114612414A/en
Publication of CN114612414A publication Critical patent/CN114612414A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20021Dividing image into blocks, subimages or windows
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30242Counting objects in image

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Image Processing (AREA)

Abstract

The embodiments of the present application provide an image processing method, a model training method, an apparatus, a device, and a storage medium, relating to the technical field of artificial intelligence. The embodiments can be applied to various scenarios such as cloud technology, artificial intelligence, intelligent transportation, and driving assistance. The method includes: acquiring a relevant feature map corresponding to a target image, the relevant feature map characterizing the correlation between a target object to be counted and the target image; preprocessing the relevant feature map to obtain an initial block feature map comprising a plurality of feature blocks; performing density map prediction according to the initial block feature map to obtain a density map corresponding to the target image, where the density map prediction comprises at least one block processing operation that restores the spatial resolution corresponding to an input block feature map based on learnable parameters; and determining, based on the density map, a predicted number of target objects contained in the target image. The embodiments of the present application reduce the prediction error of the number of objects in an image.

Description

Image processing method, model training method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence, in particular to an image processing method, a model training method, a device, equipment and a storage medium.
Background
Image processing technology can be applied to a variety of application scenarios, such as generic object counting (i.e., few-shot counting) using an image processing model.
In the related art, density map prediction is performed using only convolution operations and interpolation algorithms, and the predicted number of objects in an image is then determined based on the predicted density map. However, because only convolution operations and non-learnable interpolation algorithms are used, the learning effect is poor, and the prediction error of the number of objects in the image is therefore large.
Disclosure of Invention
The embodiment of the application provides an image processing method, a model training method, a device, equipment and a storage medium, and reduces prediction errors of the number of objects in an image. The technical scheme is as follows:
according to an aspect of an embodiment of the present application, there is provided an image processing method including:
acquiring a relevant feature map corresponding to a target image, wherein the relevant feature map is used for representing the correlation between a target object to be counted and the target image;
preprocessing the related feature map to obtain an initial block feature map comprising a plurality of feature blocks;
performing density map prediction according to the initial block feature map to obtain a density map corresponding to the target image; wherein the density map prediction comprises at least one block processing operation for restoring a spatial resolution corresponding to an input block feature map based on a learnable parameter;
determining a predicted number of the target objects contained in the target image based on the density map.
According to an aspect of an embodiment of the present application, there is provided a training method of an image processing model, the image processing model including a feature extraction and correlation calculation network, a block embedding network, and a density map prediction network, the method including:
acquiring a relevant feature map corresponding to the sample image through the feature extraction and correlation calculation network, wherein the relevant feature map is used for representing the correlation between the sample object to be counted and the sample image;
preprocessing the related feature map through the block embedded network to obtain an initial block feature map comprising a plurality of feature blocks;
performing density map prediction according to the initial block feature map through the density map prediction network to obtain a predicted density map corresponding to the sample image; wherein the density map prediction comprises at least one block processing operation for restoring a spatial resolution corresponding to the input block feature map based on a learnable parameter;
training the image processing model based on the predicted density map and an actual density map of the sample image for the sample object.
According to an aspect of an embodiment of the present application, there is provided an image processing apparatus including:
the characteristic diagram acquisition module is used for acquiring a relevant characteristic diagram corresponding to a target image, and the relevant characteristic diagram is used for representing the correlation between a target object to be counted and the target image;
the block embedding module is used for preprocessing the related characteristic diagram to obtain an initial block characteristic diagram containing a plurality of characteristic blocks;
the density map prediction module is used for predicting a density map according to the initial block feature map to obtain a density map corresponding to the target image; wherein the density map prediction comprises at least one block processing operation for restoring a spatial resolution corresponding to the input block feature map based on a learnable parameter;
a quantity determination module to determine a predicted quantity of the target object contained in the target image based on the density map.
According to an aspect of an embodiment of the present application, there is provided an apparatus for training an image processing model, the image processing model including a feature extraction and correlation calculation network, a block embedding network, and a density map prediction network, the apparatus including:
the characteristic diagram acquisition module is used for acquiring a relevant characteristic diagram corresponding to the sample image through the characteristic extraction and correlation calculation network, and the relevant characteristic diagram is used for representing the correlation between the sample object to be counted and the sample image;
the block embedding module is used for preprocessing the related characteristic diagram through the block embedding network to obtain an initial block characteristic diagram containing a plurality of characteristic blocks;
the density map prediction module is used for performing density map prediction according to the initial block feature map through the density map prediction network to obtain a predicted density map corresponding to the sample image; wherein the density map prediction comprises at least one block processing operation for restoring a spatial resolution corresponding to an input block feature map based on a learnable parameter, the predicted density map being used to determine a predicted number of the sample objects contained in the sample image;
a model training module to train the image processing model based on the predicted density map and an actual density map of the sample image for the sample object.
According to an aspect of embodiments of the present application, there is provided a computer device, the computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by the processor to implement the above-mentioned image processing method, or to implement the above-mentioned training method of an image processing model.
According to an aspect of embodiments of the present application, there is provided a computer-readable storage medium having at least one instruction, at least one program, a set of codes, or a set of instructions stored therein, which is loaded and executed by a processor to implement the above-mentioned image processing method, or to implement the above-mentioned training method of an image processing model.
According to an aspect of embodiments herein, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the image processing method or the training method of the image processing model.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
the method comprises the steps of preprocessing a related feature map of an image to obtain an initial block feature map containing a plurality of feature blocks, restoring the spatial resolution of the initial block feature map based on a block processing operation process, wherein the process of restoring the spatial resolution of the block feature map by the block processing operation is learnable, so that the final prediction error of a density map is reduced, and the prediction error of the number of objects in the image is further reduced.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic illustration of an implementation environment provided by one embodiment of the present application;
FIG. 2 is a flow chart of an image processing method provided by an embodiment of the present application;
FIG. 3 is a flowchart of the block processing operations provided by one embodiment of the present application;
FIG. 4 is a schematic diagram of block processing operations provided by one embodiment of the present application;
FIG. 5 is a flow chart of an image processing method provided by another embodiment of the present application;
FIG. 6 is a flowchart of an image processing method according to another embodiment of the present application;
FIG. 7 is a flow chart of a method for training an image processing model provided by an embodiment of the present application;
FIG. 8 is a flow chart for obtaining a correlation feature map provided by an embodiment of the present application;
FIG. 9 is a flow chart for obtaining a correlation feature map according to another embodiment of the present application;
FIG. 10 is a schematic diagram of an industrial AI defect counting system provided by one embodiment of the present application;
FIG. 11 is a schematic view of an agricultural insect pest visual AI counting system provided by one embodiment of the present application;
fig. 12 is a block diagram of an image processing apparatus according to an embodiment of the present application;
FIG. 13 is a block diagram of an apparatus for training an image processing model according to an embodiment of the present application;
FIG. 14 is a block diagram of a computer device provided by one embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of methods consistent with aspects of the present application, as recited in the appended claims.
Artificial Intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, giving machines the capabilities of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is the science of how to make machines "see": using cameras and computers instead of human eyes to identify, count, and measure targets, and to further process images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and mapping.
The embodiment of the application adopts a computer vision technology, and the number of the target objects contained in the target image is predicted through image processing, so that automatic counting is realized.
Referring to fig. 1, a schematic diagram of an implementation environment provided by an embodiment of the present application is shown. The implementation environment may be implemented as an image processing system. The system may include a model training device 10 and a model using device 20.
The model training device 10 may be an electronic device such as a computer, server, intelligent robot, or some other electronic device with greater computing power. The model training apparatus 10 is used to train the image processing model 40. In the embodiment of the present application, the image processing model 40 is a neural network model for automatic counting, and the model training apparatus 10 may train the image processing model 40 in a machine learning manner so as to have good performance.
The trained image processing model 40 described above may be deployed for use in the model-using device 20 to provide image processing results (i.e., auto-counting results). The model using device 20 may be a terminal device such as a PC (personal computer), a tablet computer, a smartphone, a wearable device, an intelligent robot, an intelligent voice interaction device, an intelligent household appliance, a vehicle-mounted terminal, an aircraft, a medical device, or a server, which is not limited in this application.
In some embodiments, as shown in FIG. 1, the image processing model 40 may include: a feature extraction and correlation computation network 11, a block embedding network 12, and a density map prediction network 13. The feature extraction and correlation calculation network 11 may be a neural network for extracting, from the target image, a correlation feature map between the labeled target object and the target image. The block embedding network 12 may also be a neural network, configured to preprocess the correlation feature map output by the feature extraction and correlation calculation network 11 to obtain an initial block feature map. The density map prediction network 13 is configured to perform density map prediction on the initial block feature map output by the block embedding network 12 to obtain a density map corresponding to the target image.
The embodiment of the application can be applied to various scenes, including but not limited to cloud technology, artificial intelligence, intelligent transportation, driving assistance and the like.
In the following, the technical solution of the present application will be described by several embodiments.
Referring to fig. 2, a flowchart of an image processing method according to an embodiment of the present application is shown. In the present embodiment, this method is exemplified by being applied to the model using apparatus 20 described above. The method comprises the following steps (201-204):
step 201, obtaining a relevant feature map corresponding to a target image.
The related feature map is used for representing the correlation between the target object to be counted and the target image. One or more types of objects may be included in the target image, and the target object to be counted may be one type of object among them. In some embodiments, one or more target objects in the target image are labeled (e.g., bounding box labeled) to indicate which type the target object to be counted belongs to.
For example, the target image may be an image of an industrial product, and the target object may be the industrial product in the target image, or may be some type of defect (e.g., crack, dent, bump) of the industrial product; if the target image is marked with a plurality of cracks, the target object to be counted is a crack defect of the industrial product, and other types of defects such as pit defects, bump defects and the like do not belong to the target object to be counted.
For another example, the target image may be an image related to agricultural product production, which type of pest is marked in the target image, the target object to be counted is the type of pest, and other types of pest do not belong to the target object to be counted.
For another example, the target image may be an image including two animals, i.e., a chicken group and a rabbit group, and if the target image is labeled with chickens, the target object to be counted is the chickens; if the rabbit is marked in the target image, the target object to be counted is the rabbit.
In some embodiments, the related feature map is used to characterize the correlation between the labeled target object and the target image, and the labeled target object is labeled in a manner of a bounding box. The extraction process of the relevant feature map can refer to the following formula one:
F = M(X, E)

where F denotes the correlation feature map, F ∈ ℝ^(h×w×c), with h, w, and c being the height, width, and number of channels of the correlation feature map, respectively; M denotes the feature extraction and correlation computation network; X denotes the target image input to the feature extraction and correlation computation network, X ∈ ℝ^(H×W×C), with H and W being the height and width resolutions of the target image and C being the number of color channels (e.g., 1, 3, etc.); and E denotes the labeled target object.
In some embodiments, instead of labeling the target object in the target image, the image corresponding to the target object and the target image may be subjected to feature extraction in parallel to obtain the relevant feature map. That is, what is input to the model may be a target image and an image of at least one target object. For example, if the number of rabbits in the target image needs to be determined, a picture of the rabbit and the target image may be input into the model together, so as to obtain a correlation feature map for characterizing the correlation between the picture of the rabbit and the target image.
In some embodiments, the feature extraction and correlation calculation network used to extract the relevant feature map may be based on the FamNet universal counting framework; it may be replaced by other network structures capable of extracting correlation features between a given exemplar target (i.e., the labeled target object) and the target image, which is not specifically limited in the embodiments of the present application. The feature extraction operation performed before obtaining the relevant feature map may be implemented by ResNet50, ResNeXt, PVT, and the like, which is likewise not specifically limited in the embodiments of the present application.
Step 202, preprocessing the related feature map to obtain an initial block feature map including a plurality of feature blocks.
In some embodiments, the correlation feature map is preprocessed, and the correlation feature map may be divided into a plurality of feature blocks, so as to obtain an initial block feature map including the plurality of feature blocks. Optionally, the spatial resolution of the relevant feature map is lower than the spatial resolution of the target image.
And step 203, performing density map prediction according to the initial block feature map to obtain a density map corresponding to the target image.
The density map prediction includes at least one block processing operation, that is, the density map prediction includes one or more steps of block processing operations (may be simply referred to as "block processing") for restoring the spatial resolution of the block feature map based on the learnable parameters. In the case where the density map prediction includes multiple block processing operations, the output of the previous block processing operation is the input to the next block processing operation. The spatial resolution of the relevant feature map is lower than that of the target image, and the spatial resolution can be gradually restored by performing the block processing operation step by step, until the feature map with the same spatial resolution as the target image is obtained, the block processing operation can be stopped. Based on the obtained feature map having the same spatial resolution as the target image, a density map can be obtained. The density map is a feature map for representing the density of the target object in the image, and the density map includes a plurality of density values. Alternatively, one pixel in the density map corresponds to one density value, and the density value is used to indicate the possibility that the target object exists at the corresponding pixel position.
In some embodiments, the density map is generated by an image processing model comprising:
the characteristic extraction and correlation calculation network is used for acquiring a correlation characteristic diagram corresponding to the target image;
the block embedded network is used for preprocessing the related characteristic graph to obtain an initial block characteristic graph;
and the density map prediction network is used for performing density map prediction according to the initial block feature map to obtain a density map corresponding to the target image.
Optionally, the block embedding network and the density map prediction network are jointly implemented as a density map predictor, wherein the feature map (i.e. the initial block feature map) output by the block embedding network is also referred to as block embedding.
Step 204, based on the density map, determines a predicted number of target objects contained in the target image.
In some embodiments, summing the density values in the density map may determine a predicted number of target objects contained in the target image.
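As an illustration of steps 201-204 end to end, the following minimal sketch wires the three networks of FIG. 1 together and reduces the predicted density map to a count by summation. The three submodules are opaque placeholders whose interfaces are assumptions, as are all names; this is not the application's actual code.

```python
import torch
import torch.nn as nn

class ImageProcessingModel(nn.Module):
    """Sketch of the pipeline: correlation features -> block embedding ->
    density map prediction -> count (steps 201-204)."""
    def __init__(self, feat_corr: nn.Module, block_embed: nn.Module,
                 density_pred: nn.Module):
        super().__init__()
        self.feat_corr = feat_corr        # network 11: image + boxes -> correlation map
        self.block_embed = block_embed    # network 12: correlation map -> initial blocks
        self.density_pred = density_pred  # network 13: blocks -> density map (N,1,H,W)

    def forward(self, image: torch.Tensor, boxes: torch.Tensor):
        f = self.feat_corr(image, boxes)      # step 201
        x1 = self.block_embed(f)              # step 202
        density = self.density_pred(x1)       # step 203
        count = density.sum(dim=(1, 2, 3))    # step 204: sum of density values
        return density, count
```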
In summary, according to the technical solution provided by the embodiments of the present application, an initial block feature map containing a plurality of feature blocks is obtained by preprocessing the relevant feature map of an image, and the spatial resolution of the initial block feature map is restored through block processing operations. Since the process by which the block processing operation restores the spatial resolution of the block feature map is learnable, the prediction error of the final density map is reduced, and the prediction error of the number of objects in the image is further reduced.
In addition, in the embodiments of the present application, the amount of labeling in the target image does not need to be large: with only a small number of labeled target objects in the target image (e.g., 3, 4, or 5), the predicted number of target objects can be obtained, thereby realizing generic object counting (also referred to as few-shot counting).
Referring to fig. 3, a flowchart illustrating block processing operations provided by an embodiment of the present application is shown. In the present embodiment, this method is exemplified by being applied to the model using apparatus 20 described above. The block processing operation may include the following steps (301-302):
step 301, performing deformation processing on the input block feature map to obtain a deformed block feature map.
In some embodiments, the warping process is used to reform the input block feature map, changing the position of some or all of the feature blocks.
In some embodiments, as shown in FIG. 4, this step 301 further includes the following substeps (1-4):
1. and recombining the input block feature maps to obtain a recombined block feature map.
In some embodiments, the product of the height and width of the reorganized block feature map is equal to the length of the input block feature map, and the number of channels remains the same. As shown in FIG. 4, the input block feature map 21 may be expressed as X_m ∈ ℝ^((h_m·w_m)×c_m), i.e., X_m has a length of h_m·w_m and c_m channels; the reorganized block feature map 22 may be expressed as X̂_m ∈ ℝ^(h_m×w_m×c_m), i.e., X̂_m has a height of h_m, a width of w_m, and c_m channels.
2. Dividing the reorganized block feature map into m feature subgraphs, where m is an integer greater than 1.
In some embodiments, the reorganized block feature map is divided into m feature subgraphs based on the channel dimension of the reorganized block feature map, and the number of channels of each feature subgraph is 1/m of the number of channels of the reorganized block feature map. Optionally, when m is 4, the feature subgraphs may be expressed as X̂_m^k ∈ ℝ^(h_m×w_m×(c_m/4)), k ∈ {1, 2, 3, 4}, and the calculation of the k-th feature subgraph X̂_m^k can refer to the following formula two:

X̂_m^k = X̂_m[:, :, (k-1)·c_m/4 : k·c_m/4]

Formula two uses a representation based on the tensor slice syntax of the PyTorch framework. Illustratively, when k is 1, the 1st feature subgraph is the feature map composed of all feature blocks of X̂_m from channel 0 to channel c_m/4 in the channel dimension; when k is 2, the 2nd feature subgraph is the feature map composed of all feature blocks of X̂_m from channel c_m/4 to channel c_m/2.
As shown in fig. 4, along the channel dimension, the reorganized block feature map 22 is evenly divided into 4 feature sub-maps 23; the number of channels of each feature sub-map 23 is 1/4 of the number of channels of the reorganized block feature map 22, i.e., c_m/4.
3. Interleaving and merging the feature blocks in the m feature subgraphs to generate a merged block feature map.
In some embodiments, interleaved merging splits and recombines the m feature subgraphs in units of feature blocks.
In some embodiments, interleaving and merging the feature blocks in the m feature subgraphs to generate the merged block feature map includes the following sub-steps (3.1-3.3):
3.1 traversing each position in the m feature subgraphs in a first order, and extracting the feature blocks at the same position from the m feature subgraphs each time to obtain m feature blocks;
3.2 recombining the m feature blocks in a second order to obtain a feature block group;
3.3 combining the feature block groups corresponding to the respective positions to obtain the merged block feature map.
Optionally, the calculation process of the interleaved merging can refer to the following formula three:

X̃_m[i, j, :] = X̂_m^k[⌊i/2⌋, ⌊j/2⌋, :], where k = (i % 2) + 2·(j % 2) + 1

Here, X̃_m ∈ ℝ^(2h_m×2w_m×(c_m/4)) denotes the merged block feature map 24 shown in fig. 4. Formula three uses a representation based on the tensor slice syntax of the PyTorch framework, % denotes the modulo operation, and ⌊·⌋ denotes rounding down. The meaning of formula three can be described by way of example as follows:
as shown in fig. 4, m is 4, and 4 feature blocks located at the upper left corner in the 4 feature subgraphs are extracted, where: feature blocks 25, feature blocks 26, feature blocks 27, and feature blocks 28, in a second order, feature blocks 25 from the 1 st feature sub-graph are placed in the 1 st row and 1 st column of the feature block group, feature blocks 26 from the 2 nd feature sub-graph are placed in the 2 nd row and 1 st column of the feature block group, feature blocks 27 from the 3 rd feature sub-graph are placed in the 1 st row and 2 nd column of the feature block group, and feature blocks 28 from the 4 th feature sub-graph are placed in the 2 nd row and 2 nd column of the feature block group 29, thereby obtaining a feature block group 29. Since the feature block group 29 is composed of the feature blocks in the upper left corner of the feature sub-graph, the feature block group 29 is located in the upper left corner of the merged block feature graph 24.
It can be seen that the number of channels of the combined block feature graph is the same as that of the feature subgraph, and is m times. As shown in fig. 4, m is 4, and the number of channels in the combined block feature map 29 and the number of channels in each feature sub-map 23 are 1/4 of the number of channels in the re-combined block feature map 22; the width of the merged block feature map 24 is 2 times the width of the reorganized block feature map 22; the height of the merged block feature map 24 is 2 times the height of the reorganized block feature map 22.
4. Expanding the merged block feature map along the width and height to obtain a deformed block feature map.
As shown in fig. 4, the merged block feature map 24 is expanded along the width and height to obtain the deformed block feature map 30. The number of channels of the deformed block feature map 30 is the same as the number of channels of the merged block feature map 24, and the length of the deformed block feature map 30, 4h_m·w_m, is the product of the width 2w_m and the height 2h_m of the merged block feature map 24. The deformed block feature map 30 may be expressed as X′_m ∈ ℝ^((4h_m·w_m)×(c_m/4)).
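As an illustration, the following is a direct, loop-based rendering of formula three and sub-steps 3.1-3.3 with m = 4, on channels-last tensors. This is an illustrative reconstruction, not code from the application; the index mapping follows the example above, and a vectorized equivalent appears in a later sketch.

```python
import torch

def interleave_merge(subs: list[torch.Tensor]) -> torch.Tensor:
    """Interleaved merging (formula three) for m = 4 feature subgraphs.

    Each subgraph has shape (h, w, c4). The blocks at the same position of
    the 4 subgraphs form one 2x2 feature block group: subgraph 1 -> (1,1),
    subgraph 2 -> (2,1), subgraph 3 -> (1,2), subgraph 4 -> (2,2).
    """
    h, w, c4 = subs[0].shape
    merged = torch.empty(2 * h, 2 * w, c4, dtype=subs[0].dtype)
    for i in range(2 * h):
        for j in range(2 * w):
            k = (i % 2) + 2 * (j % 2)             # 0-based subgraph index
            merged[i, j] = subs[k][i // 2, j // 2]
    return merged

subs = [torch.randn(8, 8, 16) for _ in range(4)]
print(interleave_merge(subs).shape)  # torch.Size([16, 16, 16]): h, w doubled
```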
Step 302, mapping the deformed block feature map to obtain a mapped block feature map.
The spatial resolution corresponding to the mapped block feature map is greater than the spatial resolution corresponding to the input block feature map, and the mapped block feature map is used for generating a density map. Since the length of the mapped block feature map is greater than that of the input block feature map, the spatial resolution of the mapped block feature map in its corresponding three-dimensional shape is greater than that of the input block feature map in its corresponding three-dimensional shape.
In some embodiments, the deformed block feature map is mapped based on the learnable parameters in the first mapping layer to obtain the mapped block feature map, and the number of channels of the mapped block feature map is greater than the number of channels of the deformed block feature map.
In some embodiments, as shown in FIG. 4, the deformed block feature map 30 (i.e., X′_m) is input into the linear mapping layer Linear_1(·), thereby generating a mapped block feature map X_(m+1) = Linear_1(X′_m), where X_(m+1) ∈ ℝ^((4h_m·w_m)×c_(m+1)). It can be seen that, after the block processing operation, the height, width, and number of channels of the block feature map in its corresponding three-dimensional shape change as follows:

(h_m, w_m, c_m) → (2h_m, 2w_m, c_(m+1))
the first mapping layer is a network with adjustable parameters (i.e., learning), so that the flip block combination provided by the embodiment of the application can restore the spatial resolution of the input feature map corresponding to the three-dimensional shape in a learnable manner, thereby improving the image processing capability of the whole model and further reducing the prediction error of the number of objects in the image.
Referring to fig. 5, a flowchart of an image processing method according to another embodiment of the present application is shown. In the present embodiment, this method is exemplified by being applied to the model using apparatus 20 described above. The method comprises the following steps (501-507):
step 501, acquiring a relevant characteristic diagram of a target image to be processed.
The content of step 501 is the same as or similar to the content of step 201 in the embodiment of fig. 2, and is not described herein again.
Step 502, performing block division on the relevant feature map to obtain a plurality of pixel blocks.
In some embodiments, a single pixel block is set to a size (patch size) of p, i.e., the spatial resolution of the pixel block is p × p, and the pixels in the correlation feature map are divided based on the size of the pixel block.
Step 503, flattening the plurality of pixel blocks respectively to obtain a plurality of flattened pixel blocks.
Optionally, the plurality of flattened pixel blocks may be represented as F_p ∈ ℝ^((h·w/p²)×(p²·c)), where h·w/p² is the number of flattened pixel blocks and p²·c is the length of each flattened pixel block. For example, if the spatial resolution of the correlation feature map is 100 × 100 (i.e., both the width and the height are 100) and the size p of each pixel block is 10 (i.e., the spatial resolution of each pixel block is 10 × 10), then the number of flattened pixel blocks is (100 × 100)/10² = 100 and the length of each flattened pixel block is 10² × c = 100c. The 100 flattened pixel blocks are represented as F_p ∈ ℝ^(100×100c).
Step 504, based on the learnable parameters in the second mapping layer, mapping the flattened pixel blocks to obtain an initial block feature map.
In some embodiments, the plurality of flattened pixel blocks are input into the second linear mapping layer Linear_2(·) to generate the block embedding (i.e., the initial block feature map) X_1 = Linear_2(F_p), where X_1 ∈ ℝ^((h·w/p²)×c_1) and c_1 denotes the dimension of the initial block feature map. The dimension c_1 of the initial block feature map may be determined by the parameter settings of the second mapping layer. Optionally, the dimension c_1 of the initial block feature map is smaller than the dimension c of the relevant feature map; that is, the second mapping layer can reduce the number of channels of the feature map, so that less data needs to be processed in the subsequent density map prediction, saving computing resources and reducing the performance pressure on the computer device. Alternatively, the dimension c_1 of the initial block feature map can be larger than the dimension c of the relevant feature map; that is, the second mapping layer can increase the number of channels of the feature map, thereby enhancing the feature expression capability of the model and reducing the prediction error of the number of objects in the image.
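Steps 502-504 correspond to a standard patch-embedding operation. The sketch below assumes the relevant feature map arrives as an (N, c, H, W) tensor; torch.nn.Unfold performs the block division and flattening in one call, and the linear layer plays the role of the learnable second mapping layer. The class name and the Unfold-based formulation are assumptions.

```python
import torch
import torch.nn as nn

class BlockEmbedding(nn.Module):
    """Divide the relevant feature map into p x p pixel blocks, flatten
    each block, and linearly map it to dimension c1 (steps 502-504)."""
    def __init__(self, c: int, p: int, c1: int):
        super().__init__()
        self.unfold = nn.Unfold(kernel_size=p, stride=p)  # block division + flatten
        self.linear = nn.Linear(p * p * c, c1)            # learnable Linear_2

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (N, c, H, W) -> unfold: (N, p*p*c, H*W/p^2) -> (N, H*W/p^2, p*p*c)
        blocks = self.unfold(f).transpose(1, 2)
        return self.linear(blocks)  # (N, H*W/p^2, c1): initial block feature map

# Usage mirroring the example above: a 100x100 map with p = 10 yields
# 100 flattened blocks of length 100*c, each mapped to dimension c1.
embed = BlockEmbedding(c=64, p=10, c1=128)
print(embed(torch.randn(1, 64, 100, 100)).shape)  # torch.Size([1, 100, 128])
```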
Step 505, performing density map prediction according to the initial block feature map to obtain a final block feature map.
In some embodiments, the spatial resolution of the feature map is restored through the density map prediction process. When the resolution of the feature map has been restored to the spatial resolution of the target image, the feature map is the final block feature map; if the number of channels of the final block feature map is still greater than 1 at this point, it cannot yet be called the required density map (that is, restoring the spatial resolution alone does not yet yield the density map).
Step 506, performing convolution on the final block feature map to obtain the density map.
Optionally, if the number of channels of the final block feature map is still greater than 1, a 1 × 1 convolution is performed on the final block feature map, so as to generate a density map with 1 channel while maintaining the spatial resolution of the final block feature map, so that the predicted number of target objects contained in the target image can be determined based on the density map.
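A brief sketch of step 506, assuming the final block feature map arrives as a length-H·W sequence with c channels: it is reshaped back to spatial form and reduced to a single channel by a 1 × 1 convolution. The shapes and names here are assumptions.

```python
import torch
import torch.nn as nn

c, H, W = 64, 128, 128
to_density = nn.Conv2d(c, 1, kernel_size=1)  # 1x1 conv: c channels -> 1 channel

final_blocks = torch.randn(1, H * W, c)                  # (N, H*W, c)
fmap = final_blocks.transpose(1, 2).reshape(1, c, H, W)  # back to spatial form
density = to_density(fmap)                               # (N, 1, H, W)
print(density.shape, density.sum().item())               # count = sum of densities
```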
In some possible implementations, the density map prediction further includes at least one encoding/decoding process, where the encoding/decoding process is used to perform a feature extraction process on the block feature map obtained by the block processing operation.
In some embodiments, the codec processing includes: and adopting an attention-based coding and decoding network to carry out attention mining on the block characteristic diagram obtained by the block processing operation to obtain the adjusted block characteristic diagram. Wherein the adjusted block feature map is used to generate a density map. Optionally, the codec network based on the attention mechanism is a window-based transform block, and self-attention mining and learning can be performed based on the transform block, so that the prediction error of the density map is reduced, and the prediction error of the number of objects in the image is reduced.
In some embodiments, as shown in fig. 6, the density map prediction includes k block processing operations 32 and k coding/decoding processes 33 executed in an interleaved manner, with the i-th block processing operation followed by the i-th coding/decoding process 33, where k is an integer greater than 1 and i is a positive integer less than or equal to k. Optionally, the input to the block processing operation 32 of step i is the output of the coding/decoding process 33 of step i-1, and the input to the first block processing operation 32 is the initial block feature map. Optionally, one block processing operation 32 followed by one coding/decoding process 33 constitutes a stage, and the final block feature map is obtained step by step through k stages. Based on the above description, the block processing operation involved in the embodiments of the present application is in effect a reverse-order block merging process; therefore, the block processing operation may also be referred to as flip block merging, and the module performing it as a flip block merging module.
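The alternating structure of FIG. 6 can be sketched as follows, reusing the FlipBlockMerging sketch above and standing in for each coding/decoding step with a stock nn.TransformerEncoderLayer rather than the window-based Transformer block the application describes; the halving channel schedule is likewise an assumption.

```python
import torch.nn as nn

class DensityDecoder(nn.Module):
    """k stages; each stage = flip block merging + attention codec (FIG. 6)."""
    def __init__(self, c1: int, k: int):
        super().__init__()
        self.stages = nn.ModuleList()
        c = c1
        for _ in range(k):
            merge = FlipBlockMerging(c=c, c_out=c // 2)  # from the sketch above
            codec = nn.TransformerEncoderLayer(d_model=c // 2, nhead=4,
                                               batch_first=True)
            self.stages.append(nn.ModuleList([merge, codec]))
            c //= 2

    def forward(self, x, h, w):
        for merge, codec in self.stages:
            x = merge(x, h, w)   # step i: restore resolution (learnable)
            x = codec(x)         # step i: attention-based feature extraction
            h, w = 2 * h, 2 * w
        return x, h, w           # final block feature map and its resolution
```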
In step 507, the predicted number of the target objects contained in the target image is determined based on the density map.
The content related to step 507 may refer to the content of step 204 in the embodiment of fig. 2, and is not described herein again.
In summary, according to the technical solution provided by the embodiments of the present application, the feature map is divided into blocks and flattened, so that the subsequent density map prediction can perform block processing operations with a pixel block (i.e., a block containing a plurality of pixels) as the minimum unit, thereby reducing the amount of computation required for flip block merging and saving the processing resources required for image processing.
Referring to fig. 7, a flowchart of a training method of an image processing model according to an embodiment of the present application is shown. In the present embodiment, this method is exemplified by being applied to the model training device 10 described above. The image processing model comprises a feature extraction and correlation calculation network, a block embedding network and a density map prediction network, and the method comprises the following steps (701-704):
step 701, a relevant feature map corresponding to the sample image is obtained through feature extraction and a relevance calculation network, and the relevant feature map is used for representing the relevance between the sample object to be counted and the sample image.
In some embodiments, only a sample image is input into the feature extraction and correlation calculation network, the sample image including at least one sample object labeled with a bounding box. In this case, as shown in fig. 8, the feature extraction and correlation calculation network includes a feature extraction network 62 and a correlation calculation module 67, and the flow for acquiring the correlation feature map is as follows: the sample image 61 containing bounding box labels of the sample object is input into the feature extraction network 62 to obtain the image features 63 of the sample image; the features 65 of the labeled sample object are cropped out of the image features 63 based on the labeled bounding boxes; and the correlation calculation module 67 performs correlation calculation on the image features 63 of the sample image and the features 65 of the sample object to obtain the correlation feature map 66.
In some embodiments, the input to the feature extraction and correlation calculation network includes the sample image and an image of at least one sample object. In this case, as shown in fig. 9, the feature extraction and correlation calculation network includes a feature extraction network 62 and a correlation calculation module 67, and the flow for acquiring the correlation feature map is as follows: the sample image 68 and the image 70 of the sample object are respectively input into the feature extraction network 62 to obtain the image features 69 of the sample image and the features 71 of the sample object, and the correlation calculation module 67 performs correlation calculation on the image features 69 and the features 71 to obtain the correlation feature map 72.
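The two input configurations of FIGS. 8-9 share the same correlation step. A minimal sketch under FamNet-style assumptions is given below: each exemplar's feature is pooled to a fixed-size kernel and slid over the image feature map as a convolution. The pooling size and normalization are assumptions, not details stated in the application.

```python
import torch
import torch.nn.functional as F

def correlation_feature_map(img_feat: torch.Tensor,
                            exemplar_feats: list[torch.Tensor]) -> torch.Tensor:
    """Correlate image features with each labeled exemplar's features.

    img_feat:       (1, c, h, w) image feature map from the backbone.
    exemplar_feats: features cropped from img_feat via bounding boxes (FIG. 8)
                    or extracted from exemplar images (FIG. 9), each (1, c, hb, wb).
    Returns a correlation map of shape (1, len(exemplar_feats), h, w).
    """
    maps = []
    for ex in exemplar_feats:
        kernel = F.adaptive_avg_pool2d(ex, (3, 3))    # pool to a fixed kernel
        corr = F.conv2d(img_feat, kernel, padding=1)  # slide exemplar over image
        maps.append(corr / kernel.numel())            # simple normalization
    return torch.cat(maps, dim=1)

img_feat = torch.randn(1, 256, 64, 64)
exemplars = [torch.randn(1, 256, 7, 5) for _ in range(3)]   # 3 labeled boxes
print(correlation_feature_map(img_feat, exemplars).shape)   # (1, 3, 64, 64)
```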
Step 702, preprocessing the related feature map through the block embedded network to obtain an initial block feature map including a plurality of feature blocks.
Step 703, performing density map prediction according to the initial block feature map through the density map prediction network to obtain a predicted density map corresponding to the sample image.
Wherein the density map prediction comprises at least one block processing operation for restoring the spatial resolution of the block feature map based on the learnable parameters. Optionally, the predicted density map is used to determine a predicted number of sample objects contained in the sample image.
In some embodiments, the block processing operation includes: performing deformation processing on the input block feature map to obtain a deformed block feature map; and mapping the deformed block feature map to obtain a mapped block feature map. The spatial resolution corresponding to the mapped block feature map (i.e., its spatial resolution in the corresponding three-dimensional shape) is greater than the spatial resolution corresponding to the input block feature map (i.e., its spatial resolution in the corresponding three-dimensional shape), and the mapped block feature map is used to generate the predicted density map.
In some embodiments, performing deformation processing on the input block feature map to obtain the deformed block feature map includes:
1. recombining the input block feature maps to obtain recombined block feature maps, wherein the product of the height and the width of the recombined block feature maps is equal to the length of the input block feature maps;
2. dividing the recombined block feature graph into m feature subgraphs, wherein m is an integer larger than 1;
3. carrying out staggered combination on the feature blocks in the m feature sub-images to generate a combined block feature image, wherein the staggered combination is used for splitting and recombining the m feature sub-images by taking the feature blocks as units;
4. and expanding the combined block feature map based on the width and the height to obtain a deformed block feature map.
For the related introduction of steps 701 to 703, reference may be made to the above contents, which are not described herein again.
Step 704, training an image processing model based on the predicted density map and the actual density map of the sample image for the sample object.
In some embodiments, training the image processing model may employ MSE (Mean Square Error) loss or other types of regression loss.
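A hedged sketch of one training iteration for step 704, regressing the predicted density map to the ground-truth density map with MSE loss; the model's call signature and the choice of optimizer are assumptions.

```python
import torch
import torch.nn as nn

def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               sample_image: torch.Tensor, boxes: torch.Tensor,
               gt_density: torch.Tensor) -> float:
    """One training step: predict a density map and regress it toward the
    actual density map with MSE loss (step 704)."""
    model.train()
    pred_density = model(sample_image, boxes)   # (N, 1, H, W), assumed interface
    loss = nn.functional.mse_loss(pred_density, gt_density)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```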
In summary, according to the technical solution provided by the embodiments of the present application, an initial block feature map containing a plurality of feature blocks is obtained by preprocessing the relevant feature map of an image, and the spatial resolution of the initial block feature map is restored through block processing operations. Since this restoration process is learnable, the prediction error of the predicted density map is reduced, and the prediction error of the number of objects in the image is further reduced.
The image processing model provided by the embodiment of the application can be suitable for counting the target objects in various small sample scenes.
As shown in fig. 10, in an industrial AI quality inspection system, when a trained defect counting system 35 is required to process newly appearing defect types unseen by the model, the defect counting system 35 can count the new class of defects at very low cost, given only a very small number of bounding box labels for the new defect class.
As shown in fig. 11, in the agricultural pest visual AI counting system 36, a very large variety of pest types may be present. In this application scenario, an image processing model (i.e., a universal counting model) can be trained on some of the pest types. In the testing/application stage, for a newly appearing pest type, counting of the new type of pest can be realized without retraining the model, given only a few target bounding boxes of that pest type.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 12, a block diagram of an image processing apparatus according to an embodiment of the present application is shown. The apparatus has the function of implementing the above examples of the image processing method. The apparatus 1200 may be the model using device described above, or may be disposed on the model using device. The apparatus 1200 may include:
a feature map obtaining module 1210, configured to obtain a relevant feature map corresponding to a target image, where the relevant feature map is used to characterize a correlation between a target object to be counted and the target image;
a block embedding module 1220, configured to pre-process the relevant feature map to obtain an initial block feature map including a plurality of feature blocks;
a density map prediction module 1230, configured to perform density map prediction according to the initial block feature map to obtain a density map corresponding to the target image; wherein the density map prediction comprises at least one block processing operation for restoring a spatial resolution corresponding to the input block feature map based on a learnable parameter;
a number determination module 1240 for determining the predicted number of the target objects contained in the target image based on the density map.
In an exemplary embodiment, the density map prediction module 1230 is configured to:
carrying out deformation processing on the input block feature diagram to obtain a deformed block feature diagram;
mapping the deformed block feature map to obtain a mapped block feature map;
and the spatial resolution corresponding to the mapped block feature map is greater than the spatial resolution corresponding to the input block feature map, and the mapped block feature map is used for generating the density map.
In an exemplary embodiment, the density map prediction module 1230 is configured to:
recombining the input block feature map to obtain a recombined block feature map, wherein the product of the height and the width of the recombined block feature map is equal to the length of the input block feature map;
dividing the recombined block feature graph into m feature subgraphs, wherein m is an integer larger than 1;
carrying out staggered combination on the feature blocks in the m feature sub-images to generate a combined block feature image, wherein the staggered combination is used for splitting and recombining the m feature sub-images by taking the feature blocks as units;
and expanding the combined block feature map based on width and height to obtain the deformed block feature map.
In an exemplary embodiment, the density map prediction module 1230 is configured to:
dividing the recombined block feature map into the m feature subgraphs based on the channel dimension of the recombined block feature map;
and the number of channels of each feature subgraph is 1/m of the number of channels of the recombined block feature map.
In an exemplary embodiment, the density map prediction module 1230 is configured to:
traversing each position in the m characteristic subgraphs according to a first sequence, and extracting characteristic blocks at the same position from the m characteristic subgraphs each time to obtain m characteristic blocks;
recombining the m feature blocks according to a second sequence to obtain a feature block group;
and combining the feature block groups corresponding to the positions respectively to obtain the combined block feature diagram.
In an exemplary embodiment, the density map prediction module 1230 is configured to:
mapping the deformed block feature map based on learnable parameters in a first mapping layer to obtain the mapped block feature map; and the number of channels of the mapped block feature map is greater than that of the deformed block feature map.
In an exemplary embodiment, the density map prediction further includes at least one encoding/decoding process, where the encoding/decoding process is configured to perform a feature extraction process on the block feature map obtained by the block processing operation.
In an exemplary embodiment, the codec processing includes: adopting an attention-based coding and decoding network to carry out attention mining on the block feature graph obtained by the block processing operation to obtain an adjusted block feature graph; wherein the adjusted block feature map is used to generate the density map.
In an exemplary embodiment, the density map prediction includes a k-step block processing operation and a k-step coding and decoding process performed in an interleaved manner, and the i-step coding and decoding process is performed after the i-step block processing operation, k is an integer greater than 1, and i is a positive integer less than or equal to k.
In the exemplary embodiment, said density map prediction module 1230 is configured to:
performing density map prediction according to the initial block feature map to obtain a final block feature map;
and performing convolution on the final block feature map to obtain the density map.
In an exemplary embodiment, the feature map obtaining module 1210 is configured to:
dividing the correlation feature map into blocks to obtain a plurality of pixel blocks;
flattening each of the pixel blocks to obtain a plurality of flattened pixel blocks;
and mapping the flattened pixel blocks based on learnable parameters in a second mapping layer to obtain the initial block feature map.
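For illustration only, this preprocessing can be sketched with nn.Unfold, which performs the block division and flattening in one step, followed by a linear layer standing in for the second mapping layer; the patch size and channel counts are assumptions.

    import torch
    import torch.nn as nn

    corr_map = torch.randn(2, 3, 256, 256)               # correlation feature map (B, C, H, W)
    patch = 16
    unfold = nn.Unfold(kernel_size=patch, stride=patch)  # block division + flattening
    blocks = unfold(corr_map).transpose(1, 2)            # (B, N, C * patch * patch)
    second_mapping = nn.Linear(3 * patch * patch, 64)    # learnable parameters
    initial_blocks = second_mapping(blocks)              # initial block feature map (B, N, 64)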
In an exemplary embodiment, the density map is generated by an image processing model comprising:
a feature extraction and correlation calculation network, configured to acquire the correlation feature map corresponding to the target image;
a block embedding network, configured to preprocess the correlation feature map to obtain the initial block feature map;
and a density map prediction network, configured to perform density map prediction according to the initial block feature map to obtain the density map corresponding to the target image.
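The three networks compose into a single forward pass. The sketch below only fixes the data flow; the sub-network internals and interfaces are assumptions.

    import torch.nn as nn

    class ImageProcessingModel(nn.Module):
        # Hypothetical composition of the three sub-networks.
        def __init__(self, corr_net, block_embed_net, density_net):
            super().__init__()
            self.corr_net = corr_net                # feature extraction and correlation calculation
            self.block_embed_net = block_embed_net  # block embedding (preprocessing)
            self.density_net = density_net          # density map prediction

        def forward(self, target_image):
            corr_map = self.corr_net(target_image)   # correlation feature map
            blocks = self.block_embed_net(corr_map)  # initial block feature map
            return self.density_net(blocks)          # density map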
In summary, in the technical solution provided by the embodiments of the present application, an initial block feature map containing a plurality of feature blocks is obtained by preprocessing the correlation feature map of an image, and the spatial resolution corresponding to the initial block feature map is restored through the block processing operation.
Referring to fig. 13, a block diagram of an apparatus for training an image processing model according to an embodiment of the present application is shown. The apparatus implements the training method of the image processing model described above. The apparatus 1300 may be the model training device described above, or may be disposed on the model training device. The image processing model includes a feature extraction and correlation calculation network, a block embedding network, and a density map prediction network, and the apparatus 1300 may include:
a feature map obtaining module 1310, configured to obtain a correlation feature map corresponding to a sample image through the feature extraction and correlation calculation network, where the correlation feature map is used to characterize the correlation between a sample object to be counted and the sample image;
a block embedding module 1320, configured to preprocess the correlation feature map to obtain an initial block feature map containing a plurality of feature blocks;
a density map prediction module 1330 configured to perform density map prediction according to the initial block feature map to obtain a predicted density map corresponding to the sample image; wherein the density map prediction comprises at least one block processing operation for restoring a spatial resolution corresponding to an input block feature map based on a learnable parameter, the predicted density map being used to determine a predicted number of the sample objects contained in the sample image;
a model training module 1340 for training the image processing model based on the predicted density map and the actual density map of the sample image for the sample object.
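A sketch of one training step under these modules, with the loss choice (pixel-wise MSE between the predicted and actual density maps) and all interfaces assumed rather than taken from the disclosure:

    import torch.nn.functional as F

    def train_step(model, optimizer, sample_image, gt_density):
        # Predict a density map for the sample image and match it against
        # the actual (ground-truth) density map for the sample object.
        pred_density = model(sample_image)
        loss = F.mse_loss(pred_density, gt_density)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()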
In an exemplary embodiment, the density map prediction module 1330 is configured to:
performing deformation processing on the input block feature map to obtain a deformed block feature map;
mapping the deformed block feature map to obtain a mapped block feature map;
wherein the spatial resolution of the mapped block feature map is greater than that of the input block feature map, and the mapped block feature map is used to generate the predicted density map.
In an exemplary embodiment, the density map prediction module 1330 is configured to:
reorganizing the input block feature map to obtain a reorganized block feature map, wherein the product of the height and the width of the reorganized block feature map equals the length of the input block feature map;
dividing the reorganized block feature map into m feature sub-maps, where m is an integer greater than 1;
interleaving and merging the feature blocks in the m feature sub-maps to generate a merged block feature map, where the interleaved merging splits and recombines the m feature sub-maps in units of feature blocks;
and expanding the merged block feature map along width and height to obtain the deformed block feature map.
In summary, in the technical solution provided by the embodiments of the present application, an initial block feature map containing a plurality of feature blocks is obtained by preprocessing the correlation feature map of an image, and the spatial resolution corresponding to the initial block feature map is restored through the block processing operation.
It should be noted that, when the apparatus provided in the foregoing embodiments implements its functions, the division into the above functional modules is merely illustrative. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; for their specific implementation, refer to the method embodiments, which are not repeated here.
Referring to fig. 14, a block diagram of a computer device according to an embodiment of the present application is shown. The computer device is configured to implement the image processing method or the training method of the image processing model provided in the above embodiments. Specifically:
the computer apparatus 1400 includes a CPU (Central Processing Unit) 1401, a system Memory 1404 including a RAM (Random Access Memory) 1402 and a ROM (Read-Only Memory) 1403, and a system bus 1405 connecting the system Memory 1404 and the Central Processing Unit 1401. The computer device 1400 also includes a basic I/O (Input/Output) system 1406 that facilitates transfer of information between devices within the computer, and a mass storage device 1407 for storing an operating system 1413, application programs 1414, and other program modules 1415.
The basic input/output system 1406 includes a display 1408 for displaying information and an input device 1409, such as a mouse or a keyboard, for a user to input information. The display 1408 and the input device 1409 are both connected to the central processing unit 1401 through an input/output controller 1410 connected to the system bus 1405. The basic input/output system 1406 may further include the input/output controller 1410 for receiving and processing input from a plurality of other devices such as a keyboard, a mouse, or an electronic stylus. Similarly, the input/output controller 1410 also provides output to a display screen, a printer, or another type of output device.
The mass storage device 1407 is connected to the central processing unit 1401 via a mass storage controller (not shown) that is connected to the system bus 1405. The mass storage device 1407 and its associated computer-readable media provide non-volatile storage for the computer device 1400. That is, the mass storage device 1407 may include a computer readable medium (not shown) such as a hard disk or CD-ROM (Compact disk Read-Only Memory) drive.
Without loss of generality, the computer-readable media may include computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other solid-state memory, CD-ROM, DVD (Digital Video Disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage media are not limited to the foregoing. The system memory 1404 and the mass storage device 1407 described above may be collectively referred to as memory.
According to various embodiments of the present application, the computer device 1400 may also operate by being connected, through a network such as the Internet, to a remote computer on the network. That is, the computer device 1400 may be connected to the network 1412 through the network interface unit 1411 connected to the system bus 1405, or may be connected to another type of network or a remote computer system (not shown) using the network interface unit 1411.
In an exemplary embodiment, there is further provided a computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which when executed by a processor, implements the above-mentioned image processing method or training method of an image processing model.
Optionally, the computer-readable storage medium may include: a ROM (Read-Only Memory), a RAM (Random Access Memory), an SSD (Solid State Drive), an optical disc, or the like. The random access memory may include a ReRAM (Resistive Random Access Memory) and a DRAM (Dynamic Random Access Memory).
In an exemplary embodiment, a computer program product is also provided that includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the image processing method or the training method of the image processing model described above.
It should be understood that the specific implementations of the present application involve related data such as object information. When the above embodiments of the present application are applied to specific products or technologies, the individual permission or consent of the subjects needs to be obtained, and the collection, use, and processing of the related data need to comply with the relevant laws, regulations, and standards of the relevant countries and regions.
It should be understood that reference to "a plurality" herein means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate that: A exists alone, both A and B exist, or B exists alone. The character "/" generally indicates an "or" relationship between the associated objects before and after it.
The above description is only exemplary of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (20)

1. An image processing method, characterized in that the method comprises:
acquiring a correlation feature map corresponding to a target image, wherein the correlation feature map is used to characterize the correlation between a target object to be counted and the target image;
preprocessing the correlation feature map to obtain an initial block feature map comprising a plurality of feature blocks;
performing density map prediction according to the initial block feature map to obtain a density map corresponding to the target image; wherein the density map prediction comprises at least one block processing operation for restoring a spatial resolution corresponding to an input block feature map based on a learnable parameter;
determining a predicted number of the target objects contained in the target image based on the density map.
2. The method of claim 1, wherein the block processing operation comprises:
performing deformation processing on the input block feature map to obtain a deformed block feature map;
mapping the deformed block feature map to obtain a mapped block feature map;
wherein the spatial resolution of the mapped block feature map is greater than that of the input block feature map, and the mapped block feature map is used to generate the density map.
3. The method according to claim 2, wherein the performing deformation processing on the input block feature map to obtain a deformed block feature map comprises:
reorganizing the input block feature map to obtain a reorganized block feature map, wherein the product of the height and the width of the reorganized block feature map equals the length of the input block feature map;
dividing the reorganized block feature map into m feature sub-maps, where m is an integer greater than 1;
interleaving and merging the feature blocks in the m feature sub-maps to generate a merged block feature map, where the interleaved merging splits and recombines the m feature sub-maps in units of feature blocks;
and expanding the merged block feature map along width and height to obtain the deformed block feature map.
4. The method according to claim 3, wherein the dividing the reorganized block feature map into m feature sub-maps comprises:
dividing the reorganized block feature map into the m feature sub-maps along the channel dimension of the reorganized block feature map;
wherein the number of channels of the reorganized block feature map is m times the number of channels of each feature sub-map.
5. The method according to claim 3, wherein the interleaving and merging the feature blocks in the m feature sub-maps to generate a merged block feature map comprises:
traversing the positions in the m feature sub-maps in a first order, and extracting the feature blocks at the same position from the m feature sub-maps each time to obtain m feature blocks;
recombining the m feature blocks in a second order to obtain a feature block group;
and combining the feature block groups corresponding to the respective positions to obtain the merged block feature map.
6. The method according to claim 2, wherein the mapping the deformed block feature map to obtain a mapped block feature map comprises:
mapping the deformed block feature map based on learnable parameters in a first mapping layer to obtain the mapped block feature map;
wherein the number of channels of the mapped block feature map is greater than the number of channels of the deformed block feature map.
7. The method according to claim 1, wherein the density map prediction further comprises at least one codec process, the codec process being used to perform feature extraction on the block feature map obtained by the block processing operation.
8. The method according to claim 7, wherein the codec process comprises:
performing attention mining on the block feature map obtained by the block processing operation using an attention-based encoder-decoder network to obtain an adjusted block feature map;
wherein the adjusted block feature map is used to generate the density map.
9. The method according to claim 7, wherein the density map prediction comprises k block processing operations and k codec processes performed in an interleaved manner, the i-th codec process being performed after the i-th block processing operation, where k is an integer greater than 1 and i is a positive integer less than or equal to k.
10. The method according to claim 1, wherein the performing density map prediction according to the initial block feature map to obtain a density map corresponding to the target image comprises:
performing density map prediction according to the initial block feature map to obtain a final block feature map;
and performing convolution on the final block feature map to obtain the density map.
11. The method according to claim 1, wherein the preprocessing the correlation feature map to obtain an initial block feature map comprising a plurality of feature blocks comprises:
dividing the correlation feature map into blocks to obtain a plurality of pixel blocks;
flattening each of the pixel blocks to obtain a plurality of flattened pixel blocks;
and mapping the flattened pixel blocks based on learnable parameters in a second mapping layer to obtain the initial block feature map.
12. The method according to any one of claims 1 to 11, wherein the density map is generated by an image processing model, the image processing model comprising:
a feature extraction and correlation calculation network, configured to acquire the correlation feature map corresponding to the target image;
a block embedding network, configured to preprocess the correlation feature map to obtain the initial block feature map;
and a density map prediction network, configured to perform density map prediction according to the initial block feature map to obtain the density map corresponding to the target image.
13. A method for training an image processing model, the image processing model comprising a feature extraction and correlation calculation network, a block embedding network, and a density map prediction network, the method comprising:
acquiring a correlation feature map corresponding to a sample image through the feature extraction and correlation calculation network, wherein the correlation feature map is used to characterize the correlation between a sample object to be counted and the sample image;
preprocessing the correlation feature map through the block embedding network to obtain an initial block feature map comprising a plurality of feature blocks;
performing density map prediction according to the initial block feature map through the density map prediction network to obtain a predicted density map corresponding to the sample image; wherein the density map prediction comprises at least one block processing operation for restoring a spatial resolution corresponding to an input block feature map based on a learnable parameter, the predicted density map being used to determine a predicted number of the sample objects contained in the sample image;
training the image processing model based on the predicted density map and an actual density map of the sample image for the sample object.
14. The method of claim 13, wherein the block processing operation comprises:
performing deformation processing on the input block feature map to obtain a deformed block feature map;
mapping the deformed block feature map to obtain a mapped block feature map;
wherein the spatial resolution of the mapped block feature map is greater than that of the input block feature map, and the mapped block feature map is used to generate the predicted density map.
15. The method according to claim 14, wherein the performing deformation processing on the input block feature map to obtain a deformed block feature map comprises:
reorganizing the input block feature map to obtain a reorganized block feature map, wherein the product of the height and the width of the reorganized block feature map equals the length of the input block feature map;
dividing the reorganized block feature map into m feature sub-maps, where m is an integer greater than 1;
interleaving and merging the feature blocks in the m feature sub-maps to generate a merged block feature map, where the interleaved merging splits and recombines the m feature sub-maps in units of feature blocks;
and expanding the merged block feature map along width and height to obtain the deformed block feature map.
16. An image processing apparatus, characterized in that the apparatus comprises:
a feature map obtaining module, configured to obtain a correlation feature map corresponding to a target image, wherein the correlation feature map is used to characterize the correlation between a target object to be counted and the target image;
a block embedding module, configured to preprocess the correlation feature map to obtain an initial block feature map containing a plurality of feature blocks;
a density map prediction module, configured to perform density map prediction according to the initial block feature map to obtain a density map corresponding to the target image, wherein the density map prediction comprises at least one block processing operation for restoring a spatial resolution corresponding to an input block feature map based on a learnable parameter;
and a quantity determination module, configured to determine a predicted number of the target objects contained in the target image based on the density map.
17. An apparatus for training an image processing model, wherein the image processing model comprises a feature extraction and correlation calculation network, a block embedding network, and a density map prediction network, the apparatus comprising:
a feature map obtaining module, configured to obtain a correlation feature map corresponding to a sample image through the feature extraction and correlation calculation network, wherein the correlation feature map is used to characterize the correlation between a sample object to be counted and the sample image;
a block embedding module, configured to preprocess the correlation feature map through the block embedding network to obtain an initial block feature map containing a plurality of feature blocks;
a density map prediction module, configured to perform density map prediction according to the initial block feature map through the density map prediction network to obtain a predicted density map corresponding to the sample image, wherein the density map prediction comprises at least one block processing operation for restoring a spatial resolution corresponding to an input block feature map based on a learnable parameter, and the predicted density map is used to determine a predicted number of the sample objects contained in the sample image;
a model training module to train the image processing model based on the predicted density map and an actual density map of the sample image for the sample object.
18. A computer device comprising a processor and a memory, said memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by said processor to implement the image processing method of any of the preceding claims 1 to 12 or to implement the training method of the image processing model of any of the preceding claims 13 to 15.
19. A computer-readable storage medium, having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the image processing method of any of the above claims 1 to 12 or to implement the training method of the image processing model of any of the above claims 13 to 15.
20. A computer program product, characterized in that it comprises computer instructions stored in a computer-readable storage medium, which are read by a processor and executed to implement the image processing method of any of the preceding claims 1 to 12 or to implement the training method of the image processing model of any of the preceding claims 13 to 15.
CN202210216643.4A 2022-03-07 2022-03-07 Image processing method, model training method, device, equipment and storage medium Pending CN114612414A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210216643.4A CN114612414A (en) 2022-03-07 2022-03-07 Image processing method, model training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114612414A true CN114612414A (en) 2022-06-10

Family

ID=81861377

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210216643.4A Pending CN114612414A (en) 2022-03-07 2022-03-07 Image processing method, model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114612414A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115546473A (en) * 2022-12-01 2022-12-30 珠海亿智电子科技有限公司 Target detection method, apparatus, device and medium
CN115546473B (en) * 2022-12-01 2023-04-18 珠海亿智电子科技有限公司 Target detection method, apparatus, device and medium
CN116256372A (en) * 2023-05-16 2023-06-13 征图新视(江苏)科技股份有限公司 Optical film front-end Cheng Juancai full-detection device and detection method thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code
Ref country code: HK; Ref legal event code: DE; Ref document number: 40068480; Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination