CN111814771B - Image processing method and device - Google Patents

Image processing method and device

Info

Publication number
CN111814771B
CN111814771B (application CN202010923823.7A)
Authority
CN
China
Prior art keywords
image
processed
resolution
semantic segmentation
channel
Prior art date
Legal status
Active
Application number
CN202010923823.7A
Other languages
Chinese (zh)
Other versions
CN111814771A (en
Inventor
劳江微
汪佳
王剑
陈景东
顾欣欣
孙剑哲
甘利民
余泉
孙晓冬
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202011565601.9A priority Critical patent/CN112633185B/en
Priority to CN202010923823.7A priority patent/CN111814771B/en
Publication of CN111814771A publication Critical patent/CN111814771A/en
Application granted granted Critical
Publication of CN111814771B publication Critical patent/CN111814771B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/10 Terrestrial scenes
    • G06V 20/13 Satellite images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Embodiments of this specification provide an image processing method and device. A spectral remote sensing image is cut into a plurality of images to be processed, so that the information in the image is preserved as far as possible. Each image to be processed is scaled to a plurality of resolutions (scales), and a semantic segmentation result and a corresponding attention map are obtained at each scale. The semantic segmentation results are then fused using the attention maps as weights. By introducing attention maps to describe the importance of each semantic segmentation result, the method improves the accuracy of target recognition.

Description

Image processing method and device
Technical Field
One or more embodiments of the present disclosure relate to the field of computer technology, and more particularly, to a method and apparatus for image processing using a computer.
Background
Target recognition is a technique by which a computer identifies targets from one or more images or videos. It is widely applicable in scenarios such as autonomous driving, automatic restocking of goods, vehicle damage assessment, face-based attendance, and self-service shopping. A remote sensing image is the product of information gathered by various sensors and carries the information of remotely sensed targets. It may contain a great deal of information, such as water, vegetation, land, and mountainous areas, from which smaller targets can also be distinguished, such as a tree, a person, a traffic sign, or a sign in a football stadium. A spectral remote sensing image is an image acquired by a spectrum-based sensor. However, a remote sensing image generally covers a wide area and has a large field of view, so how to recognize a predetermined target more accurately during processing is a problem worth exploring.
Disclosure of Invention
One or more embodiments of the present specification describe a method and apparatus for image processing to solve one or more of the problems set forth in the background.
According to a first aspect, there is provided a method of image processing for identifying a predetermined target from a spectral remote sensing image, comprising: acquiring a spectral remote sensing image to be processed; detecting whether cloud clusters and/or cloud shadows exist in the image and, if present, performing cloud cluster removal and/or cloud shadow removal; cutting the spectral remote sensing image into a plurality of images to be processed according to a preset size and a preset step size; performing a normalization operation on each image channel of each image to be processed, the normalization operation confining, for a single image channel, the channel value of each pixel to a predetermined range; performing semantic segmentation on the plurality of images to be processed based on a pre-trained convolutional neural network to obtain target recognition results based on the semantic segmentation; and splicing the target recognition results according to the positional relationship of the plurality of images to be processed to obtain the recognition result of the spectral remote sensing image with respect to the predetermined target.
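For illustration, a minimal sketch of the cutting step in Python/NumPy follows; the function name and the return of tile positions are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def cut_into_tiles(image: np.ndarray, tile_size: int, stride: int):
    """Cut an (H, W, C) multispectral image into tile_size x tile_size
    tiles, sliding a window with the given stride. The top-left position
    of each tile is recorded so the results can be spliced back later."""
    h, w, _ = image.shape
    tiles, positions = [], []
    for top in range(0, max(h - tile_size, 0) + 1, stride):
        for left in range(0, max(w - tile_size, 0) + 1, stride):
            tiles.append(image[top:top + tile_size, left:left + tile_size])
            positions.append((top, left))
    return tiles, positions
```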
In a second aspect, a method of image processing is provided, in which an image to be processed is obtained based on a remote sensing image, the method being used to identify a predetermined target from the image to be processed, and comprising: scaling the resolution of the image to be processed to obtain a first image corresponding to a first resolution; extracting image features of the first image on a plurality of predetermined image channels to obtain a plurality of feature maps in one-to-one correspondence with the predetermined image channels; processing the plurality of feature maps through a pre-trained convolutional neural network to obtain a first semantic segmentation result and a first attention map for the first image, wherein the convolutional neural network comprises a basic convolution module followed by a semantic convolution module and an attention convolution module connected in parallel; the first semantic segmentation result is the output of the semantic convolution module and comprises probability maps corresponding to the respective target categories, a single probability map describing, for each pixel of the first image mapped onto the image to be processed, the probability that the pixel is identified as a single target category among the predetermined targets; the first attention map is output by the attention convolution module and indicates the importance, for the image to be processed, of the semantic segmentation result at each pixel; and fusing the first semantic segmentation result with other semantic segmentation results obtained from other images, based on the first attention map and the attention maps corresponding to those other results, to determine the recognition result of the image to be processed with respect to the predetermined target, the other images being images obtained by scaling the image to be processed to other resolutions.
In one embodiment, the image to be processed is obtained by cutting the spectral remote sensing image according to a predetermined scale.
In one embodiment, before the spectral remote sensing image is cut according to the predetermined scale, at least one of the following preprocessing steps is performed: cloud cluster removal and cloud shadow removal.
In one embodiment, each image channel of the spectral remote sensing image corresponds to a channel mean and a channel variance determined statistically over a number of spectral remote sensing images. Before the resolution of the image to be processed is scaled, the following spectral normalization is performed for each image channel: subtracting the corresponding channel mean from the channel value of each pixel on a single image channel to obtain a channel difference for that channel; and determining the normalized value of each pixel on that channel from the channel difference and the channel variance.
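A minimal sketch of this spectral normalization, assuming NumPy and precomputed per-channel statistics; the embodiment speaks of the channel variance, while the common practice shown here divides by the standard deviation, so treat the exact normalization factor as an assumption:

```python
import numpy as np

def normalize_channels(image: np.ndarray,
                       channel_mean: np.ndarray,
                       channel_std: np.ndarray) -> np.ndarray:
    """Per-channel normalization of an (H, W, C) image.
    channel_mean / channel_std hold one statistic per channel, gathered
    over many spectral remote sensing images."""
    # Subtract the channel mean, then scale by the normalization factor.
    return (image - channel_mean) / (channel_std + 1e-8)
```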
In one embodiment, the image to be processed corresponds to an initial resolution, and scaling it to obtain the first image comprises: obtaining the first image by upsampling when the first resolution is greater than the initial resolution; and obtaining the first image by downsampling when the first resolution is smaller than the initial resolution.
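A sketch of the resolution scaling, assuming PyTorch; bilinear interpolation covers both the upsampling and downsampling cases named above:

```python
import torch
import torch.nn.functional as F

def rescale(image: torch.Tensor, target_hw: tuple) -> torch.Tensor:
    """Scale a (C, H, W) image to the target (H, W) resolution.
    Targets larger than the initial resolution upsample the image;
    smaller targets downsample it."""
    return F.interpolate(image.unsqueeze(0), size=target_hw,
                         mode="bilinear", align_corners=False).squeeze(0)
```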
In one embodiment, the plurality of predetermined channels include a near-infrared channel in a service scenario where the predetermined target includes at least one of vegetation and a body of water.
In one embodiment, the basic convolution module includes a multi-resolution convolution layer that divides a single convolution channel into a plurality of sub-convolution channels, each corresponding to convolution operations of successively decreasing resolution; when a single sub-convolution channel connects to a lower sub-convolution channel, the resolution is reduced by downsampling.
In one embodiment, the basic convolution module further includes a fusion layer configured to unify the resolutions of the convolution results of the sub-convolution channels (by upsampling and/or downsampling) and fuse them into a fused feature map, the fusion being at least one of averaging, maximizing, and summing.
In one embodiment, the fusion layer further includes a feature context representation layer for updating the feature value of each feature point by: determining a pre-classification result for each feature point from the fused feature map, the pre-classification result indicating the predetermined target category corresponding to each feature point; and, for a single feature point, updating its feature value with the fused feature value of the feature points whose pre-classification matches that of the single feature point.
In one embodiment, the fusion layer further includes a feature context representation layer for updating the feature value of a single feature point by: determining the degree of association between the single feature point and each other feature point through an attention mechanism; fusing the feature values of the feature points whose degrees of association exceed a preset threshold; and updating the feature value of the single feature point with the resulting fused value.
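The attention-based variant can be sketched as follows, assuming PyTorch and cosine similarity as the degree of association (the patent does not fix the similarity measure):

```python
import torch
import torch.nn.functional as F

def context_enhance(feats: torch.Tensor, threshold: float) -> torch.Tensor:
    """feats: (N, D) feature vectors for N feature points.
    For each point, average the features of all points whose cosine
    similarity to it exceeds the threshold, and use that fused value
    as the updated feature. A simplified reading of the embodiment."""
    normed = F.normalize(feats, dim=1)
    sim = normed @ normed.t()                 # (N, N) degrees of association
    mask = (sim > threshold).float()          # keep sufficiently related points
    weights = mask / mask.sum(dim=1, keepdim=True).clamp(min=1.0)
    return weights @ feats                    # fused feature values
```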
In one embodiment, fusing the first semantic segmentation result with the other semantic segmentation results based on the attention maps comprises: multiplying each attention map by its corresponding semantic segmentation result, adding the products, and taking the sum as the recognition result of the image to be processed with respect to the predetermined target.
In one embodiment, the method further comprises:
splicing the recognition result of the image to be processed with the recognition results of other images to be processed with respect to the predetermined target, to obtain the recognition result of the spectral remote sensing image with respect to the predetermined target, the other images to be processed being the other images obtained by cutting the spectral remote sensing image according to the predetermined scale.
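A hedged sketch of this splicing step, assuming tiles were cut with recorded positions as in the earlier sketch; averaging the overlaps produced by a stride smaller than the tile size is one plausible policy the text does not mandate:

```python
import numpy as np

def stitch_results(results, positions, full_hw, num_classes):
    """Paste per-tile probability maps back at their original positions.
    results[i] has shape (tile, tile, num_classes); overlapping regions
    are averaged."""
    h, w = full_hw
    acc = np.zeros((h, w, num_classes), dtype=np.float32)
    cnt = np.zeros((h, w, 1), dtype=np.float32)
    for r, (top, left) in zip(results, positions):
        th, tw, _ = r.shape
        acc[top:top + th, left:left + tw] += r
        cnt[top:top + th, left:left + tw] += 1.0
    return acc / np.maximum(cnt, 1.0)
```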
According to a third aspect, there is provided an image processing apparatus for identifying a predetermined target from a spectral remote sensing image, comprising:
an image acquisition unit, configured to acquire a spectral remote sensing image to be processed;
a cloud removal unit, configured to detect whether cloud clusters and/or cloud shadows exist in the spectral remote sensing image and, if present, to perform cloud cluster removal and/or cloud shadow removal;
a cutting unit, configured to cut the spectral remote sensing image into a plurality of images to be processed according to a preset size and a preset step size;
a normalization unit, configured to perform a normalization operation on each image channel of each image to be processed, the normalization operation confining, for a single image channel, the channel value of each pixel to a predetermined range;
a semantic segmentation unit, configured to perform semantic segmentation on the plurality of images to be processed determined by the cutting operation, based on a pre-trained convolutional neural network, obtaining target recognition results based on the semantic segmentation;
and a result splicing unit, configured to splice the target recognition results according to the positional relationship of the plurality of images to be processed, obtaining the recognition result of the spectral remote sensing image with respect to the predetermined target.
According to a fourth aspect, there is provided an apparatus for image processing, wherein an image to be processed is obtained based on a remote sensing image, the apparatus is configured to identify a predetermined target from the image to be processed, and comprises:
the resolution adjusting unit is configured to perform resolution scaling on the image to be processed to obtain a first image corresponding to a first resolution;
the feature extraction unit is configured to extract image features of the first image on a plurality of preset channels to obtain a plurality of feature maps corresponding to the preset channels one by one;
a convolution processing unit, configured to process the plurality of feature maps through a pre-trained convolutional neural network to obtain a first semantic segmentation result and a first attention map, wherein the convolutional neural network comprises a basic convolution module followed by a semantic convolution module and an attention convolution module connected in parallel; the first semantic segmentation result is the output of the semantic convolution module and comprises probability maps corresponding to the respective target categories, a single probability map describing, for each pixel of the first image mapped onto the image to be processed, the probability that the pixel is identified as a single target category; the first attention map is output by the attention convolution module and indicates the importance, for the first image, of the semantic segmentation result at each pixel;
and a recognition result merging unit, configured to fuse the first semantic segmentation result with other semantic segmentation results obtained from other images, based on the first attention map and the attention maps corresponding to those other results, to determine the recognition result of the image to be processed with respect to the predetermined target, the other images being images obtained by scaling the image to be processed to other resolutions.
According to a fifth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first or second aspect.
According to a sixth aspect, there is provided a computing device comprising a memory and a processor, wherein the memory has stored therein executable code, and wherein the processor, when executing the executable code, implements the method of the first or second aspect.
According to the method and device provided by the embodiments of this specification, the multispectral image is cut, each cut image to be processed undergoes multispectral normalization, semantic segmentation, and related processing, and the target recognition results obtained by semantic segmentation of the individual images are then spliced together to yield the target recognition result of the whole multispectral image. This image processing approach, based on segmentation and deep learning, provides a more effective way to perform target recognition.
In detail, each image to be processed is scaled to a plurality of scales (corresponding resolutions), and a semantic segmentation result and a corresponding attention map are obtained at each scale. The semantic segmentation results are then fused using the attention maps. Because the technical architecture introduces attention maps to describe the importance of each semantic segmentation result, the accuracy of target recognition is improved.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings based on them without creative effort.
FIG. 1 illustrates a schematic diagram of one implementation scenario of the present description;
FIG. 2 illustrates an overall implementation architecture diagram of one embodiment of the present description;
FIG. 3 is a diagram illustrating an image processing architecture for implementing one embodiment of the present description;
FIG. 4 illustrates a flow diagram of a method of image processing according to one embodiment;
FIG. 5 shows a convolutional neural network process flow diagram in one specific example;
FIG. 6 is a schematic view of a convolutional neural network processing flow in another specific example;
FIG. 7 shows a flow diagram of a method of image processing according to another embodiment;
FIG. 8 shows a schematic block diagram of an apparatus for image processing according to one embodiment;
fig. 9 shows a schematic block diagram of an apparatus for image processing according to another embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
First, a description is given with reference to the embodiment shown in fig. 1. Fig. 1 is a schematic diagram of a remote sensing image. The image may include water, buildings, and various kinds of vegetation. The image shown in fig. 1 may cover a large actual area, such as a county, a city, a province, or a country. It can be captured from high altitude, looking down, by remote sensing equipment such as a spectral camera.
The remote sensing image shown in fig. 1 can be used for a corresponding target recognition service scene, such as a predetermined vegetation recognition service scene, a water body recognition service scene, a terrain recognition (including recognition of water bodies, buildings, vegetation and the like) service scene and the like.
Specifically, referring to the implementation architecture shown in fig. 2, after the spectral remote sensing image to be processed is obtained, cloud removal may be performed using a cloud removal algorithm: cloud clusters in the image, the cloud shadows they cast, and so on are detected, and the detected cloud clusters and/or cloud shadows are removed, for example by cutting out the image regions containing them. A cutting operation is then performed on the spectral remote sensing image. It will be appreciated that a remote sensing image typically corresponds to a ground area, for example 500 km × 500 km. Like an ordinary image, a remote sensing image is composed of picture elements (each analogous to a pixel of an ordinary image), and one picture element generally corresponds to an area on the ground, such as 100 m × 100 m. When the remote sensing image is recognized, if it has few picture elements (i.e., a low resolution) it can be used directly as the image for target recognition, whereas if it has many picture elements (i.e., a high resolution) it can be cut according to a set size and target recognition performed on each cut image separately. The image on which a predetermined target is identified is called the image to be processed; it may be the spectral remote sensing image itself or a small image cut from it.
Next, a multispectral normalization operation may be performed separately on each image to be processed: for a single image channel, the normalized value of each pixel is determined from the difference between the pixel's channel value and the channel mean of that channel, using the channel variance of that channel as the normalization factor. The channel mean and channel variance are both determined statistically. Then, a semantic segmentation operation is performed on each image to be processed to obtain the corresponding target recognition result, and the target recognition results of the images to be processed are spliced to obtain the recognition result of the spectral remote sensing image with respect to the predetermined target. It will be appreciated that, in the image to be processed, relatively small targets (e.g., crop vegetation) may require an "enlarged image" for recognition, while relatively large targets (e.g., bodies of water, mountains) may be recognized as-is or in a "reduced image". Here, "enlarging" an image means increasing its resolution, and "reducing" it means decreasing its resolution. For example, if the image to be processed has a resolution of 300 × 300 picture elements, then after "enlargement" it has a resolution of 600 × 600 picture elements, and the ground area corresponding to a single picture element shrinks accordingly. That is, during target recognition, images at several resolutions may be derived from the image to be processed, by reducing or enlarging the resolution, in order to recognize predetermined targets of various sizes. Images at multiple resolutions may also be called multi-scale images.
Fig. 3 shows, for a specific example, the processing architecture after the image to be processed is scaled into images at several resolutions. As shown in fig. 3, the original resolution h × w may be reduced to resolutions such as one-half of h × w and one-quarter of h × w. The resolution scaling may be performed by upsampling, downsampling, or bilinear interpolation, which is not detailed here.
Referring to fig. 3, each image obtained by scaling the image to be processed may be processed by a pre-trained convolutional neural network to obtain a corresponding semantic segmentation result (probability maps) and an attention map. Semantic segmentation supports fine-grained target recognition, such as identifying the target category of each pixel. Under the framework of this specification, the probability that each pixel belongs to a given target category can be determined for every target category.
In order to process the image to be processed through the convolutional neural network, features on a plurality of image channels (such as the color channels R, G, B) may be extracted from it, yielding a plurality of feature maps. As shown in fig. 3, for an image to be processed with resolution h × w, 4 feature maps corresponding to 4 image channels may be obtained. Each feature map has size h × w, and on a single channel each element is the feature value of the corresponding pixel on that image channel, e.g., a color value of 255 on the R channel. For convenience, this specification refers to the convolution operations applied to the stacked feature maps before the network branches as the basic convolution module. The branches are a semantic segmentation branch and an attention determination branch, corresponding to a semantic convolution module and an attention convolution module, respectively. Because the pre-trained convolutional neural network places the semantic convolution module and the attention convolution module on parallel convolution channels after the basic convolution module, processing the image to be processed with the network yields both a semantic segmentation result and an attention map.
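As an illustration of this topology, a minimal PyTorch sketch follows; layer widths, activations, and the single-channel attention head are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class SegAttentionNet(nn.Module):
    """Sketch of the described topology: a shared basic convolution
    module followed by two parallel heads, one producing per-class
    probability maps and one producing an attention map."""
    def __init__(self, in_channels: int = 4, num_classes: int = 3):
        super().__init__()
        self.base = nn.Sequential(                 # basic convolution module
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.semantic_head = nn.Conv2d(64, num_classes, 1)  # semantic convolution module
        self.attention_head = nn.Conv2d(64, 1, 1)           # attention convolution module

    def forward(self, x: torch.Tensor):
        feats = self.base(x)
        probs = torch.softmax(self.semantic_head(feats), dim=1)  # c probability maps
        attn = torch.sigmoid(self.attention_head(feats))         # attention map
        return probs, attn
```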
On the one hand, the semantic convolution module outputs a semantic segmentation result, which may comprise several probability maps. For example, the number of probability maps matches the number of target categories, and a single probability map describes, for each pixel of the image mapped onto the spectral remote sensing image, the probability of being classified into the corresponding target category. If the resolution of the image to be processed is h × w, a single probability map also has h × w elements. The number of probability maps shown in fig. 3 is c, corresponding to c target categories. For example, with c = 3 and the target categories in the architecture of fig. 1 being vegetation 1, vegetation 2, and vegetation 3, three h × w probability maps are obtained.
On the other hand, the attention convolution module outputs an attention map. The number of elements of the attention map may match the pixels of the currently processed image mapped onto the image to be processed. Each element in the attention map may describe the importance of the corresponding pixel at the current resolution.
Similarly, the images at various resolutions derived from the same image to be processed can each be processed by the above pre-trained convolutional neural network, yielding probability maps matching the number c of target categories plus one attention map. In fig. 3, where the original spectral remote sensing image has resolution h × w, c probability maps of size h × w and one attention map are obtained for each of the original resolution, the half-resolution, and the quarter-resolution versions of the image to be processed. Notably, the convolutional neural networks shown for the different resolutions in fig. 3 may be one and the same network, drawn separately only to describe the processing more clearly.
Further, the attention maps may be used as weights to fuse the semantic segmentation results (probability maps) of the individual images, yielding the recognition result of the predetermined target for the image to be processed. In one embodiment, each semantic segmentation result is multiplied by its corresponding attention map, and the products are summed to give the final semantic segmentation result of the image to be processed, which serves as the recognition result of the predetermined target.
According to a possible design, the original spectral remote sensing image may be cut, each cut part serving as an image to be processed in the above procedure to obtain the recognition result of the corresponding predetermined target. The recognition results of the cut images are then spliced together to obtain the overall target recognition result for the initial spectral remote sensing image.
The implementation architecture of this specification has been described above; a detailed description of the processing flow for a spectral remote sensing image according to an embodiment follows, with reference to fig. 4.
In fig. 4, a single image to be processed is taken as the processing object. In one implementation, the image to be processed may be an entire initial spectral remote sensing image; such an image may correspond to a large geographic area, for example a county, and its pixels may be called picture elements, one picture element corresponding, for example, to a 20 m × 20 m ground area. In another implementation, the image to be processed may be a partial spectral remote sensing image obtained by cutting the whole image in a predetermined manner, for example according to a predetermined size (such as 500 × 500 picture elements) and a predetermined step size (such as 200). The predetermined size may correspond to a predetermined resolution; for example, an image stored as 500 × 500 picture elements may be understood as having a resolution of 500 × 500. If one picture element corresponds to a 20 m × 20 m ground area, an image with resolution 500 × 500 corresponds to a 10 km × 10 km ground area.
In a possible embodiment, the resolution of the spectral remote sensing image may also be expressed as the geographic area corresponding to one picture element, for example 50 m × 50 m; the receptive field grows as this value increases and shrinks as it decreases. The spectral remote sensing image can likewise be cut in terms of such ground resolution, for example into pieces of a predetermined size of 500 m × 500 m.
For target recognition there is typically a definite target to recognize, such as trees, people, mountains, bodies of water, or crop categories. Such an explicit target to be identified may be called a predetermined target. As shown in fig. 4, the target recognition process for the image to be processed is described by taking an image at an arbitrary resolution (referred to as the first image) derived from the image to be processed as an example. The image processing flow 400 may include: step 401, scaling the resolution of the image to be processed to obtain a first image corresponding to a first resolution; step 402, extracting image features of the first image on a plurality of predetermined image channels to obtain feature maps in one-to-one correspondence with those channels; step 403, processing the feature maps through a pre-trained convolutional neural network to obtain a first semantic segmentation result and a first attention map, where the network comprises a basic convolution module followed by a semantic convolution module and an attention convolution module in parallel; the first semantic segmentation result is the output of the semantic convolution module and comprises a probability map for each target category, a single probability map describing, for each pixel of the first image mapped onto the image to be processed, the probability that the pixel is identified as a single target category among the predetermined targets; the first attention map is output by the attention convolution module and indicates the importance, for the first image, of the semantic segmentation result at each pixel; and step 404, fusing the first semantic segmentation result with the other semantic segmentation results obtained from other images, based on the first attention map and the other attention maps, to determine the recognition result of the image to be processed with respect to the predetermined target, the other images being obtained by scaling the image to be processed to other resolutions.
First, in step 401, the resolution of the image to be processed is scaled to obtain a first image at a first resolution. It will be appreciated that a spectral remote sensing image has its own resolution, expressed for instance as 500 × 500 picture elements. Because it is shot from high altitude over long distances, target objects in it may appear large or small. For a smaller target object such as a crop, the ground resolution may need to be reduced, for example to 10 m × 10 m per picture element, for recognition, whereas a large tree may be recognizable at 20 m × 20 m.
In this step 401, the resolution of the image to be processed may be adjusted by downsampling or upsampling. When the ground resolution value (area per picture element) is reduced, the number of picture elements increases and the ground area per picture element shrinks, i.e., the image is enlarged; when it is increased, the number of picture elements decreases and the ground area per picture element grows, i.e., the image is reduced.
Where the recognition targets are large-area crops, bodies of water, and the like, the resolution can be lowered for the image to be processed, yielding an image described by fewer pixels. For example, reducing h × w to one-half of h × w corresponds to shrinking the image to be processed by half. In practice, scaling may follow a predetermined ratio or a set resolution, with no limitation here. When the scaled image has the same pixels as the image to be processed, this can be regarded as 1:1 scaling.
For convenience of description, among the images obtained by scaling the image to be processed, one at a given resolution is denoted the first image and its resolution the first resolution. The first resolution may be any resolution set in advance or determined by a predetermined scaling; for example, the 500 × 500 image to be processed reduced to 1/2 gives 250 × 250. Since resolution is described in two directions, the scaling of resolution is usually recorded as the rate at which the apparent size in a single direction increases or decreases.
Next, in step 402, image features of the first image on a plurality of predetermined image channels are extracted, giving a feature map for each channel. The first image generally comprises a plurality of pixels and can be decomposed into descriptions on several predetermined image channels, such as the R, G, B color channels or a light-intensity/luminance channel. A spectral remote sensing image may further include a near-infrared channel describing the spectral characteristics of near-infrared light.
A typical image may be represented by three color channels. For example, when the color coding is RGB, the predetermined image channels may include three channels corresponding to the red R, green G, and blue B components, the coded value of a feature point on a single channel being the value of the corresponding pixel on that component; when the color coding is YUV, the predetermined image channels may include three channels corresponding to the luma Y and chroma U, V components, the coded value of a feature point on a single channel likewise being the value of the corresponding pixel on that component.
Near-infrared (NIR) light is electromagnetic radiation between visible light (VIS) and mid-infrared (MIR). It captures colors such as blue and green more sensitively and is therefore particularly useful for identifying targets such as green plants and bodies of water. Thus, in a target recognition service scenario involving vegetation, bodies of water, and the like, a channel corresponding to near-infrared light may optionally be used as a predetermined image channel. On that channel, the frequency difference, amplitude difference, and so on between the transmitted and reflected waves may serve as pixel features.
The image features of all pixels on a single predetermined image channel form an array of features, called a feature map, with the same size and layout as the first image. With N (a positive integer) predetermined channels, N feature maps are obtained. For example, when the predetermined channels are the four image channels near-infrared (NIR), R, G, and B, as shown in fig. 2, 4 feature maps may be extracted for the first image. If the resolution of the first image is (1/2)(h × w), a single feature map also has (1/2)(h × w) elements.
In a possible design, before features are extracted for each image channel, a normalization operation may be performed on each image channel separately to normalize the channel values of the pixels. The channel values on different image channels can differ greatly, which may hurt model training performance. The normalization operation therefore aims to confine the channel values of the picture elements within a predetermined range.
As an example, taking a first picture element, channel values may be normalized as follows: subtract the channel mean of the image channel from the original channel value to obtain the channel difference for the first picture element, then determine the normalized value of the first picture element on that channel from the ratio of the channel difference to the channel variance; the normalized value is, for example, positively correlated with this ratio. The channel mean and channel variance can be statistics over a large number of spectral remote sensing images: the channel mean is the average of the channel values of the picture elements of many spectral remote sensing images on that channel, and the channel variance is the variance of those channel values about the mean.
The normalization operation may also be implemented in other ways. For example, the normalization factor on an image channel may be the maximum spread of the channel values (the difference between the maximum and the minimum), and the normalized value of the first picture element may be the ratio of its channel value minus the minimum to the maximum spread. As another example, the normalized value may be the ratio of its channel value minus the channel mean to the maximum spread, and so on.
Using normalized values in place of raw channel values can simplify computation and make the features extracted on the corresponding image channel more distinct; for example, positive and negative normalized values may indicate opposite characteristics.
Thus, for each predetermined image channel, the corresponding feature map can be obtained from the channel value or normalized value of each pixel on that channel.
Next, in step 403, each feature map is processed through a convolutional neural network trained in advance to obtain a first semantic segmentation result and a first attention map for the first image. Under the implementation architecture of the present specification, the convolutional neural network may include a basic convolution module, and a semantic convolution module and an attention convolution module connected in parallel to the basic convolution module. The first semantic segmentation result can be an output result of the semantic convolution module, and the attention map can be an output result of the attention convolution module.
For the first image, the semantic convolution module may output the first semantic segmentation result, which comprises probability maps corresponding to the respective target categories; a single probability map describes, for each pixel of the first image mapped onto the image to be processed, the probability that the pixel is identified as the corresponding target category.
For the first image, the attention convolution module may output the first attention map, which indicates the importance, for the image to be processed, of the semantic segmentation result at each pixel of the first semantic segmentation result. The quantitative representation of this importance in an attention map may be called an attention value, or confidence.
It should be noted that the terms "first semantic segmentation result", "first attention map", and so on merely indicate correspondence with the first image and do not limit their substance.
The basic convolution module of the convolutional neural network can process the stacked feature maps to mine the associations between each pixel and its channels, preparing for semantic segmentation and attention determination.
In one embodiment, the basic convolution module may be a conventional convolution structure with several convolution kernels; fig. 5 shows a specific example of the network under such a structure. In the basic convolution module, each convolutional layer is characterized by a different receptive field (the local region, in width and height, of the previous layer to which a neuron connects). Extracting features at different receptive fields involves parameters such as the weights and biases represented by the convolution kernels. During training, these parameters are adjusted continually according to the sample labels (which may be target annotation results for the pictures), so that the basic convolution module learns to extract implicit features meaningful for target recognition, to output the semantic segmentation result for every pixel, and to determine its importance. As shown in fig. 5, the semantic convolution module and attention convolution module may be implemented as two parallel convolution units whose depth (number of convolutional layers) and channel counts can be set as needed; for example, the number of output channels of the semantic convolution module's layers may match the number of target categories, while the attention convolution module's may be 1 or may match the semantic module's. The depth of both modules shown in fig. 5 is 1, but in practice any other reasonable value may be used.
In another embodiment, the basic convolution module may be implemented with a high-resolution network (HRNet) convolution architecture. Fig. 6 shows a specific example using such a multi-resolution convolution structure.
The multi-resolution layer may be implemented by convolution schemes such as HRNet, HRNet+ (a modified HRNet), deformable convolution (DCN), atrous spatial pyramid pooling (ASPP), and the like. The example of fig. 6 uses HRNet, which divides the input convolution channel into several sub-convolution channels and performs conventional convolution at a different spatial resolution in each. In multi-resolution convolution, the sub-convolution channels operate at successively decreasing resolutions. As shown in fig. 6, when a single sub-convolution channel connects to a lower one, the resolution is reduced by downsampling; the lower sub-convolution channel can be understood as a newly added convolution line. In the example of fig. 6, the initial feature map processed by a new sub-convolution channel is determined by the downsampled current-layer outputs of all upper sub-convolution channels. Note that sub-convolution channels here are distinct convolution lines, a different concept from image channels or the output channels of a feature map. The number of image channels can be understood as the number of feature maps; for example, the features collected in step 402 are feature maps on the four image channels red R, green G, blue B, and near-infrared NIR, 4 feature maps in total. These 4 feature maps may be processed through one convolution line or several, each line processing all 4. Output channels are a concept similar to image channels and are not detailed here.
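A toy sketch of the branching step in this spirit, assuming PyTorch; real HRNet additionally repeats such stages and exchanges information across resolutions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchBlock(nn.Module):
    """Two-resolution block in the spirit of HRNet: a full-resolution
    sub-convolution channel plus a half-resolution channel whose input
    is the downsampled output of the upper channel."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.high = nn.Conv2d(channels, channels, 3, padding=1)
        self.low = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor):
        high = F.relu(self.high(x))                 # full-resolution convolution
        low_in = F.avg_pool2d(high, kernel_size=2)  # downsample to open the lower channel
        low = F.relu(self.low(low_in))              # half-resolution convolution
        return high, low
```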
According to one embodiment, as shown in fig. 6, when the basic convolution module includes the multi-resolution convolution layer, it may further include a fusion layer in which the convolution results of the multi-resolution layer are fused into a fused feature map. For example, low-resolution convolution results may be raised to a high resolution by upsampling before fusion, or higher-resolution results may be downsampled while lower-resolution results are upsampled, so that the results of all sub-convolution channels are brought to a predetermined resolution before fusion. Fusion may, for example, average or maximize the feature points at corresponding positions of the feature maps. The fused convolution result can then be processed by the semantic convolution module and the attention convolution module to obtain the semantic segmentation result and the attention map. The fused feature map may comprise one or more feature maps.
In an optional implementation, the fusion layer may further include a feature context representation layer, whose role is feature enhancement: distinguishing more clearly the feature points corresponding to different predetermined target categories.
In one embodiment, the feature context representation layer is implemented with OCR (object-contextual representations). The basic OCR architecture comprises coarse classification followed by fusion of homogeneous feature points, and may further process an intermediate convolution result with the homogeneous fusion result to obtain the feature enhancement. Under the OCR framework, a pre-classification (coarse classification) result for each feature point can be determined from the fused feature map; this result may indicate the predetermined target category of each feature point, such as vegetation or a body of water. For a single feature point, the feature values of the feature points sharing its predetermined category in the pre-classification (for example, all vegetation) may be fused into one value, which is used to update the feature value of the single feature point, so that it better reflects the features of its category, achieving the purpose of feature enhancement.
The feature value fusion may be, for example, at least one of summing, averaging, and maximizing. For instance, for a feature point, the feature values of its neighboring feature points may be averaged to give its updated value. The neighbors may be predefined, e.g., only the four nearest feature points (above, below, left, right) or the eight nearest (adding the four diagonals). Optionally, after the OCR layer the fusion layer may connect a rendering layer (e.g., PointRend) to improve image resolution, which is not detailed here.
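A much-simplified sketch of this homogeneous-feature-point fusion, assuming PyTorch; genuine OCR computes soft region representations and pixel-region attention rather than the hard class means used here:

```python
import torch

def ocr_style_enhance(feats: torch.Tensor, coarse_logits: torch.Tensor):
    """feats: (N, D) feature points; coarse_logits: (N, K) pre-classification.
    Each point's feature is replaced by the mean feature of all points
    sharing its coarsely predicted class, a crude form of the
    homogeneous fusion described above."""
    labels = coarse_logits.argmax(dim=1)               # (N,) coarse classes
    enhanced = feats.clone()
    for k in labels.unique():
        members = labels == k
        enhanced[members] = feats[members].mean(dim=0)  # homogeneous fusion value
    return enhanced
```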
In another embodiment, the feature context representation layer may be implemented with an attention mechanism. For example, for a single feature point: determine its degree of association with every other feature point through the attention mechanism, select several feature points in decreasing order of association for feature value fusion, and update the single point's feature value with the fused value. The degree of association between two feature points may be determined, for example, as positively correlated with the similarity of their feature values, or by the similarity between the vectors formed by each point's feature values across the channels. If the fusion result comprises 3 feature maps, each feature point (say, the one in the first row and first column) has a value on each map and thus corresponds to a 3-dimensional vector; the degree of association between two feature points can then be described by the similarity of their 3-dimensional vectors. When this similarity is large, the two points can be regarded as having the same predetermined target category and can be used to enhance each other's features.
In an optional embodiment, in the OCR or attention mechanism above, only a subset of the relevant feature points may be selected for feature value fusion. For example, under the attention mechanism, for a single feature point, the points whose degree of association exceeds a predetermined threshold are selected, or a predetermined number of points are selected in decreasing order of association.
Further, the semantic convolution module may pass the output of the basic convolution module through at least one convolutional layer to produce several probability maps whose number of feature points matches the image to be processed, describing the probability that each pixel is classified into a predetermined target category (i.e., identified as a predetermined target); the number of probability maps may match the number of target categories. The attention convolution module may likewise pass the output of the basic convolution module through convolutional layers to produce an attention map whose number of feature points matches the pixels of the image to be processed, describing the importance or confidence of each pixel's probabilities in the maps output by the semantic convolution module.
In addition to the first image, several other images of the image to be processed at other resolutions, such as a second image at a second resolution and a third image at a third resolution, may be obtained in the manner of step 401. Each of these images can then be handled as in steps 402 and 403 to obtain its own semantic segmentation result and attention map: the second image yields a second semantic segmentation result and a second attention map, the third image a third of each, and so on.
Further, through step 404, the first semantic segmentation result is fused with the other semantic segmentation results obtained via the other images, based on the first attention map and the attention maps corresponding to those other results, so as to determine the recognition result of the image to be processed for the predetermined target. Here, the other images are images analogous to the first image of step 401, obtained by resolution scaling of the image to be processed, such as the aforementioned second and third images. It can be understood that, across the semantic segmentation results of the individual images, the number of probability maps and the resolution of each probability map are consistent, and each attention map matches the resolution of a single probability map.
As a specific example, assume the resolution of the image to be processed is h × w. Take the probability map of the first image for a given predetermined target (e.g., a wheat crop) and multiply its h × w elements one-to-one with the h × w elements of the first attention map, giving a first product. Apply the same processing to the probability maps of the other images for that predetermined target to obtain the other products, such as a second product, a third product, and so on. Then add the products corresponding to the first image and the other images to obtain the probability map of the image to be processed for that predetermined target. In the same way, the probability maps of the image to be processed for each predetermined target can be obtained, and their results integrated (for example, selecting for each pixel the target category with the highest probability as the finally identified category), yielding the target recognition result for the image to be processed.
As another specific example, assume the resolution of the image to be processed is h × w, regard each semantic segmentation result as a three-dimensional matrix of h × w × C, and each attention map as a two-dimensional matrix of h × w. The attention map can then be treated as a weight coefficient matrix for its semantic segmentation result, and the fusion of the semantic segmentation results becomes a weighted fusion with the corresponding attention maps as the weight coefficient matrices. Let N be the number of images at different resolutions corresponding to the image to be processed, let $C_i$ be the semantic segmentation result of the image $i$ at a given resolution, and let $E_i$ be its attention map. The target recognition result may then be:

$$\text{Result} = \sum_{i=1}^{N} E_i \odot C_i$$

where $\odot$ denotes element-by-element multiplication by position, $E_i$ being broadcast across the C class channels so that each product $E_i \odot C_i$ is again a three-dimensional matrix of h × w × C.
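A minimal sketch of this weighted fusion, assuming NumPy arrays already scaled back to the h × w resolution of the image to be processed:

```python
import numpy as np

def fuse_results(seg_results, attn_maps):
    """Weighted fusion of per-resolution semantic segmentation results.

    seg_results: list of N arrays of shape (H, W, C)  -- the C_i
    attn_maps:   list of N arrays of shape (H, W)     -- the E_i
    Returns the (H, W, C) fused target recognition result.
    """
    total = np.zeros_like(seg_results[0], dtype=float)
    for C_i, E_i in zip(seg_results, attn_maps):
        total += E_i[..., None] * C_i    # E_i broadcast over the C class channels
    return total
```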
In a possible implementation, if the image to be processed is a part cut from the initial spectral remote sensing image, the target recognition results of the individual parts may be spliced according to the positions those parts occupied before cutting, so as to obtain the target recognition result corresponding to the initial spectral remote sensing image.
The convolutional neural network in the above process can be trained on a large number of picture samples derived from spectral remote sensing images. A single picture sample is either an initial spectral remote sensing image or an image to be processed cut from one. Each picture sample may further carry a pre-labelled target recognition result indicating the target category to which each pixel belongs. Each sample is run through the image processing flow shown in fig. 4, the target recognition result of step 404 is compared with the pre-labelled result to obtain a loss, and the model parameters, such as the convolution kernels and biases in the convolutional neural network, are adjusted in the direction that reduces the loss.
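One training step might look like the following hedged PyTorch sketch, assuming the whole flow of fig. 4 is wrapped in a single module that returns per-pixel class scores (logits), with cross-entropy as one conventional per-pixel loss rather than a choice mandated by the text:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, sample, label):
    """sample: batch of picture samples, (B, channels, H, W);
    label:  pre-labelled target categories per pixel, (B, H, W), long."""
    optimizer.zero_grad()
    scores = model(sample)                 # target recognition scores, (B, C, H, W)
    loss = F.cross_entropy(scores, label)  # compare with the pre-labelled result
    loss.backward()                        # gradients w.r.t. conv kernels and biases
    optimizer.step()                       # adjust parameters to reduce the loss
    return loss.item()
```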
In a possible design, referring to fig. 2, the spectral remote sensing image needs to be visualized during picture sample construction. Visualization here means making the image easier for the human eye, or an annotation model, to recognize. For example, the R, G and B channels of the spectral remote sensing image are composed into a true-color image, convenient for labelling houses, vehicles, ships and the like, and the NIR, R and G channels are composed into a false-color image, convenient for labelling forests, cultivated land and the like. An annotator or annotation model then labels the regions of interest, for example by tracing the boundary lines of a predetermined target.
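For illustration, composing these visualization images might look as follows; the band indices are hypothetical and depend on the sensor product:

```python
import numpy as np

# Hypothetical band positions within the spectral image, for illustration only.
B, G, R, NIR = 1, 2, 3, 7

def compose(img, bands):
    """img: (H, W, num_bands) spectral image; returns a 3-channel
    composite linearly stretched to [0, 1] for display/labelling."""
    comp = img[..., list(bands)].astype(float)
    lo, hi = comp.min(), comp.max()
    return (comp - lo) / (hi - lo + 1e-8)

true_color = lambda img: compose(img, (R, G, B))     # houses, vehicles, ships
false_color = lambda img: compose(img, (NIR, R, G))  # forests, cultivated land
```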
In one embodiment, to improve the training efficiency and stability of the semantic segmentation model, data enhancement may be performed on the cut images to be processed and their region-of-interest label images. The enhancements may include, for example, multi-scale transformation, random cropping, random mirroring, random illumination transformation, and random gamma correction. The illumination transformation and gamma correction, in particular, reduce the effect of illumination inhomogeneities on the image.
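A small sketch of some of these enhancements, with illustrative parameter ranges that are assumptions rather than values from the text:

```python
import numpy as np

def augment(image, label, rng=np.random.default_rng()):
    """image: (H, W, C) in [0, 1]; label: (H, W) per-pixel categories.
    Geometric transforms apply to image and label alike; photometric
    ones to the image only."""
    if rng.random() < 0.5:                        # random mirroring
        image, label = image[:, ::-1], label[:, ::-1]
    image = image * rng.uniform(0.8, 1.2)         # random illumination transformation
    image = np.clip(image, 0.0, 1.0) ** rng.uniform(0.7, 1.5)  # random gamma correction
    return image, label
```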
Those skilled in the art will readily understand that, during training of the convolutional neural network, the meanings of the outputs of the semantic convolution module and the attention convolution module are merely presumed: the subsequent processing treats each output as an intermediate result carrying its presumed meaning in order to produce the final target recognition result of step 404. Consequently, once the parameters have been adjusted according to the comparison between the final output and the labels of the picture samples, the outputs of the semantic convolution module and the attention convolution module can indeed be regarded as carrying the meanings presumed before training, namely the semantic segmentation result and the attention map, respectively.
It will be appreciated that when a spectral remote sensing image is acquired, the acquisition device typically captures the corresponding spectral signal from high altitude via reflected light, so the acquired image may contain obstructions such as cloud cover and cloud-cover shadows. In an optional implementation, before or after the image to be processed is determined, at least one of cloud cluster removal, shadow removal, and the like may be performed to remove cloud layers, cloud projections, and so on from the spectral remote sensing image. The cloud removal may build a deep-learning cloud segmentation model on algorithms such as DeepLabv3+ or HRNet-OCR (Object-Contextual Representations), so as to remove more accurately the cloud layers and shadows that would otherwise degrade result accuracy.
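Assuming such a model yields a binary cloud/shadow mask, the removal step itself might simply blank out the flagged pixels, as in this sketch:

```python
import numpy as np

def remove_clouds(image, cloud_mask):
    """image: (H, W, C); cloud_mask: (H, W), nonzero where cloud or
    shadow was segmented. Flagged pixels are zeroed so they do not
    contribute to later recognition; the mask source is assumed."""
    out = image.copy()
    out[cloud_mask.astype(bool)] = 0
    return out
```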
Reviewing the above process: when processing an image to be processed, the method provided in the embodiments of this specification scales it to multiple scales (corresponding to multiple resolutions), obtains a semantic segmentation result and a corresponding attention map at each scale, and then fuses the semantic segmentation results using the attention maps. By introducing an attention map to describe the importance of the semantic segmentation at each scale, the accuracy of the target recognition result is improved.
Referring to fig. 7, the image processing flow for a spectral remote sensing image in a specific example is shown. The flow may be executed by any electronic device or device cluster with a certain computing power. The process 700 includes:
and step 701, acquiring a spectral remote sensing image to be processed.
Step 702: detect whether cloud clusters and/or cloud shadows exist in the spectral remote sensing image to be processed, and perform the corresponding cloud cluster removal and/or cloud shadow removal when they exist. The detection may use any of various cloud detection approaches, for example a deep-learning cloud segmentation model built on algorithms such as DeepLabv3+ or HRNet-OCR. Cloud clusters and cloud shadows may be removed, for example, by masking out the corresponding partial images.
Step 703: cut the spectral remote sensing image to be processed into a plurality of images to be processed according to a predetermined size and a predetermined step length. For example, the spectral remote sensing image may be cut into 50 × 50-pixel tiles with a step of 20 pixels, or into tiles covering a ground area of 500 m × 500 m with a step of 200 m, and so on.
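A sketch of this sliding-window cut; edge remainders are ignored here for brevity:

```python
def crop_tiles(image, size=50, step=20):
    """Cut an (H, W, ...) array into size x size tiles with the given
    step, returning each tile with its top-left position so the
    recognition results can later be spliced back (step 706)."""
    H, W = image.shape[:2]
    tiles = []
    for y in range(0, H - size + 1, step):
        for x in range(0, W - size + 1, step):
            tiles.append(((y, x), image[y:y + size, x:x + size]))
    return tiles
```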
Step 704: perform a normalization operation for each image channel of each image to be processed. For a single image channel, the normalization operation confines the channel values of the pixels to a predetermined range. For example, the normalized value of a given pixel on a channel may be positively correlated with the ratio of the difference between that pixel's channel value and the channel mean to the channel variance on that channel. As another example, the normalization factor on a channel may be the maximum spread of the channel values (the difference between the maximum and the minimum), the normalized value of a pixel then being the ratio of the difference between its channel value and the minimum (or the channel mean) to that maximum spread.
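A sketch of the first variant, dividing by the channel standard deviation (the conventional reading of the "channel variance" wording here):

```python
import numpy as np

def normalize_channels(image, eps=1e-8):
    """Per-channel standardization of step 704 for an (H, W, C) image:
    subtract each channel's mean and divide by its spread."""
    mean = image.mean(axis=(0, 1), keepdims=True)
    std = image.std(axis=(0, 1), keepdims=True)
    return (image - mean) / (std + eps)
```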
Step 705: perform semantic segmentation on each image to be processed based on a pre-trained convolutional neural network, to obtain each target recognition result based on semantic segmentation. For example, the target recognition result of each image to be processed may be determined through the flow shown in fig. 4, which is not repeated here.
Step 706: splice the target recognition results according to the positional relationship of the plurality of images to be processed, to obtain the recognition result of the spectral remote sensing image to be processed for the predetermined target. Since the images to be processed were cut from the spectral remote sensing image with a predetermined size and a predetermined step, their positional relationship is known, and the recognition results can be spliced accordingly. Because each image to be processed is cut, rather than compressed, from the spectral remote sensing image, more of the image's information is retained, making the target recognition result for the spectral remote sensing image more accurate.
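A sketch of the splicing; averaging overlapping tiles is one reasonable choice when the step is smaller than the tile size, not something the text mandates:

```python
import numpy as np

def stitch_results(results, full_shape):
    """results: list of ((y, x), scores) with scores (h, w, C), as
    produced per tile by step 705; full_shape: (H, W) of the original
    spectral remote sensing image."""
    H, W = full_shape
    C = results[0][1].shape[-1]
    acc = np.zeros((H, W, C))
    cnt = np.zeros((H, W, 1))
    for (y, x), scores in results:
        h, w = scores.shape[:2]
        acc[y:y + h, x:x + w] += scores
        cnt[y:y + h, x:x + w] += 1
    return acc / np.maximum(cnt, 1)   # average where tiles overlap
```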
According to an embodiment of another aspect, an apparatus for image processing is also provided. The apparatus may be deployed on a terminal, a server, or a computing device with a certain computing power, and is used to identify a predetermined target in an image to be processed. As shown in fig. 8, the apparatus 800 for image processing may include:
a resolution adjustment unit 81 configured to perform resolution scaling on the image to be processed to obtain a first image corresponding to a first resolution;
a feature extraction unit 82 configured to extract image features of the first image on a plurality of predetermined image channels, and obtain a plurality of feature maps corresponding to the respective predetermined image channels one to one;
the convolution processing unit 83 is configured to process the plurality of feature maps through a pre-trained convolution neural network to obtain a first semantic segmentation result and a first attention map, wherein the convolution neural network comprises a basic convolution module, and a semantic convolution module and an attention convolution module which are connected with the basic convolution module in parallel, the first semantic segmentation result is an output result of the semantic convolution module and comprises probability maps respectively corresponding to target categories, a single probability map describes the probability that the first image is mapped onto the image to be processed, each pixel is identified as a single target category, the first attention map is output by the attention convolution module, and the attention map indicates the importance degree of the semantic segmentation result respectively corresponding to each pixel for the first image;
and the recognition result merging unit 84 is configured to fuse the first semantic segmentation result and other semantic segmentation results obtained through other images based on the first attention map and other attention maps corresponding to other semantic segmentation results, so as to determine a recognition result of the image to be processed about the predetermined target, wherein the other images are images corresponding to other resolutions obtained by performing resolution scaling on the image to be processed.
In one embodiment, the image to be processed is obtained by segmenting the spectral remote sensing image according to a preset scale.
In one embodiment, the apparatus 800 may further include a preprocessing unit (not shown) configured to perform at least one of the following preprocessing operations on the spectral remote sensing image before it is segmented according to the predetermined scale to obtain the image to be processed: cloud cluster removal and cloud shadow removal.
According to one possible design, the image to be processed corresponds to an initial resolution, and the resolution adjustment unit 81 is further configured to:
obtaining a first image corresponding to the first resolution through upsampling under the condition that the first resolution is greater than the initial resolution;
in the case where the first resolution is smaller than the initial resolution, a first image corresponding to the first resolution is obtained by down-sampling.
In a specific example, in a business scenario in which the predetermined target includes at least one of vegetation and a body of water, the plurality of predetermined image channels include a near-infrared channel.
According to an optional implementation, the basic convolution module includes a multi-resolution convolution layer that divides a single convolution channel into a plurality of sub-convolution channels, each corresponding to a convolution operation at a sequentially decreasing resolution; where a single sub-convolution channel connects to a lower-level sub-convolution channel, the resolution is reduced by downsampling.
In an optional embodiment, the basic convolution module further includes a fusion layer configured to fuse the convolution results of the sub-convolution channels at a unified resolution, where the resolution is unified by upsampling or downsampling and the fusion mode includes at least one of averaging, taking the maximum, and summing.
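A compact PyTorch sketch of such a multi-resolution layer and its fusion layer; the branch count, widths, and activation are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionLayer(nn.Module):
    """Parallel sub-convolution channels at successively halved
    resolutions; each lower branch is fed by downsampling."""
    def __init__(self, channels, branches=3):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(branches))

    def forward(self, x):
        outs = []
        for conv in self.convs:
            outs.append(torch.relu(conv(x)))
            x = F.avg_pool2d(x, 2)        # resolution reduced by downsampling
        return outs                       # maps at full, 1/2, 1/4, ... resolution

def fuse(outs, mode="mean"):
    """Fusion layer: unify resolution by upsampling every branch to the
    largest map, then combine by averaging, maximum, or summation."""
    H, W = outs[0].shape[-2:]
    ups = torch.stack([
        F.interpolate(o, size=(H, W), mode="bilinear", align_corners=False)
        for o in outs])
    return {"mean": ups.mean(0), "max": ups.amax(0), "sum": ups.sum(0)}[mode]
```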
In a further optional embodiment, the fusion layer further comprises a feature context representation layer, which may be implemented by object-contextual representations (OCR), for updating the feature values of the respective feature points by:
determining a pre-classification result of each feature point according to the fused feature map, wherein the pre-classification result indicates a predetermined target class corresponding to each feature point;
and for a single feature point, updating the feature value of the single feature point by using the feature value fusion value of a plurality of feature points with the same predetermined target class as the single feature point in the pre-classification result.
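A minimal NumPy sketch of this object-contextual update, assuming pre-classification by arg-max over the feature channels and mean fusion of same-class feature values (both assumptions for illustration):

```python
import numpy as np

def ocr_update(features):
    """features: (H, W, C) fused feature map. Each feature point is
    replaced by the mean of all points sharing its pre-classified
    predetermined target class."""
    H, W, C = features.shape
    flat = features.reshape(-1, C)
    classes = flat.argmax(axis=1)             # pre-classification result
    out = flat.copy()
    for c in np.unique(classes):
        mask = classes == c
        out[mask] = flat[mask].mean(axis=0)   # feature value fusion value
    return out.reshape(H, W, C)
```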
In another further alternative embodiment, the feature context representation layer is implemented by an attention mechanism for updating feature values of a single feature point by:
determining the association degree between a single characteristic point and other various characteristic points through an attention mechanism;
and selecting a plurality of characteristic points for characteristic value fusion according to the sequence of the relevance degrees from large to small, and updating the characteristic value of the single characteristic point by using the obtained characteristic value fusion value.
Optionally, the recognition result merging unit 84 is further configured, in fusing the first semantic segmentation result with the other semantic segmentation results obtained via the other images, to:
add the products of each attention map with its corresponding semantic segmentation result, and take the summation result as the recognition result of the image to be processed for the predetermined target.
In one possible design, the apparatus 800 may further include a splicing unit (not shown) configured to:
splice the recognition result of the image to be processed for the predetermined target with the recognition results of other images to be processed for the predetermined target, to obtain the recognition result of the spectral remote sensing image for the predetermined target, wherein the other images to be processed are the other images obtained by segmenting the spectral remote sensing image according to the predetermined scale.
It should be noted that the apparatus 800 shown in fig. 8 is an apparatus embodiment corresponding to the method embodiment shown in fig. 4, and the corresponding description in the method embodiment shown in fig. 4 is also applicable to the apparatus 800, and is not repeated herein.
According to an embodiment of yet another aspect, an apparatus for image processing is also provided. The apparatus may be deployed on a terminal, a server, a computing device, or a device cluster with a certain computing power, and is used to identify a predetermined target in a spectral remote sensing image. As shown in fig. 9, the apparatus 900 for image processing may include:
the image acquisition unit 91, configured to acquire the spectral remote sensing image to be processed;
the cloud removal processing unit 92, configured to detect whether cloud clusters and/or cloud shadows exist in the spectral remote sensing image to be processed, and to perform the corresponding cloud cluster removal and/or cloud shadow removal when they exist;
the cutting unit 93, configured to cut the spectral remote sensing image to be processed into a plurality of images to be processed according to a predetermined size and a predetermined step length;
the normalization unit 94, configured to perform a normalization operation for each image channel of each image to be processed, the normalization operation confining the channel value of each pixel to a predetermined range;
the semantic segmentation unit 95, configured to perform semantic segmentation, based on a pre-trained convolutional neural network, on each of the plurality of images to be processed determined by the cutting operation, to obtain each target recognition result based on semantic segmentation;
the result splicing unit 96, configured to splice the target recognition results according to the positional relationship of the plurality of images to be processed, to obtain the recognition result of the spectral remote sensing image to be processed for the predetermined target.
It should be noted that the apparatus 900 shown in fig. 9 is an apparatus embodiment corresponding to the method embodiment shown in fig. 7, and the corresponding description in the method embodiment shown in fig. 7 is also applicable to the apparatus 900, and is not repeated herein.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 4 or fig. 7.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory and a processor, the memory having stored therein executable code, the processor, when executing the executable code, implementing the method described in connection with fig. 4 or fig. 7.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments of this specification may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-mentioned embodiments are intended to explain the technical idea, technical solutions and advantages of the present specification in further detail, and it should be understood that the above-mentioned embodiments are merely specific embodiments of the technical idea of the present specification, and are not intended to limit the scope of the technical idea of the present specification, and any modification, equivalent replacement, improvement, etc. made on the basis of the technical solutions of the embodiments of the present specification should be included in the scope of the technical idea of the present specification.

Claims (17)

1. A method of image processing for identifying a predetermined target from a spectrally remote sensed image, comprising:
acquiring a spectral remote sensing image to be processed;
detecting whether cloud clusters or cloud shadows exist in the spectral remote sensing image to be processed, and performing cloud cluster removing processing or cloud shadow removing processing under the existence condition;
cutting the spectral remote sensing image to be processed into a plurality of images to be processed according to a preset size and a preset step length;
respectively executing standardization operation aiming at each image channel of each image to be processed, wherein the standardization operation is used for limiting a channel value corresponding to each pixel in a preset range aiming at a single image channel;
performing semantic segmentation on the plurality of images to be processed respectively based on a pre-trained convolutional neural network, to obtain each target recognition result based on the semantic segmentation, wherein, in the process of performing semantic segmentation on a single image to be processed, images at multiple resolutions are obtained through resolution scaling, and each semantic segmentation result is obtained by performing semantic segmentation on the image at each resolution through the convolutional neural network respectively, wherein the convolutional neural network comprises a basic convolution module, and a semantic convolution module and an attention convolution module connected with the basic convolution module in parallel; for the image at a single resolution, the output result of the semantic convolution module is the semantic segmentation result comprising probability maps respectively corresponding to the target categories, a single probability map describing, with the image at the single resolution mapped onto the corresponding image to be processed, the probability that each pixel is identified as a single target category; the attention convolution module outputs an attention map corresponding to the image at the single resolution, the attention map indicating, for the corresponding image to be processed, the importance degree of the semantic segmentation result at each pixel; and the semantic segmentation results of the images at the respective resolutions are fused based on the corresponding attention maps to obtain the single target recognition result corresponding to the single image to be processed;
and splicing the target recognition results according to the position relation of the plurality of images to be processed to obtain the recognition result of the spectral remote sensing image to be processed on the preset target.
2. A method of image processing, wherein an image to be processed is obtained based on a remote-sensing image, the method for identifying a predetermined target from the image to be processed, comprising:
carrying out resolution scaling on the image to be processed to obtain a first image corresponding to a first resolution;
extracting image features of the first image on a plurality of preset image channels to obtain a plurality of feature maps which are in one-to-one correspondence with the preset image channels;
processing the plurality of feature maps through a pre-trained convolutional neural network to obtain a first semantic segmentation result and a first attention map for the first image, wherein the convolutional neural network comprises a basic convolution module, and a semantic convolution module and an attention convolution module connected with the basic convolution module in parallel; the first semantic segmentation result is an output result of the semantic convolution module and comprises probability maps respectively corresponding to the target categories, a single probability map describing, with the first image mapped onto the image to be processed, the probability that each pixel is identified as a single target category among the predetermined targets; and the first attention map is output by the attention convolution module and indicates, for the image to be processed, the importance degree of the semantic segmentation result at each pixel;
and fusing the first semantic segmentation result with other semantic segmentation results obtained through other images based on the first attention map and the other attention maps corresponding to the other semantic segmentation results, to determine a recognition result of the image to be processed for a predetermined target, wherein the other images are images corresponding to other resolutions obtained by performing resolution scaling on the image to be processed.
3. The method according to claim 2, wherein the image to be processed is obtained by segmenting the spectral remote sensing image according to a preset scale.
4. The method according to claim 3, wherein before the image to be processed is obtained by segmenting the spectral remote sensing image according to a preset scale, at least one of the following preprocessing is carried out: cloud cluster removing processing and cloud shadow removing processing.
5. The method according to claim 2, wherein each image channel of the spectral remote sensing image corresponds to a channel mean and a channel variance determined statistically over a plurality of spectral remote sensing images; before the resolution scaling of the image to be processed, the following spectral normalization is further performed for each image channel:
subtracting the corresponding channel mean value from the channel value corresponding to each pixel on a single image channel to obtain a channel difference value corresponding to the single image channel;
and determining the normalized value of each pixel on the single image channel according to the channel difference and the channel variance corresponding to the single image channel.
6. The method of claim 2, wherein the image to be processed has an initial resolution, and wherein the scaling the image to be processed to obtain the first image corresponding to the first resolution comprises:
obtaining a first image corresponding to a first resolution through upsampling under the condition that the first resolution is greater than the initial resolution;
and obtaining a first image corresponding to the first resolution by downsampling if the first resolution is smaller than the initial resolution.
7. The method of claim 2, wherein, in a business scenario in which the predetermined target comprises at least one of vegetation and a body of water, the plurality of predetermined image channels comprise a near-infrared channel.
8. The method of claim 2, wherein the base convolution module includes a multi-resolution convolution layer that divides a single convolution channel into a plurality of sub-convolution channels, each sub-convolution channel corresponding to a convolution operation in which resolution is sequentially decreased, and the resolution is decreased by a down-sampling manner when a connection is made between the single sub-convolution channel and a lower sub-convolution channel.
9. The method according to claim 8, wherein the base convolution module further comprises a fusion layer, and the fusion layer is configured to perform uniform resolution fusion on convolution results of the convolution channels to obtain a fusion feature map, wherein the uniform resolution mode is upsampling and/or downsampling, and the fusion mode includes at least one of averaging, maximizing and summing.
10. The method of claim 9, wherein the fusion layer further comprises a feature context representation layer for updating feature values of respective feature points by:
determining a pre-classification result of each feature point according to the fused feature map, wherein the pre-classification result indicates a predetermined target class corresponding to each feature point;
and for a single feature point, updating the feature value of the single feature point by using the feature value fusion value of a plurality of feature points with the same predetermined target class as the single feature point in the pre-classification result.
11. The method of claim 9, wherein the fusion layer further comprises a feature context representation layer for updating feature values of a single feature point by:
determining the association degree between a single characteristic point and other various characteristic points through an attention mechanism;
and selecting a plurality of characteristic points for characteristic value fusion according to the sequence of the relevance degrees from large to small, and updating the characteristic value of the single characteristic point by using the obtained characteristic value fusion value.
12. The method of claim 2, wherein the fusing the first semantic segmentation result with other semantic segmentation results obtained via other images based on the first attention map and other attention maps corresponding to the other semantic segmentation results to determine a recognition result of the image to be processed with respect to a predetermined target comprises:
and adding the products of the attention maps and the corresponding semantic segmentation results, and taking the summation result as the recognition result of the image to be processed for the predetermined target.
13. The method of claim 3, further comprising:
and splicing the recognition result of the to-be-processed image about the preset target with other recognition results of other to-be-processed images about the preset target to obtain the recognition result of the spectrum remote sensing image about the preset target, wherein the other to-be-processed images are other images obtained by segmenting the spectrum remote sensing image according to a preset scale.
14. An image processing apparatus for identifying a predetermined target from a spectrum remote sensing image, comprising the following units:
an image acquisition unit: acquiring a spectral remote sensing image to be processed;
a cloud removing processing unit: detecting whether cloud clusters or cloud shadows exist in the spectral remote sensing image to be processed, and performing cloud cluster removing processing or cloud shadow removing processing under the existence condition;
a cutting unit: cutting the spectral remote sensing image to be processed into a plurality of images to be processed according to a preset size and a preset step length;
a normalization unit: respectively executing standardization operation aiming at each image channel of each image to be processed, wherein the standardization operation is used for limiting a channel value corresponding to each pixel in a preset range aiming at a single image channel;
a semantic segmentation unit: based on a pre-trained convolutional neural network, performing semantic segmentation respectively on the plurality of images to be processed determined by the cutting unit, to obtain each target recognition result based on the semantic segmentation, wherein, in the process of performing semantic segmentation on a single image to be processed, images at multiple resolutions are obtained through resolution scaling, and each semantic segmentation result is obtained by performing semantic segmentation on the image at each resolution through the convolutional neural network respectively, wherein the convolutional neural network comprises a basic convolution module, and a semantic convolution module and an attention convolution module connected with the basic convolution module in parallel; for the image at a single resolution, the output result of the semantic convolution module is the semantic segmentation result comprising probability maps respectively corresponding to the target categories, a single probability map describing, with the image at the single resolution mapped onto the corresponding image to be processed, the probability that each pixel is identified as a single target category; the attention convolution module outputs an attention map corresponding to the image at the single resolution, the attention map indicating, for the corresponding image to be processed, the importance degree of the semantic segmentation result at each pixel; and the semantic segmentation results of the images at the respective resolutions are fused based on the corresponding attention maps to obtain the single target recognition result corresponding to the single image to be processed;
and a result splicing unit: and splicing the target recognition results according to the position relation of the plurality of images to be processed to obtain the recognition result of the spectral remote sensing image to be processed on the preset target.
15. An apparatus for image processing, wherein an image to be processed is obtained based on a remote-sensing image, the apparatus for recognizing a predetermined target from the image to be processed, comprising:
the resolution adjusting unit is configured to perform resolution scaling on the image to be processed to obtain a first image corresponding to a first resolution;
the characteristic extraction unit is configured to extract image characteristics of the first image on a plurality of preset image channels to obtain a plurality of characteristic graphs corresponding to the preset image channels one by one;
a convolution processing unit, configured to process the plurality of feature maps through a pre-trained convolutional neural network to obtain a first semantic segmentation result and a first attention map, wherein the convolutional neural network comprises a basic convolution module, and a semantic convolution module and an attention convolution module connected with the basic convolution module in parallel; the first semantic segmentation result is an output result of the semantic convolution module and comprises probability maps respectively corresponding to the target categories, a single probability map describing, with the first image mapped onto the image to be processed, the probability that each pixel is identified as a single target category among the predetermined targets; and the first attention map is output by the attention convolution module and indicates, for the first image, the importance degree of the semantic segmentation result at each pixel;
and the recognition result merging unit is configured to fuse the first semantic segmentation result with other semantic segmentation results obtained through other images based on the first attention map and the other attention maps corresponding to the other semantic segmentation results, so as to determine a recognition result of the image to be processed for a predetermined target, wherein the other images are images corresponding to other resolutions obtained by performing resolution scaling on the image to be processed.
16. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-13.
17. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that, when executed by the processor, performs the method of any of claims 1-13.
CN202010923823.7A 2020-09-04 2020-09-04 Image processing method and device Active CN111814771B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011565601.9A CN112633185B (en) 2020-09-04 2020-09-04 Image processing method and device
CN202010923823.7A CN111814771B (en) 2020-09-04 2020-09-04 Image processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010923823.7A CN111814771B (en) 2020-09-04 2020-09-04 Image processing method and device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202011565601.9A Division CN112633185B (en) 2020-09-04 2020-09-04 Image processing method and device

Publications (2)

Publication Number Publication Date
CN111814771A CN111814771A (en) 2020-10-23
CN111814771B true CN111814771B (en) 2021-01-05

Family

ID=72859992

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202010923823.7A Active CN111814771B (en) 2020-09-04 2020-09-04 Image processing method and device
CN202011565601.9A Active CN112633185B (en) 2020-09-04 2020-09-04 Image processing method and device

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202011565601.9A Active CN112633185B (en) 2020-09-04 2020-09-04 Image processing method and device

Country Status (1)

Country Link
CN (2) CN111814771B (en)



Also Published As

Publication number Publication date
CN112633185B (en) 2023-04-18
CN111814771A (en) 2020-10-23
CN112633185A (en) 2021-04-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant