CN115620017A - Image feature extraction method, device, equipment and storage medium

Info

Publication number
CN115620017A
Authority
CN
China
Prior art keywords
feature map, feature, graph, original, original feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211219646.XA
Other languages
Chinese (zh)
Inventor
张振林
汤桂璇
赵起超
袁金伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Automotive Innovation Co Ltd
Original Assignee
China Automotive Innovation Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Automotive Innovation Co Ltd filed Critical China Automotive Innovation Co Ltd
Priority to CN202211219646.XA
Publication of CN115620017A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an image feature extraction method, device, equipment and storage medium, relating to the technical field of automatic driving. The method ensures that the obtained feature map contains both rich semantic information and accurate position information, and can improve the accuracy of target detection and the automatic driving experience. The scheme comprises the following steps: performing convolution processing on an image to obtain n original feature maps; performing feature extraction on the 2nd to nth original feature maps with a channel attention mechanism model to obtain channel attention feature maps; performing feature extraction on the nth original feature map with a preset spatial attention mechanism model to obtain a corresponding spatial attention feature map; performing layer-by-layer feature fusion based on the channel attention feature maps corresponding to the 2nd to nth original feature maps, the spatial attention feature map corresponding to the nth original feature map, and the 1st original feature map to obtain a target feature map of the image; and performing target detection in an automatic driving scene based on the target feature map.

Description

Image feature extraction method, device, equipment and storage medium
Technical Field
The present application relates to the field of automatic driving technologies, and in particular, to a method, an apparatus, a device, and a storage medium for extracting features of an image.
Background
In special scenes with complex environments, such as driving scenes and industrial or mining scenes, the requirements on the real-time performance and accuracy of target detection are very high. Feature extraction from images is the basis of computation-intensive downstream tasks such as target detection and image segmentation, and the accuracy of the image feature extraction result directly affects the accuracy of these downstream tasks.
At present, most image feature extraction methods obtain feature maps of an image through a convolutional neural network. However, the feature maps obtained by such methods have rich semantic information but blurred position information, which affects the accuracy of target detection and in turn the automatic driving experience.
Disclosure of Invention
The application provides an image feature extraction method, device, equipment and storage medium, which can ensure that the obtained feature map contains both rich semantic information and accurate position information, and can improve the accuracy of target detection and the automatic driving experience.
To achieve the above purpose, the following technical scheme is adopted:
in a first aspect of the embodiments of the present application, a method for extracting features of an image is provided, where the method includes:
acquiring an image to be processed;
carrying out convolution processing on the image by using n convolution structures in a preset convolution neural network model to obtain n original characteristic graphs, wherein n is an integer larger than 4;
performing feature extraction on each original feature map in the 2 nd to nth original feature maps by using a preset channel attention mechanism model to obtain a channel attention feature map corresponding to each original feature map in the 2 nd to nth original feature maps;
performing feature extraction on the nth original feature map by using a preset spatial attention mechanism model to obtain a spatial attention feature map corresponding to the nth original feature map;
performing layer-by-layer feature fusion on the basis of the channel attention feature map corresponding to the 2 nd to nth original feature maps, the space attention feature map corresponding to the nth original feature map and the 1 st original feature map to obtain a target feature map of the image;
and carrying out target detection in the automatic driving scene based on the target characteristic diagram.
In one embodiment, performing layer-by-layer feature fusion based on a channel attention feature map corresponding to the 2 nd to nth original feature maps, a spatial attention feature map corresponding to the nth original feature map, and the 1 st original feature map to obtain a target feature map of an image, includes:
determining a fusion graph corresponding to the nth original feature graph based on the spatial attention feature graph and the channel attention feature graph corresponding to the nth original feature graph;
and performing layer-by-layer feature fusion based on the fusion graph and the space attention feature graph corresponding to the nth original feature graph, and the channel attention feature graph and the 1 st original feature graph corresponding to the (n-1) th to the 2 nd original feature graphs to obtain a target feature graph of the image.
In one embodiment, performing feature fusion layer by layer based on a fusion graph and a spatial attention feature graph corresponding to an nth original feature graph, a channel attention feature graph corresponding to an (n-1) th to a (2) nd original feature graph, and the 1 st original feature graph to obtain a target feature graph of an image, including:
performing layer-by-layer feature fusion on the basis of a fusion graph corresponding to the nth original feature graph, the spatial attention feature graph and the channel attention feature graphs corresponding to the (n-1) th to 2 nd original feature graphs to obtain a fusion graph corresponding to the 2 nd original feature graph;
and obtaining a target feature map of the image based on the fusion map corresponding to the 2 nd original feature map and the 1 st original feature map.
In one embodiment, performing feature fusion layer by layer based on a fusion map corresponding to an nth original feature map, a spatial attention feature map, and a channel attention feature map corresponding to an n-1 st to a 2 nd original feature maps to obtain a fusion map corresponding to the 2 nd original feature map, includes:
executing at least one fusion processing pass starting from i = n-1 until i = 2, to obtain a fusion map corresponding to the 2nd original feature map, wherein i takes integer values from (n-1) down to 2;
wherein each fusion processing pass comprises the following step: obtaining a fusion map corresponding to the ith original feature map according to the channel attention feature map corresponding to the ith original feature map, the spatial attention feature map corresponding to the nth original feature map, and the fusion map corresponding to the (i+1)th original feature map, wherein i takes integer values from (n-1) down to 2 in sequence.
In one embodiment, obtaining a fusion map corresponding to an ith original feature map according to a channel attention feature map corresponding to the ith original feature map, a spatial attention feature map corresponding to an nth original feature map, and a fusion map corresponding to an (i + 1) th original feature map includes:
performing deconvolution processing on the fusion graph corresponding to the (i + 1) th original feature graph to obtain a reference graph corresponding to the (i + 1) th original feature graph;
carrying out up-sampling processing on a spatial attention feature map corresponding to the nth original feature map to obtain a middle map;
and performing information integration processing on the reference map corresponding to the (i+1)th original feature map, the intermediate map, and the channel attention feature map corresponding to the ith original feature map, to obtain a fusion map corresponding to the ith original feature map.
In one embodiment, obtaining the target feature map of the image based on the fusion map corresponding to the 2 nd original feature map and the 1 st original feature map includes:
performing up-sampling on the fusion graph corresponding to the 2 nd original characteristic graph to obtain a reference graph corresponding to the 2 nd original characteristic graph;
and performing information integration processing on the reference image corresponding to the 2 nd original feature image and the 1 st original feature image to obtain a target feature image.
In one embodiment, determining a fusion map corresponding to the nth original feature map based on the spatial attention feature map and the channel attention feature map corresponding to the nth original feature map comprises:
and performing information integration processing on the space attention feature map corresponding to the nth original feature map and the channel attention feature map corresponding to the nth original feature map to obtain a fusion map corresponding to the nth original feature map.
The embodiment of the application provides a feature extraction device of image, the device includes:
the acquisition module is used for acquiring an image to be processed;
the convolution module is used for carrying out convolution processing on the image by utilizing n convolution structures in a preset convolution neural network model to obtain n original feature maps, wherein n is an integer larger than 4;
the first processing module is used for performing feature extraction on each of the 2 nd to nth original feature maps by using a preset channel attention mechanism model to obtain a channel attention feature map corresponding to each of the 2 nd to nth original feature maps;
the second processing module is used for extracting the features of the nth original feature map by using a preset spatial attention mechanism model to obtain a spatial attention feature map corresponding to the nth original feature map;
the determining module is used for carrying out layer-by-layer feature fusion on the basis of a channel attention feature map corresponding to the 2 nd to nth original feature maps, a space attention feature map corresponding to the nth original feature map and the 1 st original feature map to obtain a target feature map of the image;
and the detection module is used for detecting the target in the automatic driving scene based on the target characteristic diagram.
In a third aspect of the embodiments of the present application, a computer device is provided, where the computer device includes a memory and a processor, and the memory stores a computer program, and the computer program, when executed by the processor, implements the feature extraction method for an image in the first aspect of the embodiments of the present application.
In a fourth aspect of the embodiments of the present application, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements the method for extracting features of an image in the first aspect of the embodiments of the present application.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
the method for extracting features of an image, provided by the embodiment of the present application, includes obtaining an image to be processed, performing convolution processing on the image by using n convolution structures in a preset convolution neural network model to obtain n original feature maps, where n is an integer greater than 4, then performing feature extraction on each original feature map from 2 nd to nth original feature maps by using a preset channel attention mechanism model to obtain a channel attention feature map corresponding to each original feature map from 2 nd to nth original feature maps, and performing feature extraction on the nth original feature map by using a preset spatial attention mechanism model to obtain a spatial attention feature map corresponding to the nth original feature map. And finally, performing layer-by-layer feature fusion based on the channel attention feature map corresponding to the 2 nd to nth original feature maps, the spatial attention feature map corresponding to the nth original feature map and the 1 st original feature map to obtain a target feature map of the image. Because the deep feature map has rich semantic information and the shallow feature map has accurate position information, the target feature map obtained by performing layer-by-layer feature fusion on the channel attention feature map corresponding to the 2 nd to nth original feature maps, the spatial attention feature map corresponding to the nth original feature map and the 1 st original feature map comprises both rich semantic information and accurate position information. Further, when the target feature map is used for target detection, the accuracy of target detection can be improved, and the experience of automatic driving can be improved.
Drawings
Fig. 1 is a schematic internal structural diagram of a computer device according to an embodiment of the present application;
fig. 2 is a first flowchart of a method for extracting features of an image according to an embodiment of the present disclosure;
fig. 3 is a second flowchart of a method for extracting features of an image according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a feature extraction process of an image according to an embodiment of the present application;
fig. 5 is a structural diagram of an image feature extraction device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the following, the terms "first", "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present disclosure, "a plurality" means two or more unless otherwise specified.
In addition, the use of "based on" or "according to" means open and inclusive, as a process, step, calculation, or other action that is "based on" or "according to" one or more conditions or values may in practice be based on additional conditions or values beyond those that are present.
In special scenes with complex environments, such as driving scenes and industrial or mining scenes, the requirements on the real-time performance and accuracy of target detection are very high. Feature extraction from images is the basis of computation-intensive downstream tasks such as target detection and image segmentation, and the accuracy of the image feature extraction result directly affects the accuracy of these subsequent downstream tasks.
At present, most image feature extraction methods obtain feature maps of an image through a convolutional neural network. However, the feature maps obtained by such methods have rich semantic information but blurred position information, which affects the accuracy of target detection and in turn the automatic driving experience.
In order to solve the above problem, an embodiment of the present application provides a method for extracting features of an image, where an image to be processed is obtained, n original feature maps are obtained by performing convolution processing on the image by using n convolution structures in a preset convolution neural network model, n is an integer greater than 4, then feature extraction is performed on each of the 2 nd to nth original feature maps by using a preset channel attention mechanism model, so as to obtain a channel attention feature map corresponding to each of the 2 nd to nth original feature maps, and feature extraction is performed on the nth original feature map by using a preset spatial attention mechanism model, so as to obtain a spatial attention feature map corresponding to the nth original feature map. And finally, performing layer-by-layer feature fusion based on the channel attention feature map corresponding to the 2 nd to nth original feature maps, the spatial attention feature map corresponding to the nth original feature map and the 1 st original feature map to obtain a target feature map of the image. Because the deep feature map has rich semantic information and the shallow feature map has accurate position information, the target feature map obtained by performing layer-by-layer feature fusion on the channel attention feature map corresponding to the 2 nd to nth original feature maps, the spatial attention feature map corresponding to the nth original feature map and the 1 st original feature map comprises both rich semantic information and accurate position information. Further, when the target feature map is used for target detection, the accuracy of target detection can be improved, and the experience of automatic driving can be improved.
In addition, the target feature map obtained by the image feature extraction method provided by the embodiment of the application has the same effect when used for detecting a travelable area, detecting a lane line and the like in an automatic driving scene.
An execution subject of the image feature extraction method provided by the embodiments of the present application may be a computer device, which may be a terminal device or a server; the terminal device may be a personal computer, a notebook computer, a smart phone, a tablet computer, a portable wearable device, and the like, which is not specifically limited in this application.
Fig. 1 is a schematic internal structural diagram of a computer device according to an embodiment of the present disclosure. As shown in fig. 1, the computer device includes a processor and a memory connected by a system bus. Wherein the processor is configured to provide computational and control capabilities. The memory may include a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The computer program can be executed by a processor to implement the steps of a method for extracting features of an image provided in the above embodiments. The internal memory provides a cached execution environment for the operating system and computer programs in the non-volatile storage medium.
Those skilled in the art will appreciate that the architecture shown in fig. 1 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
Based on the execution main body, the embodiment of the application provides a feature extraction method for an image. As shown in fig. 2, the method comprises the steps of:
step 201, acquiring an image to be processed.
In an automatic driving scenario, the image to be processed is acquired by a vehicle-mounted camera. The resolution of the image is usually 1280 × 720, and the image may include information such as vehicles, pedestrians, the road surface and lane lines. Target detection usually needs to be performed on the vehicles, pedestrians and other obstacles contained in such images.
Step 202, performing convolution processing on the image by using n convolution structures in a preset convolution neural network model to obtain n original feature maps, wherein n is an integer larger than 4.
The 1st to nth original feature maps are obtained sequentially through the preset convolutional neural network model. The height and width of each layer's feature map are usually 1/2 of those of the previous layer, and its number of channels is usually 2 times that of the previous layer; the exact values are determined by the backbone network and are not specifically limited in this application.
In the following description, the whole process is described by taking n =5 as an example, but the present application is not limited to n =5.
If n = 5, the number of convolution structures is 5 and the number of original feature maps is 5. The 1st convolution structure outputs the 1st original feature map, the 2nd convolution structure outputs the 2nd original feature map, the 3rd convolution structure outputs the 3rd original feature map, the 4th convolution structure outputs the 4th original feature map, and the 5th convolution structure outputs the 5th original feature map.
The process is as follows: the image is input into the 1st convolution structure to obtain the 1st original feature map, the 1st original feature map is input into the 2nd convolution structure to obtain the 2nd original feature map, the 2nd original feature map is input into the 3rd convolution structure to obtain the 3rd original feature map, the 3rd original feature map is input into the 4th convolution structure to obtain the 4th original feature map, and finally the 4th original feature map is input into the 5th convolution structure to obtain the 5th original feature map.
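As a non-limiting illustration of step 202, the following PyTorch-style sketch cascades n = 5 convolution structures, each halving the height and width and doubling the channel count. The block structure, channel counts and module names are assumptions made for this sketch and are not the exact backbone of this application.

```python
import torch
import torch.nn as nn

class ConvStage(nn.Module):
    """One convolution structure: halves height/width, changes the channel count."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class Backbone(nn.Module):
    """Cascade of n = 5 convolution structures; returns the 5 original feature maps."""
    def __init__(self, in_ch=3, base_ch=64):
        super().__init__()
        chs = [base_ch * 2 ** i for i in range(5)]          # e.g. 64, 128, 256, 512, 1024
        ins = [in_ch] + chs[:-1]
        self.stages = nn.ModuleList([ConvStage(i, o) for i, o in zip(ins, chs)])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)                                     # each stage feeds the next one
            feats.append(x)                                  # c1 ... c5
        return feats

# feats = Backbone()(torch.randn(1, 3, 384, 640))
# feats[0] is the 1st original feature map (1/2 scale), feats[4] the 5th (1/32 scale, here 12 x 20)
```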
Step 203, performing feature extraction on each of the 2nd to nth original feature maps by using the preset channel attention mechanism model, to obtain a channel attention feature map corresponding to each of the 2nd to nth original feature maps.
That is to say, the 2nd original feature map, the 3rd original feature map, the 4th original feature map and the 5th original feature map are respectively input into a model based on a channel attention mechanism for feature extraction, so as to obtain the 2nd channel attention feature map corresponding to the 2nd original feature map, the 3rd channel attention feature map corresponding to the 3rd original feature map, the 4th channel attention feature map corresponding to the 4th original feature map, and the 5th channel attention feature map corresponding to the 5th original feature map. Compared with the original features, the extracted features are more selectively focused on informative channels; selecting the 2nd to nth original feature maps therefore allows semantic information at multiple scales to be captured.
Specifically, in the model based on the channel attention mechanism, a feature map of size (H, W, C) is first converted into a feature map of size (1, 1, C) using global average pooling, which is computed as follows:

p_c = (1 / (H × W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} x_c(i, j)

where H, W and C are respectively the height, width and number of channels of the feature map, and x_c(i, j) is the value at row i, column j of channel c. Attention weights between channels are then obtained with two fully connected layers, normalized weights are obtained with a sigmoid function, and the normalized weights are finally multiplied with the original feature map along the channel dimension to obtain the channel attention feature map:

s_c = Sigmoid(fc2(ReLU(fc1(p_c, w1)), w2))
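For illustration, a minimal PyTorch sketch of such a channel attention module is given below; the reduction ratio r and the module name are assumptions made for this sketch rather than parameters fixed by this application.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: global average pooling, two FC layers, sigmoid weighting."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # global average pooling -> (B, C, 1, 1)
        self.fc1 = nn.Linear(channels, channels // r)
        self.fc2 = nn.Linear(channels // r, channels)

    def forward(self, x):
        b, c, _, _ = x.shape
        p = self.pool(x).view(b, c)                                 # p_c for every channel
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(p))))        # normalized channel weights s_c
        return x * s.view(b, c, 1, 1)                               # re-weight the original map per channel

# cam = ChannelAttention(256)(torch.randn(1, 256, 48, 80))
```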
Step 204, performing feature extraction on the nth original feature map by using the preset spatial attention mechanism model, to obtain a spatial attention feature map corresponding to the nth original feature map.
The 5th original feature map is input into the preset model based on a spatial attention mechanism for feature extraction, so as to obtain the 5th spatial attention feature map corresponding to the 5th original feature map. Compared with the original features, the extracted features are more selectively focused on informative spatial locations. The nth original feature map is chosen because its size is small, so adding the attention module to it adds little computation; this improves the efficiency of feature extraction while still ensuring that the obtained feature map includes both rich semantic information and accurate position information.
The attention module of the model based on the spatial attention mechanism uses RCCA (Recurrent Criss-Cross Attention), i.e., a recurrent criss-cross attention module. A single CCA module only attends to the criss-cross-shaped spatial region lying in the same row and column as the current element; although this captures context in both the horizontal and vertical directions, the resulting attention is still sparse. RCCA obtains context over a wider range by applying the CCA module repeatedly in a criss-cross, iterative manner. By default, RCCA executes CCA twice: after the first CCA pass, attention weights in the horizontal and vertical directions are obtained, and in the second pass the current row and column already contain the context gathered in the previous pass, so global semantic information can be obtained indirectly with only two CCA iterations. This yields globally rich semantic information while keeping feature extraction efficient, and RCCA also has advantages in computation speed and memory consumption.
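A full RCCA implementation is outside the scope of this sketch. As a simplified stand-in, the following CBAM-style spatial attention block merely illustrates the general idea of re-weighting spatial locations of the deepest original feature map; it is an illustrative assumption and is not the recurrent criss-cross attention module described above.

```python
import torch
import torch.nn as nn

class SimpleSpatialAttention(nn.Module):
    """CBAM-style spatial attention (illustrative stand-in, not RCCA)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg_map = x.mean(dim=1, keepdim=True)            # (B, 1, H, W) channel-wise mean
        max_map = x.max(dim=1, keepdim=True).values      # (B, 1, H, W) channel-wise max
        weights = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * weights                               # re-weight spatial locations

# sam = SimpleSpatialAttention()(torch.randn(1, 1024, 12, 20))
```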
Step 205, performing layer-by-layer feature fusion based on the channel attention feature maps corresponding to the 2nd to nth original feature maps, the spatial attention feature map corresponding to the nth original feature map, and the 1st original feature map, to obtain a target feature map of the image.
Layer-by-layer feature fusion is performed using the 2nd channel attention feature map, the 3rd channel attention feature map, the 4th channel attention feature map, the 5th spatial attention feature map and the 1st original feature map to obtain the target feature map of the image. An attention mechanism is added on top of the basic feature pyramid network to enhance the feature extraction effect. At the same time, the shallow layers of the network carry strong position information while the deep layers carry strong semantic information. The original input is an image, and the feature maps are a series of matrices that describe that input image.
And step 206, performing target detection in the automatic driving scene based on the target feature map.
The feature extraction method provided by the embodiment of the application can be used for obtaining the target feature map of the image, and tasks such as target detection, semantic segmentation and the like can be performed based on the target feature map.
The embodiment of the application provides an image feature extraction method: an image to be processed is acquired; convolution processing is performed on the image by using n convolution structures in a preset convolutional neural network model to obtain n original feature maps, where n is an integer greater than 4; feature extraction is then performed on each of the 2nd to nth original feature maps by using a preset channel attention mechanism model to obtain a channel attention feature map corresponding to each of the 2nd to nth original feature maps, and feature extraction is performed on the nth original feature map by using a preset spatial attention mechanism model to obtain a spatial attention feature map corresponding to the nth original feature map. Finally, layer-by-layer feature fusion is performed based on the channel attention feature maps corresponding to the 2nd to nth original feature maps, the spatial attention feature map corresponding to the nth original feature map, and the 1st original feature map to obtain a target feature map of the image. Because deep feature maps carry rich semantic information and shallow feature maps carry accurate position information, the target feature map obtained by this layer-by-layer fusion includes both rich semantic information and accurate position information. Consequently, using the target feature map for tasks such as target detection or image segmentation improves the accuracy of those tasks. In addition, since convolution limits the size of the receptive field, a feature map obtained after multiple convolutions is still deficient in global spatial information; the attention mechanism models compensate for the limited receptive field of the convolutional network.
Optionally, as shown in fig. 3, in the step 205, performing layer-by-layer feature fusion based on the channel attention feature map corresponding to the 2 nd to nth original feature maps, the spatial attention feature map corresponding to the nth original feature map, and the 1 st original feature map, and obtaining the target feature map of the image may be:
step 301, determining a fusion map corresponding to the nth original feature map based on the spatial attention feature map and the channel attention feature map corresponding to the nth original feature map.
And step 302, performing layer-by-layer feature fusion based on the fusion graph and the space attention feature graph corresponding to the nth original feature graph, the channel attention feature graph corresponding to the (n-1) th to (2) th original feature graphs and the 1 st original feature graph to obtain a target feature graph of the image.
Optionally, the process of step 302 may be: and performing layer-by-layer feature fusion on the basis of the fusion graph corresponding to the nth original feature graph, the spatial attention feature graph and the channel attention feature graphs corresponding to the (n-1) th to 2 nd original feature graphs to obtain a fusion graph corresponding to the 2 nd original feature graph, and obtaining a target feature graph of the image on the basis of the fusion graph corresponding to the 2 nd original feature graph and the 1 st original feature graph.
Specifically, the process of performing layer-by-layer feature fusion based on the fusion map corresponding to the nth original feature map, the spatial attention feature map, and the channel attention feature maps corresponding to the (n-1) th to 2 nd original feature maps to obtain the fusion map corresponding to the 2 nd original feature map may be:
executing at least one fusion processing pass starting from i = n-1 until i = 2, to obtain a fusion map corresponding to the 2nd original feature map, wherein i takes integer values from (n-1) down to 2;
wherein each fusion processing pass comprises the following step: obtaining a fusion map corresponding to the ith original feature map according to the channel attention feature map corresponding to the ith original feature map, the spatial attention feature map corresponding to the nth original feature map, and the fusion map corresponding to the (i+1)th original feature map, wherein i takes integer values from (n-1) down to 2 in sequence.
Taking n =5 as an example, the implementation process of the above step 301 and step 302 may be: firstly, a 5 th fusion map is obtained according to the 5 th channel attention feature map and the 5 th spatial attention feature map. And obtaining a 4 th fusion map according to the 5 th fusion map, the 4 th channel attention feature map and the 5 th spatial attention feature map. And obtaining a 3 rd fusion map according to the 4 th fusion map, the 3 rd channel attention feature map and the 5 th spatial attention feature map. And obtaining a 2 nd fusion map according to the 3 rd fusion map, the 2 nd channel attention feature map and the 5 th spatial attention feature map. And finally, obtaining a target feature map of the image according to the 2 nd fusion map and the 1 st original feature map. And performing layer-by-layer feature fusion on the channel attention feature map corresponding to the 2 nd to nth original feature maps, the spatial attention feature map corresponding to the nth original feature map and the 1 st original feature map to obtain a target feature map which comprises rich semantic information and accurate position information.
Optionally, in the step 301, based on the spatial attention feature map and the channel attention feature map corresponding to the nth original feature map, the process of determining the fusion map corresponding to the nth original feature map may be:
and performing information integration processing on the space attention feature map corresponding to the nth original feature map and the channel attention feature map corresponding to the nth original feature map to obtain a fusion map corresponding to the nth original feature map.
The information integration processing may be an element-wise addition (add) operation. Illustratively, add processing is performed on the 5th spatial attention feature map and the 5th channel attention feature map to obtain the 5th fusion map.
Optionally, the process of obtaining the target feature map of the image based on the fusion map corresponding to the 2 nd original feature map and the 1 st original feature map may be:
and performing up-sampling on the fusion graph corresponding to the 2 nd original feature graph to obtain a reference graph corresponding to the 2 nd original feature graph, and performing information integration processing on the reference graph corresponding to the 2 nd original feature graph and the 1 st original feature graph to obtain a target feature graph.
In an example, the 2 nd fusion graph is up-sampled to obtain the 2 nd reference graph. And then performing add processing on the 2 nd reference image and the 1 st original feature image to obtain a target feature image.
Optionally, the process of obtaining the fusion map corresponding to the ith original feature map according to the channel attention feature map corresponding to the ith original feature map, the spatial attention feature map corresponding to the nth original feature map, and the fusion map corresponding to the (i + 1) th original feature map may be:
performing deconvolution processing on the fusion map corresponding to the (i+1)th original feature map to obtain a reference map corresponding to the (i+1)th original feature map; performing up-sampling processing on the spatial attention feature map corresponding to the nth original feature map to obtain an intermediate map; and performing information integration processing on the reference map corresponding to the (i+1)th original feature map, the intermediate map, and the channel attention feature map corresponding to the ith original feature map, to obtain a fusion map corresponding to the ith original feature map.
Taking n = 5 as an example, the process of obtaining the 4th to 2nd fusion maps may be as follows:
Deconvolution processing is performed on the 5th fusion map to obtain a 5th reference map, up-sampling processing is performed on the 5th spatial attention feature map to obtain an intermediate map, and add processing is performed on the 5th reference map, the intermediate map and the 4th channel attention feature map to obtain the 4th fusion map. Deconvolution processing is performed on the 4th fusion map to obtain a 4th reference map, up-sampling processing is performed on the 5th spatial attention feature map to obtain an intermediate map, and add processing is performed on the 4th reference map, the intermediate map and the 3rd channel attention feature map to obtain the 3rd fusion map. Deconvolution processing is performed on the 3rd fusion map to obtain a 3rd reference map, up-sampling processing is performed on the 5th spatial attention feature map to obtain an intermediate map, and add processing is performed on the 3rd reference map, the intermediate map and the 2nd channel attention feature map to obtain the 2nd fusion map. In this way, the 4th to 2nd fusion maps are obtained.
The whole process of performing layer-by-layer feature fusion using the 2nd channel attention feature map, the 3rd channel attention feature map, the 4th channel attention feature map, the 5th spatial attention feature map and the 1st original feature map to obtain the target feature map of the image is shown in fig. 4. The values in the 20x12 and 40x24 columns in fig. 4 indicate the resolution of the feature maps, and the values in the 1/2 and 1/4 columns indicate the size of the feature maps relative to the original image. C1, c2, c3, c4 and c5 in fig. 4 correspond to the 1st, 2nd, 3rd, 4th and 5th original feature maps respectively, and p2, p3, p4 and p5 correspond to the 2nd, 3rd, 4th and 5th fusion maps respectively. P1 in fig. 4 is the final target feature map. Head1, Head2 and Head3 in fig. 4 are the task heads into which the target feature map is input; the tasks may be a target detection task and a semantic segmentation task. SAM in fig. 4 denotes the spatial attention model and CAM the channel attention model.
Compared with the feature map obtained by a conventional feature pyramid network, the target feature map obtained by the image feature extraction method provided by the embodiment of the application improves the processing effect of both the target detection task and the semantic segmentation task, as shown in Table 1.

TABLE 1 Comparison of experimental results

Model                                                         | Target detection mAP | Semantic segmentation mIoU
Feature map obtained by a traditional feature pyramid network | 0.401                | 0.371
Feature map obtained by the method of this application        | 0.490                | 0.383
As shown in fig. 5, an embodiment of the present application further provides an apparatus for extracting features of an image, where the apparatus includes:
an obtaining module 11, configured to obtain an image to be processed;
the convolution module 12 is configured to perform convolution processing on the image by using n convolution structures in a preset convolution neural network model to obtain n original feature maps, where n is an integer greater than 4;
the first processing module 13 is configured to perform feature extraction on each of the 2 nd to nth original feature maps by using a preset channel attention mechanism model to obtain a channel attention feature map corresponding to each of the 2 nd to nth original feature maps;
the second processing module 14 is configured to perform feature extraction on the nth original feature map by using a preset spatial attention mechanism model to obtain a spatial attention feature map corresponding to the nth original feature map;
the determining module 15 is configured to perform layer-by-layer feature fusion on the channel attention feature map corresponding to the 2 nd to nth original feature maps, the spatial attention feature map corresponding to the nth original feature map, and the 1 st original feature map to obtain a target feature map of the image;
and the detection module 16 is used for detecting the target in the automatic driving scene based on the target feature map.
In one embodiment, the determining module 15 is specifically configured to:
determining a fusion graph corresponding to the nth original feature graph based on the spatial attention feature graph and the channel attention feature graph corresponding to the nth original feature graph;
and performing layer-by-layer feature fusion based on the fusion graph and the space attention feature graph corresponding to the nth original feature graph, the channel attention feature graph corresponding to the (n-1) th to the (2) th original feature graphs and the 1 st original feature graph to obtain a target feature graph of the image.
In one embodiment, the determining module 15 is specifically configured to:
performing layer-by-layer feature fusion on the basis of a fusion graph corresponding to the nth original feature graph, the spatial attention feature graph and the channel attention feature graphs corresponding to the (n-1) th to 2 nd original feature graphs to obtain a fusion graph corresponding to the 2 nd original feature graph;
and obtaining a target feature map of the image based on the fusion map corresponding to the 2 nd original feature map and the 1 st original feature map.
In one embodiment, the determining module 15 is specifically configured to:
executing at least one fusion processing pass starting from i = n-1 until i = 2, to obtain a fusion map corresponding to the 2nd original feature map, wherein i takes integer values from (n-1) down to 2;
wherein each fusion processing pass comprises the following step: obtaining a fusion map corresponding to the ith original feature map according to the channel attention feature map corresponding to the ith original feature map, the spatial attention feature map corresponding to the nth original feature map, and the fusion map corresponding to the (i+1)th original feature map, wherein i takes integer values from (n-1) down to 2 in sequence.
In one embodiment, the determining module 15 is specifically configured to:
performing deconvolution processing on the fusion graph corresponding to the (i + 1) th original feature graph to obtain a reference graph corresponding to the (i + 1) th original feature graph;
carrying out up-sampling processing on a spatial attention feature map corresponding to the nth original feature map to obtain a middle map;
and performing information integration processing on the reference map corresponding to the (i+1)th original feature map, the intermediate map, and the channel attention feature map corresponding to the ith original feature map, to obtain a fusion map corresponding to the ith original feature map.
In one embodiment, the determining module 15 is specifically configured to:
performing up-sampling on the fusion graph corresponding to the 2 nd original characteristic graph to obtain a reference graph corresponding to the 2 nd original characteristic graph;
and performing information integration processing on the reference image corresponding to the 2 nd original feature image and the 1 st original feature image to obtain a target feature image.
In one embodiment, the determining module 15 is specifically configured to:
and performing information integration processing on the space attention feature map corresponding to the nth original feature map and the channel attention feature map corresponding to the nth original feature map to obtain a fusion map corresponding to the nth original feature map.
The image feature extraction device provided in this embodiment may implement the method embodiments described above, and the implementation principle and the technical effect are similar, which is not described herein again.
For the specific definition of the image feature extraction device, reference may be made to the definition of the image feature extraction method above, which is not repeated here. Each module in the image feature extraction device may be implemented wholly or partially by software, hardware or a combination thereof. The modules may be embedded, in hardware form, in or independent of a processor in the server, or stored in software form in a memory in the server, so that the processor can invoke and execute the operations corresponding to each module.
In another embodiment of the present application, there is also provided a computer device including a memory and a processor, the memory storing a computer program, and the computer program being executed by the processor to implement the steps of the feature extraction method of an image according to the embodiment of the present application.
In another embodiment of the present application, a computer-readable storage medium is further provided, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the feature extraction method of an image according to the embodiment of the present application.
In another embodiment of the present application, a computer program product is also provided, where the computer program product includes computer instructions that, when executed on a feature extraction apparatus for an image, cause the feature extraction apparatus for the image to perform the steps performed by the feature extraction method for the image in the method flow shown in the above-mentioned method embodiment.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When software is used, the implementation may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid-state disk (SSD)), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above examples only express several embodiments of the present application, and the description thereof is more specific and detailed, but not to be construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, and these are all within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method for extracting features of an image, the method comprising:
acquiring an image to be processed;
carrying out convolution processing on the image by using n convolution structures in a preset convolution neural network model to obtain n original characteristic graphs, wherein n is an integer larger than 4;
performing feature extraction on each original feature map in the 2 nd to nth original feature maps by using a preset channel attention mechanism model to obtain a channel attention feature map corresponding to each original feature map in the 2 nd to nth original feature maps;
performing feature extraction on the nth original feature map by using a preset spatial attention mechanism model to obtain a spatial attention feature map corresponding to the nth original feature map;
performing layer-by-layer feature fusion on the basis of the channel attention feature map corresponding to the 2 nd to nth original feature maps, the spatial attention feature map corresponding to the nth original feature map and the 1 st original feature map to obtain a target feature map of the image;
and performing target detection in an automatic driving scene based on the target feature map.
2. The method according to claim 1, wherein the obtaining of the target feature map of the image based on layer-by-layer feature fusion of the channel attention feature map corresponding to the 2 nd to nth original feature maps, the spatial attention feature map corresponding to the nth original feature map, and the 1 st original feature map comprises:
determining a fusion graph corresponding to the nth original feature graph based on the spatial attention feature graph and the channel attention feature graph corresponding to the nth original feature graph;
and performing layer-by-layer feature fusion on the basis of the fusion graph and the space attention feature graph corresponding to the nth original feature graph, the channel attention feature graph corresponding to the (n-1) th to the (2) th original feature graphs and the 1 st original feature graph to obtain the target feature graph of the image.
3. The method according to claim 2, wherein performing layer-by-layer feature fusion based on the fusion map and the spatial attention feature map corresponding to the nth original feature map, the channel attention feature map corresponding to the (n-1) th to 2 nd original feature maps, and the 1 st original feature map to obtain the target feature map of the image comprises:
performing layer-by-layer feature fusion on the basis of a fusion graph corresponding to the nth original feature graph, the spatial attention feature graph and the channel attention feature graphs corresponding to the (n-1) th to 2 nd original feature graphs to obtain a fusion graph corresponding to the 2 nd original feature graph;
and obtaining a target feature map of the image based on the fusion map corresponding to the 2 nd original feature map and the 1 st original feature map.
4. The method according to claim 3, wherein performing layer-by-layer feature fusion based on the fusion map corresponding to the nth original feature map, the spatial attention feature map, and the channel attention feature maps corresponding to the (n-1) th to 2 nd original feature maps to obtain a fusion map corresponding to the 2 nd original feature map comprises:
starting to execute at least one fusion processing process from the condition that i is n-1 until i is 2, and obtaining a fusion graph corresponding to the 2 nd original characteristic graph, wherein i is an integer from (n-1) to 2;
the mth fusion processing process comprises the following steps: and obtaining a fusion graph corresponding to the ith original feature map according to the channel attention feature map corresponding to the ith original feature map, the space attention feature map corresponding to the nth original feature map and the fusion graph corresponding to the (i + 1) th original feature map, wherein i is an integer from (n-1) to 2 in sequence.
5. The method according to claim 4, wherein obtaining the fusion map corresponding to the ith original feature map according to the channel attention feature map corresponding to the ith original feature map, the spatial attention feature map corresponding to the nth original feature map, and the fusion map corresponding to the (i+1)th original feature map comprises:
performing deconvolution processing on the fusion map corresponding to the (i+1)th original feature map to obtain a reference map corresponding to the (i+1)th original feature map;
performing up-sampling processing on the spatial attention feature map corresponding to the nth original feature map to obtain an intermediate map;
and performing information integration processing on the reference map corresponding to the (i+1)th original feature map, the intermediate map, and the channel attention feature map corresponding to the ith original feature map to obtain the fusion map corresponding to the ith original feature map.
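One fusion step of claim 5 names three operations: deconvolution of the (i+1)th fusion map, up-sampling of the nth spatial attention map, and an unspecified information integration. A hedged sketch follows, assuming every pyramid level has the same channel count, each level has twice the resolution of the next deeper one, and integration is concatenation followed by a 1x1 convolution; none of these choices are stated in the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseStep(nn.Module):
    """Illustrative per-level fusion (claim 5); internals are assumptions."""
    def __init__(self, channels):
        super().__init__()
        # Deconvolution: maps the (i+1)th fusion map up to the ith level's resolution,
        # assuming a fixed stride-2 relation between adjacent levels.
        self.deconv = nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2)
        # Assumed information integration: concatenate the three maps, mix with 1x1 conv.
        self.integrate = nn.Conv2d(channels * 3, channels, kernel_size=1)

    def forward(self, chan_i, spatial_n, fused_next):
        # Reference map corresponding to the (i+1)th original feature map.
        ref = self.deconv(fused_next)
        # Intermediate map: nth spatial attention map up-sampled to the ith resolution.
        inter = F.interpolate(spatial_n, size=chan_i.shape[-2:], mode="nearest")
        # Fusion map corresponding to the ith original feature map.
        return self.integrate(torch.cat([ref, inter, chan_i], dim=1))
```

With these assumptions, `fuse_top_down(fused_n, spatial_n, chan, n, FuseStep(channels))` from the sketch after claim 4 would yield the 2nd-level fusion map.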
6. The method according to claim 3, wherein obtaining the target feature map of the image based on the fusion map corresponding to the 2nd original feature map and the 1st original feature map comprises:
performing up-sampling on the fusion map corresponding to the 2nd original feature map to obtain a reference map corresponding to the 2nd original feature map;
and performing information integration processing on the reference map corresponding to the 2nd original feature map and the 1st original feature map to obtain the target feature map.
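Claim 6 finishes the pyramid with an up-sampling of the 2nd-level fusion map and a final information integration with the 1st original feature map. A minimal sketch, assuming equal channel counts and element-wise addition as the integration operator (the claim leaves the operator unspecified):

```python
import torch.nn.functional as F

def target_feature_map(fused_2, orig_1):
    """Illustrative final step of claim 6 (integration operator is an assumption)."""
    # Reference map: 2nd-level fusion map up-sampled to the 1st level's resolution.
    ref_2 = F.interpolate(fused_2, size=orig_1.shape[-2:], mode="nearest")
    # Assumed information integration: element-wise addition.
    return ref_2 + orig_1
```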
7. The method according to claim 2, wherein determining the fusion map corresponding to the nth original feature map based on the spatial attention feature map and the channel attention feature map corresponding to the nth original feature map comprises:
performing information integration processing on the spatial attention feature map corresponding to the nth original feature map and the channel attention feature map corresponding to the nth original feature map to obtain the fusion map corresponding to the nth original feature map.
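Claim 7 forms the deepest fusion map by integrating the nth spatial and channel attention maps. As before, the integration operator is not specified in the claim; the sketch below simply assumes an element-wise sum of two same-shaped tensors.

```python
def fuse_nth_level(spatial_n, chan_n):
    """Illustrative claim-7 integration (assumed to be element-wise addition)."""
    return spatial_n + chan_n
```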
8. An apparatus for extracting features of an image, the apparatus comprising:
an acquisition module, configured to acquire an image to be processed;
a convolution module, configured to perform convolution processing on the image by using n convolution structures in a preset convolutional neural network model to obtain n original feature maps, wherein n is an integer greater than 4;
a first processing module, configured to perform feature extraction on each of the 2nd to nth original feature maps by using a preset channel attention mechanism model to obtain a channel attention feature map corresponding to each of the 2nd to nth original feature maps;
a second processing module, configured to perform feature extraction on the nth original feature map by using a preset spatial attention mechanism model to obtain a spatial attention feature map corresponding to the nth original feature map;
a determining module, configured to perform layer-by-layer feature fusion based on the channel attention feature maps corresponding to the 2nd to nth original feature maps, the spatial attention feature map corresponding to the nth original feature map, and the 1st original feature map to obtain a target feature map of the image;
and a detection module, configured to perform target detection in an autonomous driving scene based on the target feature map.
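The convolution module in the apparatus produces n original feature maps with n greater than 4. A hedged backbone sketch follows; the choice of n = 5, the stride-2 stages, and all channel widths are assumptions rather than details from the patent.

```python
import torch
import torch.nn as nn

class SimpleBackbone(nn.Module):
    """Illustrative stand-in for the 'n convolution structures' (here n = 5 > 4)."""
    def __init__(self, in_channels=3, width=64, n=5):
        super().__init__()
        stages, ch = [], in_channels
        for _ in range(n):
            stages.append(nn.Sequential(
                nn.Conv2d(ch, width, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True),
            ))
            ch = width
        self.stages = nn.ModuleList(stages)

    def forward(self, image):
        feats, x = [], image
        for stage in self.stages:
            x = stage(x)
            feats.append(x)        # the 1st ... nth original feature maps
        return feats

# Example: five original feature maps from a 3x256x256 image to be processed.
original_maps = SimpleBackbone()(torch.randn(1, 3, 256, 256))
```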
9. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the image feature extraction method according to any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the image feature extraction method according to any one of claims 1 to 7.
CN202211219646.XA 2022-09-30 2022-09-30 Image feature extraction method, device, equipment and storage medium Pending CN115620017A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211219646.XA CN115620017A (en) 2022-09-30 2022-09-30 Image feature extraction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211219646.XA CN115620017A (en) 2022-09-30 2022-09-30 Image feature extraction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115620017A true CN115620017A (en) 2023-01-17

Family

ID=84860149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211219646.XA Pending CN115620017A (en) 2022-09-30 2022-09-30 Image feature extraction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115620017A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116704206A (en) * 2023-06-12 2023-09-05 中电金信软件有限公司 Image processing method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
JP6902611B2 (en) Object detection methods, neural network training methods, equipment and electronics
CN111402130B (en) Data processing method and data processing device
CN107730514B (en) Scene segmentation network training method and device, computing equipment and storage medium
CN110910422A (en) Target tracking method and device, electronic equipment and readable storage medium
CN110176024B (en) Method, device, equipment and storage medium for detecting target in video
CN110796162A (en) Image recognition method, image recognition model training method, image recognition device, image recognition training device and storage medium
CN112990219B (en) Method and device for image semantic segmentation
CN112749726B (en) Training method and device for target detection model, computer equipment and storage medium
CN113496150B (en) Dense target detection method and device, storage medium and computer equipment
CN113538281B (en) Image denoising method, image denoising device, computer equipment and storage medium
CN115035295B (en) Remote sensing image semantic segmentation method based on shared convolution kernel and boundary loss function
CN111292377B (en) Target detection method, device, computer equipment and storage medium
CN111709415B (en) Target detection method, device, computer equipment and storage medium
CN112085056A (en) Target detection model generation method, device, equipment and storage medium
CN110782430A (en) Small target detection method and device, electronic equipment and storage medium
CN112241646A (en) Lane line recognition method and device, computer equipment and storage medium
CN113066089A (en) Real-time image semantic segmentation network based on attention guide mechanism
CN111192279B (en) Object segmentation method based on edge detection, electronic terminal and storage medium
CN115620017A (en) Image feature extraction method, device, equipment and storage medium
CN111639523B (en) Target detection method, device, computer equipment and storage medium
CN117058235A (en) Visual positioning method crossing various indoor scenes
CN116258873A (en) Position information determining method, training method and device of object recognition model
CN115690845A (en) Motion trail prediction method and device
CN113420604B (en) Multi-person posture estimation method and device and electronic equipment
CN114882247A (en) Image processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination