CN112990299A - Depth map acquisition method based on multi-scale features, electronic device and storage medium - Google Patents
Depth map acquisition method based on multi-scale features, electronic device and storage medium
- Publication number
- CN112990299A CN112990299A CN202110265024.XA CN202110265024A CN112990299A CN 112990299 A CN112990299 A CN 112990299A CN 202110265024 A CN202110265024 A CN 202110265024A CN 112990299 A CN112990299 A CN 112990299A
- Authority
- CN
- China
- Prior art keywords
- scale
- feature
- map
- feature map
- block
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Processing (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
The invention provides a depth map acquisition method based on multi-scale features, an electronic device and a storage medium, wherein the method comprises the following steps: acquiring an input image, and obtaining a multi-scale feature block according to the input image; obtaining a pooling feature map according to the multi-scale feature block, and obtaining a channel attention block according to the pooling feature map, wherein the channel attention block represents the relation among a plurality of features; obtaining an original fusion feature map according to the multi-scale feature block, and obtaining a target fusion feature map according to the original fusion feature map and the channel attention block; and splicing the target fusion feature map and the original fusion feature map and performing a decoding operation to obtain a target depth map. According to the scheme provided by the embodiment of the invention, the relation among the multi-scale features can be strengthened through the channel attention block, so that the multi-scale features express the object information better, and the definition of the object information in the target depth map is effectively improved.
Description
Technical Field
The present invention relates to, but not limited to, the field of image processing, and in particular, to a method, an apparatus, and a storage medium for obtaining a depth map based on multi-scale features.
Background
With the development of image processing technology, depth maps are used more and more widely. A depth map can be obtained by processing an image captured by a terminal, and input images from multiple angles are generally required in order to improve the definition of the depth map and the accuracy of the object structure. However, multi-angle input images require the terminal to be equipped with a plurality of cameras, which raises the hardware cost. For a monocular camera, although some deep learning algorithms can obtain a depth map from its input image, the viewing angle is limited because there is only one camera, and conventional image processing methods usually lose a large amount of object information, so the resulting depth map is blurred and the object structure is unclear.
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
The embodiment of the invention provides a depth map acquisition method, depth map acquisition equipment and a storage medium based on multi-scale features, which can improve the definition of object information of a depth map.
In a first aspect, an embodiment of the present invention provides a depth map obtaining method based on multi-scale features, including:
acquiring an input image, and acquiring a multi-scale feature block according to the input image;
obtaining a pooling feature map according to the multi-scale feature block, and obtaining a channel attention block according to the pooling feature map, wherein the channel attention block represents the relation among a plurality of features;
obtaining an original fusion feature map according to the multi-scale feature block, and obtaining a target fusion feature map according to the original fusion feature map and the channel attention block;
and decoding the target fusion feature map to obtain a target depth map.
The embodiment of the invention comprises the following steps: acquiring an input image, and acquiring a multi-scale feature block according to the input image; obtaining a pooling feature map according to the multi-scale feature block, and obtaining a channel attention block according to the pooling feature map, wherein the channel attention block represents the relation among a plurality of features; obtaining an original fusion feature map according to the multi-scale feature block, and obtaining a target fusion feature map according to the original fusion feature map and the channel attention block; and decoding the target fusion feature map to obtain a target depth map. According to the scheme provided by the embodiment of the invention, the relation among the multi-scale features can be enhanced through the channel attention block, so that the multi-scale features can express the object information better, and the definition of the object information of the target depth map is effectively improved.
As a further improvement of the present invention, the obtaining a multi-scale feature block according to the input image includes:
obtaining initial characteristics according to the input image;
acquiring a preset multi-scale feature fusion network, and performing feature aggregation on the initial features through the multi-scale feature fusion network to obtain a plurality of aggregation features of different scales;
stitching a plurality of the aggregated features to obtain the multi-scale feature block.
As a further improvement of the present invention, before the stitching a plurality of the aggregated features to obtain the multi-scale feature block, the method further comprises:
compressing a plurality of the aggregated features to the same number of channels.
As a further improvement of the present invention, the obtaining a pooling feature map according to the multi-scale feature block and obtaining a channel attention block according to the pooling feature map includes:
performing global pooling on the multi-scale feature blocks to obtain a pooled feature map;
and sequentially performing a compression operation and an activation operation on the pooled feature map to obtain the channel attention block.
As a further improvement of the present invention, before obtaining the target fused feature map according to the original fused feature map and the channel attention block, the method further comprises:
compressing the original fused feature map such that the number of channels of the original fused feature map is the same as the number of channels of the pooled feature map.
As a further improvement of the present invention, the decoding operation performed on the target fusion feature map to obtain a target depth map includes:
performing channel connection according to the target fusion feature map and the original fusion feature map to obtain a reference scale feature block;
compressing the reference scale feature block at least twice to obtain a plurality of initial depth maps with different scales;
and summing pixels of the plurality of initial depth maps to obtain the target depth map.
As a further improvement of the present invention, the pixel summing a plurality of the initial depth maps comprises:
acquiring a preset self-adaptive weight, wherein the self-adaptive weight corresponds to each scale;
performing pixel summation according to the adaptive weights and the plurality of initial depth maps, wherein the formula of the pixel summation is: D = Σ_{k∈l} w_k × d_k, where d_k is the initial depth map of the k-th scale, w_k is the adaptive weight of the k-th scale, and D is the target depth map.
As a further improvement of the invention, the sum of all the adaptive weights is 1.
In a second aspect, an embodiment of the present invention further provides an apparatus, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the method for acquiring a depth map based on multi-scale features according to the first aspect.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification; they illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention and not to limit the invention.
Fig. 1 is a flowchart of a depth map acquisition method based on multi-scale features according to an embodiment of the present invention;
FIG. 2 is a flow chart for obtaining a multi-scale feature block according to another embodiment of the present invention;
FIG. 3 is a block diagram of a multi-scale fusion feature network according to another embodiment of the present invention;
FIG. 4 is a block diagram of a channel attention block provided in accordance with another embodiment of the present invention;
FIG. 5 is a flow diagram of a compression aggregation feature provided by another embodiment of the present invention;
FIG. 6 is a flow chart of a get channel attention block provided by another embodiment of the present invention;
FIG. 7 is a flow diagram of compressing a raw fused feature map as provided by another embodiment of the present invention;
FIG. 8 is a flow chart for obtaining a target depth map according to another embodiment of the present invention;
fig. 9 is a diagram of a network architecture for decoding provided by another embodiment of the present invention;
FIG. 10 is a flow chart of pixel summing provided by another embodiment of the present invention;
FIG. 11 is a diagram of an overall network architecture provided by another embodiment of the present invention;
fig. 12 is a device diagram of an apparatus provided by another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms "first," "second," and the like in the description, in the claims, or in the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The embodiments of the present invention will be further explained with reference to the drawings.
As shown in fig. 1, fig. 1 is a depth map obtaining method based on multi-scale features according to an embodiment of the present invention, including:
step S110, acquiring an input image, and obtaining a multi-scale feature block according to the input image;
step S120, a pooling feature map is obtained according to the multi-scale feature block, and a channel attention block is obtained according to the pooling feature map, wherein the channel attention block represents the relation among a plurality of features;
step S130, obtaining an original fusion feature map according to the multi-scale feature block, and obtaining a target fusion feature map according to the original fusion feature map and the channel attention block;
and step S140, decoding the target fusion feature map to obtain a target depth map.
It should be noted that the input image may be a photograph or an image containing information of any object, and may be acquired by a monocular camera; this embodiment does not limit the specific content of the image. It can be understood that a prior-art method may be adopted to obtain the multi-scale feature block from the input image, and this embodiment does not limit the specific multi-scale feature acquisition method, as long as a feature block composed of features of multiple scales is obtained.
It should be noted that obtaining the pooled feature map through pooling compresses the features, reduces complexity, and suppresses features other than object information, so that the obtained channel attention block represents the relation among a plurality of features, thereby strengthening the relation among the features during fusion. For example, the Dense Feature Fusion Network (DFFN) shown in fig. 3 may be adopted; the structure shown in fig. 3 is only an example and does not limit the technical solution of the present application.
It can be understood that, for an image acquired by a monocular camera, strengthening the relation between the features makes the object information in the decoded target depth map more coherent, effectively improves the definition of the depth map, and yields more accurate object information.
It is to be understood that the target fused feature map may be obtained by performing a dot product on the original fused feature map and the channel attention block, or may be obtained by performing other methods that can achieve the same effect, which is not limited herein.
It should be noted that any type of network may be used for decoding the target fusion feature map; this embodiment preferably uses a multi-scale depth map fusion module (DAFM), a structure of which is shown in fig. 9. The structure shown in fig. 9 is merely an example and does not limit the technical solution of the present application. The multi-scale depth map fusion module can simultaneously decode and adaptively fuse features of multiple scales, and can effectively improve decoding efficiency when there are many parameters.
In addition, referring to fig. 2, in an embodiment, step S110 in the embodiment shown in fig. 2 further includes, but is not limited to, the following steps:
step S210, obtaining initial characteristics according to an input image;
step S220, acquiring a preset multi-scale feature fusion network, and performing feature aggregation on the initial features through the multi-scale feature fusion network to obtain a plurality of aggregation features with different scales;
and step S230, splicing the multiple aggregation features to obtain a multi-scale feature block.
It should be noted that, with the DFFN shown in fig. 3, five similar sub-modules are used to upsample and downsample the initial features of the input image, after which feature aggregation is performed to obtain the aggregated features. The aggregation operation can effectively reduce information loss in the feature extraction process, so that the obtained target depth map reflects more object information.
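The up/down-sampling and aggregation described above can be sketched as follows. This is an illustrative NumPy sketch under simplified assumptions, not the patented DFFN: average-pool downsampling, nearest-neighbour upsampling, and element-wise summation stand in for the learned sub-modules, and all function names are hypothetical.

```python
import numpy as np

def downsample2x(x):
    """Average-pool a (C, H, W) feature map by a factor of 2."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2x(x):
    """Nearest-neighbour upsample a (C, H, W) feature map by a factor of 2."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def aggregate(features):
    """Resample every scale to the finest resolution and sum element-wise."""
    target_h = features[0].shape[1]
    out = np.zeros_like(features[0])
    for f in features:
        while f.shape[1] < target_h:
            f = upsample2x(f)
        out += f
    return out

# three scales of the same 4-channel feature
f0 = np.random.rand(4, 16, 16)
f1 = downsample2x(f0)        # (4, 8, 8)
f2 = downsample2x(f1)        # (4, 4, 4)
agg = aggregate([f0, f1, f2])
print(agg.shape)             # (4, 16, 16)
```

Summing resampled copies of every scale is one simple way to keep coarse-scale context in the fine-scale map; the patented network aggregates with learned convolutions instead.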
It is noted that, because the connection among the features gradually fades after repeated convolution, upsampling, and downsampling, the channel attention block can further strengthen the connection among the features and thus guarantee the definition of the target depth map.
In addition, referring to fig. 5, in an embodiment, before performing step S230 in the embodiment shown in fig. 2, the following steps are further included, but not limited to:
step S510, compress the multiple aggregated features to the same number of channels.
It should be noted that, since the aggregation features need to be spliced, in order to ensure the accuracy of splicing, the aggregation features need to be compressed before splicing to ensure that the number of channels is the same.
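The channel compression before splicing amounts to a 1 × 1 convolution, i.e. a per-pixel linear map over channels. A minimal NumPy sketch follows; the channel counts and random weights are hypothetical, chosen only to show shapes.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution: a per-pixel linear map over channels.
    x: (C_in, H, W), weight: (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum('oc,chw->ohw', weight, x)

rng = np.random.default_rng(0)
# aggregated features with different channel counts
feats = [rng.standard_normal((c, 8, 8)) for c in (16, 32, 64)]
# compress each to the same number of channels (8 here) before splicing
weights = [rng.standard_normal((8, c)) * 0.1 for c in (16, 32, 64)]
compressed = [conv1x1(f, w) for f, w in zip(feats, weights)]
block = np.concatenate(compressed, axis=0)  # channel-wise splicing
print(block.shape)  # (24, 8, 8)
```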
In addition, referring to fig. 6, in an embodiment, step S120 in the embodiment shown in fig. 1 further includes, but is not limited to, the following steps:
step S610, performing global pooling on the multi-scale feature blocks to obtain a pooled feature map;
and step S620, sequentially performing compression operation and activation operation on the pooled feature map to obtain a channel attention block.
It should be noted that, after the multi-scale feature block is obtained from the concatenation in the DFFN, the compression operation and the activation operation can strengthen valuable features and suppress useless information. The specific compression and activation operations can refer to the following example:
referring to fig. 4, using a global average pooling layer to pool the multi-scale feature blocks into a 1 × 1 × C pooled feature map, compressing the pooled feature map by a 1 × 1 convolution, and then activating using a ReLU function, completing the first compression and activation; the pooled feature map is convolved again with a 1 × 1 convolution and activated using Sigmoid function without compression, resulting in the channel attention block shown in fig. 4.
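The pool-compress-activate sequence of fig. 4 is a squeeze-and-excitation style computation. The sketch below is illustrative only: the weight shapes and the reduction ratio (16 → 4 → 16) are assumptions, not values from the patent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """Channel attention over a (C, H, W) feature block.
    w1: (C//r, C) compresses channels; w2: (C, C//r) restores them."""
    pooled = x.mean(axis=(1, 2))           # global average pool -> (C,)
    squeezed = np.maximum(w1 @ pooled, 0)  # 1x1 conv + ReLU (first compression/activation)
    attn = sigmoid(w2 @ squeezed)          # 1x1 conv + Sigmoid -> per-channel weight in (0, 1)
    return attn

rng = np.random.default_rng(1)
x = rng.standard_normal((16, 8, 8))
w1 = rng.standard_normal((4, 16)) * 0.1
w2 = rng.standard_normal((16, 4)) * 0.1
attn = channel_attention(x, w1, w2)
print(attn.shape)  # (16,)
```

The Sigmoid bounds every channel weight in (0, 1), so the attention block acts as a soft per-channel gate when it later multiplies the fused feature map.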
In addition, referring to fig. 7, in an embodiment, step S130 in the embodiment shown in fig. 1 further includes, but is not limited to, the following steps:
step S710, compress the original fused feature map to make the number of channels of the original fused feature map the same as the number of channels of the pooled feature map.
It should be noted that the channel attention block is obtained from the pooled feature map, so its number of channels is the same as that of the pooled feature map. Since the original fused feature map and the channel attention block need to be point-multiplied to obtain the target fused feature map, compression is used to ensure that the number of channels of the original fused feature map is the same as that of the pooled feature map.
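Once the channel counts match, the point multiplication reduces to a broadcast multiply of one scalar weight per channel over all spatial positions. A minimal sketch with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(7)
fused = rng.random((16, 8, 8))  # original fused feature map, compressed to 16 channels
attn = rng.random(16)           # channel attention weights in (0, 1)

# broadcast each channel's attention weight over its spatial positions
target = fused * attn[:, None, None]
print(target.shape)  # (16, 8, 8)
```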
In addition, referring to fig. 8, in an embodiment, step S140 in the embodiment shown in fig. 1 further includes, but is not limited to, the following steps:
step S810, channel connection is carried out according to the target fusion feature map and the original fusion feature map to obtain a reference scale feature block;
step S820, compressing the reference scale feature block at least twice to obtain a plurality of initial depth maps with different scales;
step S830, performing pixel summation on the multiple initial depth maps to obtain a target depth map.
It should be noted that, as shown in the network structure diagram of fig. 11, after the original fusion feature map is obtained from the input image, the target fusion feature map may be obtained through the DFFN and the channel attention block; the target fusion feature map and the original fusion feature map are then input to the DAFM for channel connection to obtain the reference scale feature block.
It can be understood that, in order to improve the expression capability of the network, a step-by-step compression method may be used, and two convolutions of 3 × 3 are used to compress the reference scale feature block, so as to obtain the initial depth maps of multiple scales.
In addition, referring to fig. 10, in an embodiment, step S830 in the embodiment shown in fig. 8 further includes, but is not limited to, the following steps:
step S1010, acquiring preset self-adaptive weight, wherein the self-adaptive weight corresponds to each scale;
step S1020, performing pixel summation according to the adaptive weights and the plurality of initial depth maps, where the formula of the pixel summation is: D = Σ_{k∈l} w_k × d_k, where d_k is the initial depth map of the k-th scale, w_k is the adaptive weight of the k-th scale, and D is the target depth map.
It should be noted that an adaptive weight may be set for the initial depth map through a convolution of 1 × 1, and the adaptive weight may be learned through an existing standard back propagation mechanism, and may embody a characteristic weight, which is not limited in this embodiment. And finally, summing all the initial depth maps at the pixel level to obtain a final target depth map.
In the target depth map, the depth value Y_{i,j} of the (i, j)-th pixel satisfies the following formula: Y_{i,j} = α_{i,j}·Y^1_{i,j} + β_{i,j}·Y^2_{i,j} + γ_{i,j}·Y^3_{i,j} + δ_{i,j}·Y^4_{i,j} + ε_{i,j}·Y^5_{i,j}, where Y^k_{i,j} is the depth value at (i, j) of the initial depth map at the k-th scale, and α_{i,j}, β_{i,j}, γ_{i,j}, δ_{i,j}, ε_{i,j} are the adaptive weights corresponding to each scale.
In addition, in one embodiment, the sum of all the adaptive weights is 1.
It should be noted that, in order to avoid overfitting, the adaptive weights may be mathematically constrained such that the sum of all adaptive weights is 1. For example, the adaptive weights in the above embodiment satisfy the following relation: α_{i,j} + β_{i,j} + γ_{i,j} + δ_{i,j} + ε_{i,j} = 1.
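The adaptive fusion above is therefore a convex, pixel-wise combination of the per-scale depth maps. The following is an illustrative NumPy sketch, not the patented DAFM: five scalar weights stand in for the learned per-pixel weights, and they are normalised to enforce the sum-to-1 constraint.

```python
import numpy as np

rng = np.random.default_rng(42)
depth_maps = [rng.random((16, 16)) for _ in range(5)]  # one initial depth map per scale

raw_w = rng.random(5)
weights = raw_w / raw_w.sum()  # constrain the adaptive weights to sum to 1

# D = sum_k w_k * d_k, applied pixel-wise
target_depth = sum(w * d for w, d in zip(weights, depth_maps))
print(target_depth.shape)  # (16, 16)
```

Because the weights are non-negative and sum to 1, every pixel of the fused map stays inside the range spanned by the per-scale predictions, which is what keeps the adaptive fusion stable.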
Additionally, referring to fig. 12, an embodiment of the present invention also provides an apparatus 1200 including: memory 1210, processor 1220, and computer programs stored on memory 1210 and operable on processor 1220.
The processor 1220 and the memory 1210 may be connected by a bus or other means.
Non-transitory software programs and instructions required to implement the multi-scale feature-based depth map acquisition method of the above-described embodiment are stored in the memory 1210, and when executed by the processor 1220, perform the multi-scale feature-based depth map acquisition method of the above-described embodiment, for example, the method steps S110 to S140 in fig. 1, the method steps S210 to S230 in fig. 2, the method step S510 in fig. 5, the method steps S610 to S620 in fig. 6, the method step S710 in fig. 7, the method steps S810 to S830 in fig. 8, and the method steps S1010 to S1020 in fig. 10, which are described above, are performed.
The above-described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Furthermore, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores computer-executable instructions which, when executed by a processor or a controller, for example by a processor in the above-mentioned apparatus embodiment, enable the processor to perform the depth map acquisition method based on multi-scale features in the above-mentioned embodiment, for example the method steps S110 to S140 in fig. 1, the method steps S210 to S230 in fig. 2, the method step S510 in fig. 5, the method steps S610 to S620 in fig. 6, the method step S710 in fig. 7, the method steps S810 to S830 in fig. 8, and the method steps S1010 to S1020 in fig. 10 described above.

One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art.
Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media as known to those skilled in the art.
While the preferred embodiments of the present invention have been described in detail, it will be understood by those skilled in the art that the foregoing and various other changes, omissions and deviations in the form and detail thereof may be made without departing from the scope of this invention.
Claims (10)
1. A depth map acquisition method based on multi-scale features is characterized by comprising the following steps:
acquiring an input image, and acquiring a multi-scale feature block according to the input image;
obtaining a pooling feature map according to the multi-scale feature block, and obtaining a channel attention block according to the pooling feature map, wherein the channel attention block represents the relation among a plurality of features;
obtaining an original fusion feature map according to the multi-scale feature block, and obtaining a target fusion feature map according to the original fusion feature map and the channel attention block;
and decoding the target fusion feature map to obtain a target depth map.
2. The method of claim 1, wherein obtaining a multi-scale feature block from the input image comprises:
obtaining initial characteristics according to the input image;
acquiring a preset multi-scale feature fusion network, and performing feature aggregation on the initial features through the multi-scale feature fusion network to obtain a plurality of aggregation features of different scales;
stitching a plurality of the aggregated features to obtain the multi-scale feature block.
3. The method of claim 2, wherein prior to stitching the plurality of aggregated features to obtain the multi-scale feature block, the method further comprises:
compressing a plurality of the aggregated features to the same number of channels.
4. The method of claim 1, wherein obtaining a pooled feature map from the multi-scale feature block and a channel attention block from the pooled feature map comprises:
performing global pooling on the multi-scale feature blocks to obtain a pooled feature map;
and sequentially performing a compression operation and an activation operation on the pooled feature map to obtain the channel attention block.
5. The method of claim 1, wherein prior to said deriving a target fused feature map from said original fused feature map and said channel attention block, said method further comprises:
compressing the original fused feature map such that the number of channels of the original fused feature map is the same as the number of channels of the pooled feature map.
6. The method of claim 1, wherein the decoding the target fused feature map to obtain a target depth map comprises:
performing channel connection according to the target fusion feature map and the original fusion feature map to obtain a reference scale feature block;
compressing the reference scale feature block at least twice to obtain a plurality of initial depth maps with different scales;
and summing pixels of the plurality of initial depth maps to obtain the target depth map.
7. The method of claim 6, wherein said pixel summing a plurality of said initial depth maps comprises:
acquiring a preset self-adaptive weight, wherein the self-adaptive weight corresponds to each scale;
performing pixel summation according to the adaptive weight and the plurality of initial depth maps, wherein the formula of the pixel summation is: D = Σ_{k∈l} w_k × d_k, where d_k is the initial depth map of the k-th scale, w_k is the adaptive weight of the k-th scale, and D is the target depth map.
8. The method of claim 7, wherein: the sum of all the adaptive weights is 1.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method for obtaining a depth map based on multi-scale features according to any one of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium storing computer-executable instructions for performing the multi-scale feature based depth map acquisition method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110265024.XA CN112990299B (en) | 2021-03-11 | 2021-03-11 | Depth map acquisition method based on multi-scale features, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110265024.XA CN112990299B (en) | 2021-03-11 | 2021-03-11 | Depth map acquisition method based on multi-scale features, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112990299A true CN112990299A (en) | 2021-06-18 |
CN112990299B CN112990299B (en) | 2023-10-17 |
Family
ID=76335020
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110265024.XA Active CN112990299B (en) | 2021-03-11 | 2021-03-11 | Depth map acquisition method based on multi-scale features, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112990299B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114693970A (en) * | 2022-03-28 | 2022-07-01 | 北京百度网讯科技有限公司 | Object classification method, deep learning model training method, device and equipment |
WO2024000728A1 (en) * | 2022-06-28 | 2024-01-04 | 五邑大学 | Monocular three-dimensional plane recovery method, device, and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110188685A (en) * | 2019-05-30 | 2019-08-30 | 燕山大学 | Object counting method and system based on a dual-attention multi-scale cascade network |
CN111738110A (en) * | 2020-06-10 | 2020-10-02 | 杭州电子科技大学 | Remote sensing image vehicle target detection method based on multi-scale attention mechanism |
CN112287940A (en) * | 2020-10-30 | 2021-01-29 | 西安工程大学 | Semantic segmentation method of attention mechanism based on deep learning |
CN112396645A (en) * | 2020-11-06 | 2021-02-23 | 华中科技大学 | Monocular image depth estimation method and system based on convolution residual learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11222211B2 (en) | Method and apparatus for segmenting video object, electronic device, and storage medium | |
US20190020871A1 (en) | Visual quality preserving quantization parameter prediction with deep neural network | |
WO2020098422A1 (en) | Encoded pattern processing method and device , storage medium and electronic device | |
CN108764039B (en) | Neural network, building extraction method of remote sensing image, medium and computing equipment | |
CN112990299A (en) | Depth map acquisition method based on multi-scale features, electronic device and storage medium | |
US20220392019A1 (en) | Method for denoising videos and electronic device therefor | |
CN113781320A (en) | Image processing method and device, terminal equipment and storage medium | |
CN109035257A (en) | portrait dividing method, device and equipment | |
CN111353965B (en) | Image restoration method, device, terminal and storage medium | |
CN115984666A (en) | Cross-channel pyramid pooling method and system, convolutional neural network and processing method | |
WO2022126333A1 (en) | Image filling method and apparatus, decoding method and apparatus, electronic device, and medium | |
CN111815529A (en) | Low-quality image classification enhancement method based on model fusion and data enhancement | |
CN115209150B (en) | Video coding parameter acquisition method and device and electronic equipment | |
CN113727050B (en) | Video super-resolution processing method and device for mobile equipment and storage medium | |
WO2019236347A1 (en) | Prediction for light-field coding and decoding | |
CN114584781A (en) | Video compression method and device and computer readable storage medium | |
CN114612316A (en) | Method and device for removing rain from nuclear prediction network image | |
CN111382753B (en) | Light field semantic segmentation method, system, electronic terminal and storage medium | |
CN114078096A (en) | Image deblurring method, device and equipment | |
CN114299105A (en) | Image processing method, image processing device, computer equipment and storage medium | |
CN114638748A (en) | Image processing method, image restoration method, computer device, and storage medium | |
CN112669240A (en) | High-definition image restoration method and device, electronic equipment and storage medium | |
CN110475044A (en) | Image transfer method and device, electronic equipment, computer readable storage medium | |
CN116962689A (en) | Method and device for dividing depth map coding units in three-dimensional video frame | |
CN114782256B (en) | Image reconstruction method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||