CN117058380B

CN117058380B - Multi-scale lightweight three-dimensional point cloud segmentation method and device based on self-attention

Info

Publication number: CN117058380B
Application number: CN202311022399.9A
Authority: CN
Inventors: 张新钰; 谢涛; 王力; 李效宇; 刘德东; 郭世纯; 李志伟
Original assignee: Beijing Xuetuling Education Technology Co ltd
Current assignee: Beijing Xuetuling Education Technology Co ltd
Priority date: 2023-08-15
Filing date: 2023-08-15
Publication date: 2024-03-26
Anticipated expiration: 2043-08-15
Also published as: CN117058380A

Abstract

The application provides a multi-scale lightweight three-dimensional point cloud segmentation method and device based on self-attention, which relate to the technical field of automatic driving and comprise the following steps: processing a two-dimensional image by using a multi-scale cavity convolution model which is trained in advance to obtain a first feature map; processing the two-dimensional image by using a pre-trained width dimension downsampling model to obtain a second feature map, a third feature map and a fourth feature map; processing the two-dimensional image, the second feature map, the third feature map and the fourth feature map by utilizing the pre-trained spatial attention model to obtain a fifth feature map; processing the fourth feature map and the fifth feature map by utilizing a pre-trained width dimension up-sampling model to obtain a sixth feature map; and processing the sixth feature map by using the channel attention model which is trained in advance to obtain a point cloud segmentation result. The method and the device can simultaneously extract the remarkable characteristics of the large target and the small target, and have small calculated amount.

Description

Multi-scale lightweight three-dimensional point cloud segmentation method and device based on self-attention

Technical Field

The application relates to the technical field of automatic driving, in particular to a multi-scale lightweight three-dimensional point cloud segmentation method and device based on self-attention.

Background

At present, two semantic segmentation methods of point cloud data exist, namely, the first method is to directly process the point cloud data and directly transmit the point cloud data into a neural network for learning through a PointNet framework; the second method is to voxel the point cloud data, and because the point cloud data are sparse and huge, the two methods require huge calculation cost and are not suitable for real-time application.

In addition, the 3D point cloud data can be converted into 2D image data through a spherical surface in the prior art, and then the characteristic of the target is extracted by adopting efficient convolution and deconvolution operation, which achieves remarkable performance on large-size objects (such as automobiles), however, the performance on small-size objects (such as pedestrians) is poor, because the method cannot extract the remarkable characteristics of the large objects and the small objects at the same time.

Disclosure of Invention

In view of the above, the present application provides a multi-scale lightweight three-dimensional point cloud segmentation method and apparatus based on self-attention, so as to solve the above technical problems.

In a first aspect, an embodiment of the present application provides a multi-scale lightweight three-dimensional point cloud segmentation method based on self-attention, including:

converting the original three-dimensional point cloud data into a two-dimensional image through spherical transformation;

processing a two-dimensional image by using a multi-scale cavity convolution model which is trained in advance to obtain a first feature map;

processing the two-dimensional image by using a pre-trained width dimension downsampling model to obtain a second feature map, a third feature map and a fourth feature map;

processing the two-dimensional image, the second feature map, the third feature map and the fourth feature map by utilizing the pre-trained spatial attention model to obtain a fifth feature map;

processing the fourth feature map and the fifth feature map by utilizing a pre-trained width dimension up-sampling model to obtain a sixth feature map;

and processing the sixth feature map by using the channel attention model which is trained in advance to obtain a point cloud segmentation result.

Further, converting the original three-dimensional point cloud data into a two-dimensional image through spherical transformation; comprising the following steps:

acquiring three-dimensional coordinates (x, y, z) of each point in the three-dimensional point cloud data;

according to a spherical transformation formula, calculating zenith angle alpha and azimuth angle beta of each point:

calculating the line pixels of each point on the two-dimensional image according to the zenith angle alpha and the azimuth angle beta of the pointAnd column pixels->

Wherein Δα and Δβ represent the row resolution and column resolution of the discretized point cloud;

thereby obtaining a two-dimensional image X with the size of H×W×C _input Where H, W and C represent the height, width and number of channels, respectively, of the two-dimensional image.

Further, the multi-scale hole convolution model includes: a first convolution layer of a 3 x 3 convolution kernel, a parallel multi-channel hole convolution unit, and a global average pooling layer, and a first adder; the multi-channel hole convolution unit comprises four parallel first hole convolution branches, a second hole convolution branch, a third hole convolution branch and a fourth hole convolution branch and a splicing unit; the first cavity convolution branch comprises a second convolution layer with a convolution kernel size of 1 multiplied by 1 and a first cavity convolution layer with a convolution kernel size of 3 multiplied by 3, wherein rate=1 which are connected; the second hole convolution branch comprises a first 3×3 average pooling layer and a second hole convolution layer with a convolution kernel size of 3×3 and rate=12 which are connected; the third hole convolution branch comprises a second average pooling layer of 5×5 and a third hole convolution layer with a convolution kernel size of 3×3 and rate=24 which are connected; the fourth hole convolution branch comprises a 7×7 third average pooling layer and a fourth hole convolution layer with a convolution kernel size of 3×3 and rate=36 which are connected;

processing a two-dimensional image by using a multi-scale cavity convolution model which is trained in advance to obtain a first feature map; comprising the following steps:

two-dimensional image X using a first convolution layer _input Processing to obtain a characteristic diagram X with the size of H multiplied by W multiplied by C;

processing the feature image X by using a first cavity convolution branch to obtain a feature image X with the size ofIs characterized by (a)

Processing the feature map X by using a second cavity convolution branch to obtain a feature map X with the size ofIs characterized by (a)

Processing the feature map X by using a third cavity convolution branch to obtain a feature map X with the size ofIs characterized by (a)

Processing the feature image X by using a fourth cavity convolution branch to obtain a feature image X with the size ofIs characterized by (a)

Feature map alignment using stitching unitFeature map->Feature map->And feature map->Splicing in the channel dimension to obtain a feature map of H W C>

Processing the feature map X by using a global average pooling layer to obtain a feature map with the size of 1 multiplied by C, and expanding the feature map into a feature map with the size of H multiplied by W multiplied by C by a broadcasting mechanism

The first adder is used for comparing the characteristic diagram X with the characteristic diagramAnd feature map->Performing addition operation to obtain a first characteristic diagram Y with the size of H multiplied by W multiplied by C ₁ 。

Further, the width dimension downsampling model comprises a first Fire module, a second Fire module, a third convolution layer of a 1×1 convolution kernel, a third Fire module, a fourth convolution layer of the 1×1 convolution kernel, a fifth Fire module and a sixth Fire module which are sequentially connected;

processing the two-dimensional image by using a pre-trained width dimension downsampling model to obtain a second feature map, a third feature map and a fourth feature map; comprising the following steps:

first feature map Y by using first Fire module ₁ Processing the output result of the first Fire module by using the second Fire module to obtain a second characteristic diagram Y with the size of G multiplied by W multiplied by C ₂ ；

Second feature map Y of third convolutional layer pair with 1×1 convolutional kernel ₂ Processing to obtain a product with a size ofFeature map of->

Feature map by using third Fire moduleProcessing, namely processing the output result of the third Fire module by using the fourth Fire module to obtain the size of +.>Third feature map Y of (2) ₃ ；

Third feature map Y of a fourth convolutional layer pair with a 1×1 convolutional kernel ₃ Processing to obtain a product with a size ofFeature map of->

Feature map using fifth Fire moduleProcessing, namely processing the output result of the fifth Fire module by using the sixth Fire module to obtain the size of +.>Fourth feature map Y of (2) ₄ 。

Further, the spatial attention model includes: a fifth convolution layer of four parallel 1 x 1 convolution kernels, a sixth convolution layer of 1 x 1 convolution kernels, a seventh convolution layer of 1 x 1 convolution kernels and an eighth convolution layer of 1 x 1 convolution kernels, a second adder and a spatial attention module;

processing the two-dimensional image, the second feature map, the third feature map and the fourth feature map by utilizing the pre-trained spatial attention model to obtain the second feature map, the third feature map and the fourth feature map; comprising the following steps:

two-dimensional image X using a fifth convolution layer _input Processing to obtain a product with a size of Feature map Z of (2) ₁ ；

Second feature map Y using a sixth convolution layer ₂ Processing to obtain a product with a size ofFeature map Z of (2) ₂ ；

Third feature map Y with seventh convolution layer ₃ Processing to obtain a product with a size ofFeature map Z of (2) ₃ ；

Fourth feature map Y with eighth convolutional layer ₄ Processing to obtain a product with a size ofFeature map Z of (2) ₄ ；

Using a second adder to compare the characteristic diagram Z ₁ Feature map Z ₂ Feature map Z ₃ And feature map Z ₄ Performing addition operation to obtain a size ofFeature map Z of (2) ₅ ；

Feature map Z with spatial attention module ₅ Processing to obtain a product with a size ofIs a fifth feature map Z of (a).

Further, the width dimension upsampling model includes: a double up-sampling layer, a quadruple up-sampling layer and a roll-back integration branch in parallel; the deconvolution integral branch comprises a third adder, a first Fire deconvolution layer, a fourth adder, a second Fire deconvolution layer, a fifth adder, a third Fire deconvolution layer and a ninth convolution layer of a 1 multiplied by 1 convolution kernel which are sequentially connected;

processing the fourth feature map and the fifth feature map by utilizing a pre-trained width dimension up-sampling model to obtain a sixth feature map; comprising the following steps:

fourth feature map Y with third adder ₄ Performing addition operation with the fifth characteristic diagram Z to obtain a characteristic diagram Q ₁ ；

Processing the feature map Q by using a first Fire deconvolution layer to obtain a feature map Q with a size ofFeature map Q of (2) ₂ ；

Processing the fifth feature map Z by using the double up-sampling layer to obtain a size ofFeature map Q ₃ ；

Using a fourth adder to make the characteristic diagram Q ₂ And feature map Q ₃ Performing addition operation to obtain a size ofFeature map Q of (2) ₄ ；

Feature map Q using a second Fire deconvolution layer ₄ Processing to obtain a characteristic diagram Q with the size of H multiplied by W multiplied by C ₅ ；

Processing the fifth feature map Z by using a quadruple upsampling layer to obtain a feature map Q with the size of H multiplied by W multiplied by C ₆ ；

Feature map Q using fifth adder ₅ And feature map Q ₆ Performing addition operation to obtain a characteristic diagram Q with the size of H multiplied by W multiplied by C ₇ ；

Feature map Q using a third Fire deconvolution layer ₇ Processing to obtain a characteristic diagram Q with the size of H multiplied by 2W multiplied by C ₈ ；

Ninth convolution layer pair feature map Q using 1 x 1 convolution kernel ₈ The sixth feature map Q is processed to obtain a size h×w×k, where K represents the number of classes of the division object.

Further, the method further comprises: and performing joint training on the multi-scale cavity convolution model, the width dimension downsampling model, the spatial attention model, the width dimension upsampling model and the channel attention model.

In a second aspect, embodiments of the present application provide a multi-scale lightweight three-dimensional point cloud segmentation apparatus based on self-attention, including:

the preprocessing unit is used for converting the original three-dimensional point cloud data into a two-dimensional image through spherical transformation;

the first processing unit is used for processing the two-dimensional image by utilizing the multi-scale cavity convolution model which is trained in advance to obtain a first feature map;

the downsampling unit is used for processing the two-dimensional image by utilizing a width dimension downsampling model which is trained in advance to obtain a second characteristic image, a third characteristic image and a fourth characteristic image;

the second processing unit is used for processing the two-dimensional image, the second feature map, the third feature map and the fourth feature map by utilizing the pre-trained spatial attention model to obtain a fifth feature map;

the up-sampling unit is used for processing the fourth characteristic diagram and the fifth characteristic diagram by utilizing a pre-trained width dimension up-sampling model to obtain a sixth characteristic diagram;

and the point cloud segmentation unit is used for processing the sixth feature map by utilizing the channel attention model which is trained in advance to obtain a point cloud segmentation result.

In a third aspect, an embodiment of the present application provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the methods of the embodiments of the present application when executing the computer program.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing computer instructions that, when executed by a processor, implement a method of embodiments of the present application.

The method adopts the combination of a spatial attention mechanism and a channel attention mechanism to extract semantic segmentation features of targets with different sizes, and utilizes multi-scale cavity convolution to obtain context information of the whole target on a plurality of scales, so that the salient features of large objects and small objects are extracted simultaneously. In order to reduce the parameters and the calculation cost, fireModule and FireDeconv (convolution and deconvolution modules) are adopted to realize a lightweight network.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of a multi-scale lightweight three-dimensional point cloud segmentation method based on self-attention according to an embodiment of the present application;

FIG. 2 is a block diagram of a multi-scale cavity convolution model, a width dimension downsampling model, a spatial attention model, a width dimension upsampling model, and a channel attention model provided in an embodiment of the present application;

fig. 3 is a functional block diagram of a multi-scale light three-dimensional point cloud segmentation apparatus based on self-attention according to an embodiment of the present application;

fig. 4 is a functional block diagram of an electronic device according to an embodiment of the present application.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.

First, the design concept of the embodiment of the present application will be briefly described.

In autopilot, the camera can only capture the appearance information of the scene, and its spatial information cannot be estimated directly, and even a binocular depth camera, its positioning accuracy is much less than that of a lidar. Moreover, the detection based on the camera data is greatly influenced by the external environment (such as extreme weather, etc.), so that the robustness of the segmentation system cannot be ensured.

At present, two common methods for realizing semantic segmentation through point cloud data are available, the first method is to directly process the point cloud data and directly transmit the point cloud data into a neural network for learning through a Pointnet framework; the second method is to voxel the point cloud data, and because the point cloud data are sparse and huge, the two methods require huge calculation cost and are not suitable for real-time application. In addition, the 3D point cloud data can be converted into 2D image data through a spherical surface in the prior art, and then the characteristic of the target is extracted by adopting efficient convolution and deconvolution operation, which achieves remarkable performance on large-size objects (such as automobiles), however, the performance on small-size objects (such as pedestrians) is poor, because the method cannot extract the remarkable characteristics of the large objects and the small objects at the same time.

In order to solve the problems, the application provides a multi-scale lightweight point cloud segmentation method based on an attention mechanism, which adopts the combination of a spatial attention mechanism and a channel attention mechanism to extract semantic segmentation characteristics of targets with different sizes, and utilizes multi-scale cavity convolution to obtain context information of an overall target on a plurality of scales, so that the salient characteristics of a large object and a small object are extracted simultaneously. In order to reduce the parameters and the calculation cost, fireModule and FireDeconv (convolution and deconvolution modules) are adopted to realize a lightweight network.

After the application scenario and the design idea of the embodiment of the present application are introduced, the technical solution provided by the embodiment of the present application is described below.

As shown in fig. 1, an embodiment of the present application provides a multi-scale lightweight three-dimensional point cloud segmentation method based on self-attention, including:

step 101: converting the original three-dimensional point cloud data into a two-dimensional image through spherical transformation;

in order to efficiently process the point cloud data, the three-dimensional point cloud data is converted into two-dimensional picture data through spherical transformation.

Specifically, the method comprises the following steps:

calculating the line pixels of each point on the two-dimensional image according to the zenith angle alpha and the azimuth angle beta of the pointSum column pixel

Step 102: processing a two-dimensional image by using a multi-scale cavity convolution model which is trained in advance to obtain a first feature map;

as shown in fig. 2, the multi-scale hole convolution model includes: a first convolution layer of a 3 x 3 convolution kernel, a parallel multi-channel hole convolution unit, and a global average pooling layer, and a first adder; the multi-channel hole convolution unit comprises four parallel first hole convolution branches, a second hole convolution branch, a third hole convolution branch and a fourth hole convolution branch and a splicing unit; the first hole convolution branch comprises a second convolution layer with a convolution kernel size of 1×1 and a first hole convolution layer (Dilated Convolution) with a convolution kernel size of 3×3 and rate=1 which are connected; the second hole convolution branch comprises a first 3×3 average pooling layer and a second hole convolution layer with a convolution kernel size of 3×3 and rate=12 which are connected; the third hole convolution branch comprises a second average pooling layer of 5×5 and a third hole convolution layer with a convolution kernel size of 3×3 and rate=24 which are connected; the fourth hole convolution branch comprises a 7×7 third average pooling layer and a fourth hole convolution layer with a convolution kernel size of 3×3 and rate=36 which are connected;

the method specifically comprises the following steps:

inputting X obtained by preprocessing _input ∈R ^H×W×C Through the convolution layer with the convolution kernel size of 3 multiplied by 3, X epsilon R is output ^H ^×W×C Expressed as: x=conv _3×3 (X _input )；

Processing the feature map X by using a global average pooling layer to obtain a feature map with the size of 1 multiplied by C, and expanding the feature map into a feature map with the size of H multiplied by W multiplied by C by a broadcasting mechanismThe broadcasting mechanism can copy the number in one channel into H×W numbers;

Step 103: processing the two-dimensional image by using a pre-trained width dimension downsampling model to obtain a second feature map, a third feature map and a fourth feature map;

the width dimension downsampling model comprises a first Fire module, a second Fire module, a third convolution layer of a 1 multiplied by 1 convolution kernel, a third Fire module, a fourth convolution layer of the 1 multiplied by 1 convolution kernel, a fifth Fire module and a sixth Fire module which are connected in sequence;

using a first FireModule (FireModule) to map the first feature map Y ₁ Processing the output result of the first Fire module by using the second Fire module to obtain a second characteristic diagram Y with the size of H multiplied by W multiplied by C ₂ ；

Second feature map Y of third convolutional layer (lateral step size set to 2, longitudinal step size set to 1) pair with 1×1 convolutional kernel ₂ Processing to obtain a product with a size ofFeature map of->

Step 104: processing the two-dimensional image, the second feature map, the third feature map and the fourth feature map by utilizing the pre-trained spatial attention model to obtain a fifth feature map;

as shown in fig. 2, the spatial attention model includes: a fifth of four parallel 1 x 1 convolution kernels, a sixth of 1 x 1 convolution kernels, a seventh of 1 x 1 convolution kernels and an eighth of 1 x 1 convolution kernels, a second adder and a spatial attention module (Spatial Attention Modle);

in this embodiment, the steps include:

Step 105: processing the fourth feature map and the fifth feature map by utilizing a pre-trained width dimension up-sampling model to obtain a sixth feature map;

as shown in fig. 2, the width dimension up-sampling model includes: a double up-sampling layer, a quadruple up-sampling layer and a roll-back integration branch in parallel; the deconvolution integral branch comprises a third adder, a first Fire deconvolution layer (FireDeconv), a fourth adder, a second Fire deconvolution layer, a fifth adder, a third Fire deconvolution layer and a ninth convolution layer of a 1 multiplied by 1 convolution kernel which are sequentially connected;

Ninth convolution layer pair feature map Q using 1 x 1 convolution kernel ₈ Processing to obtain a product with a size of H×The sixth feature map Q of w×k indicates the number of classes of the division target.

Step 106: and processing the sixth feature map by using a pre-trained channel attention model (Channel Attention Model) to obtain a point cloud segmentation result.

Based on the foregoing embodiments, the embodiment of the present application provides a multi-scale lightweight three-dimensional point cloud segmentation apparatus based on self-attention, and referring to fig. 3, the multi-scale lightweight three-dimensional point cloud segmentation apparatus 200 based on self-attention provided in the embodiment of the present application at least includes:

a preprocessing unit 201, configured to convert original three-dimensional point cloud data into a two-dimensional image through spherical transformation;

a first processing unit 202, configured to process the two-dimensional image by using a multi-scale cavity convolution model that is trained in advance, so as to obtain a first feature map;

a downsampling unit 203, configured to process the two-dimensional image by using a pre-trained width dimension downsampling model, so as to obtain a second feature map, a third feature map and a fourth feature map;

a second processing unit 204, configured to process the two-dimensional image, the second feature map, the third feature map, and the fourth feature map by using the pre-trained spatial attention model, so as to obtain a fifth feature map;

an up-sampling unit 205, configured to process the fourth feature map and the fifth feature map by using a pre-trained width dimension up-sampling model, so as to obtain a sixth feature map;

the point cloud segmentation unit 206 is configured to process the sixth feature map by using the pre-trained channel attention model, so as to obtain a point cloud segmentation result.

It should be noted that, the principle of solving the technical problem of the multi-scale light three-dimensional point cloud segmentation apparatus 200 based on self-attention provided in the embodiment of the present application is similar to that of the method provided in the embodiment of the present application, so that the implementation of the multi-scale light three-dimensional point cloud segmentation apparatus 200 based on self-attention provided in the embodiment of the present application can be referred to the implementation of the method provided in the embodiment of the present application, and the repetition is omitted.

Based on the foregoing embodiments, the embodiment of the present application further provides an electronic device, as shown in fig. 4, where the electronic device 300 provided in the embodiment of the present application includes at least: the multi-scale light three-dimensional point cloud segmentation method based on self-attention provided by the embodiment of the application is realized when the processor 301 executes the computer program.

The electronic device 300 provided by the embodiments of the present application may also include a bus 303 that connects the different components, including the processor 301 and the memory 302. Bus 303 represents one or more of several types of bus structures, including a memory bus, a peripheral bus, a local bus, and so forth.

The Memory 302 may include readable media in the form of volatile Memory, such as random access Memory (Random Access Memory, RAM) 3021 and/or cache Memory 3022, and may further include Read Only Memory (ROM) 3023.

The memory 302 may also include a program tool 3025 having a set (at least one) of program modules 3024, the program modules 3024 including, but not limited to: an operating subsystem, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.

The electronic device 300 may also communicate with one or more external devices 304 (e.g., keyboard, remote control, etc.), one or more devices that enable a user to interact with the electronic device 300 (e.g., cell phone, computer, etc.), and/or any device that enables the electronic device 300 to communicate with one or more other electronic devices 300 (e.g., router, modem, etc.). Such communication may occur through an Input/Output (I/O) interface 305. Also, electronic device 300 may communicate with one or more networks such as a local area network (Local Area Network, LAN), a wide area network (Wide Area Network, WAN), and/or a public network such as the internet via network adapter 306. As shown in fig. 4, the network adapter 306 communicates with other modules of the electronic device 300 over the bus 303. It should be appreciated that although not shown in fig. 4, other hardware and/or software modules may be used in connection with electronic device 300, including, but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, disk array (Redundant Arrays of Independent Disks, RAID) subsystems, tape drives, data backup storage subsystems, and the like.

It should be noted that the electronic device 300 shown in fig. 4 is only an example, and should not impose any limitation on the functions and application scope of the embodiments of the present application.

The embodiment of the application also provides a computer readable storage medium, which stores computer instructions that when executed by a processor realize the multi-scale lightweight three-dimensional point cloud segmentation method based on self-attention. Specifically, the executable program may be built into or installed in the electronic device 300, so that the electronic device 300 may implement the multi-scale lightweight three-dimensional point cloud segmentation method based on self-attention provided in the embodiments of the present application by executing the built-in or installed executable program.

The self-attention-based multi-scale lightweight three-dimensional point cloud segmentation method provided by the embodiments of the present application may also be implemented as a program product comprising program code for causing the electronic device 300 to perform the self-attention-based multi-scale lightweight three-dimensional point cloud segmentation method provided by the embodiments of the present application when the program product is executable on the electronic device 300.

The program product provided by the embodiments of the present application may employ any combination of one or more readable media, where the readable media may be a readable signal medium or a readable storage medium, and the readable storage medium may be, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof, and more specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a RAM, a ROM, an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), an optical fiber, a portable compact disk read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The program product provided by the embodiments of the present application may be implemented as a CD-ROM and include program code that may also be run on a computing device. However, the program product provided by the embodiments of the present application is not limited thereto, and in the embodiments of the present application, the readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the elements described above may be embodied in one element in accordance with embodiments of the present application. Conversely, the features and functions of one unit described above may be further divided into a plurality of units to be embodied.

Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this is not required to or suggested that these operations must be performed in this particular order or that all of the illustrated operations must be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform.

Finally, it should be noted that the above embodiments are merely illustrative of the technical solution of the present application and not limiting. Although the present application has been described in detail with reference to the embodiments, it should be understood by those skilled in the art that the modifications and equivalents may be made to the technical solutions of the present application without departing from the spirit and scope of the technical solutions of the present application, and all such modifications and equivalents are intended to be encompassed in the scope of the claims of the present application.

Claims

1. A multi-scale lightweight three-dimensional point cloud segmentation method based on self-attention is characterized by comprising the following steps:

processing the first feature map by utilizing a pre-trained width dimension downsampling model to obtain a second feature map, a third feature map and a fourth feature map;

processing the sixth feature map by using a channel attention model which is trained in advance to obtain a point cloud segmentation result;

the multi-scale cavity convolution model includes: a first convolution layer of a 3 x 3 convolution kernel, a parallel multi-channel hole convolution unit, and a global average pooling layer, and a first adder; the multi-channel hole convolution unit comprises four parallel first hole convolution branches, a second hole convolution branch, a third hole convolution branch and a fourth hole convolution branch and a splicing unit; the first cavity convolution branch comprises a second convolution layer with a convolution kernel size of 1 multiplied by 1 and a first cavity convolution layer with a convolution kernel size of 3 multiplied by 3, wherein rate=1 which are connected; the second hole convolution branch comprises a first 3×3 average pooling layer and a second hole convolution layer with a convolution kernel size of 3×3 and rate=12 which are connected; the third hole convolution branch comprises a second average pooling layer of 5×5 and a third hole convolution layer with a convolution kernel size of 3×3 and rate=24 which are connected; the fourth hole convolution branch comprises a 7×7 third average pooling layer and a fourth hole convolution layer with a convolution kernel size of 3×3 and rate=36 which are connected;

processing the feature image X by using a first cavity convolution branch to obtain a feature image X with the size ofFeature map of->

Processing the feature map X by using a second cavity convolution branch to obtain a feature map X with the size ofFeature map of->

Processing the feature map X by using a third cavity convolution branch to obtain a feature map X with the size ofFeature map of->

Processing the feature image X by using a fourth cavity convolution branch to obtain a feature image X with the size ofFeature map of->

2. The method of claim 1, wherein the original three-dimensional point cloud data is converted into a two-dimensional image by spherical transformation; comprising the following steps:

3. The method of claim 2, wherein the width dimension downsampling model comprises a first Fire module, a second Fire module, a third convolution layer of a 1 x 1 convolution kernel, a third Fire module, a fourth convolution layer of a 1 x 1 convolution kernel, a fifth Fire module, and a sixth Fire module connected in sequence;

processing the first feature map by utilizing a pre-trained width dimension downsampling model to obtain a second feature map, a third feature map and a fourth feature map; comprising the following steps:

first feature map Y by using first Fire module ₁ Processing the output result of the first Fire module by using the second Fire module to obtain a second characteristic diagram Y with the size of H multiplied by W multiplied by C ₂ ；

4. A method according to claim 3, wherein the spatial attention model comprises: a fifth convolution layer of four parallel 1 x 1 convolution kernels, a sixth convolution layer of 1 x 1 convolution kernels, a seventh convolution layer of 1 x 1 convolution kernels and an eighth convolution layer of 1 x 1 convolution kernels, a second adder and a spatial attention module;

5. The method of claim 4, wherein the width-dimensional upsampling model comprises: a double up-sampling layer, a quadruple up-sampling layer and a roll-back integration branch in parallel; the deconvolution integral branch comprises a third adder, a first Fire deconvolution layer, a fourth adder, a second Fire deconvolution layer, a fifth adder, a third Fire deconvolution layer and a ninth convolution layer of a 1 multiplied by 1 convolution kernel which are sequentially connected;

Feature map Q using a first Fire deconvolution layer ₁ Processing to obtain a product with a size ofFeature map Q of (2) ₂ ；

Fifth feature map with quadruple upsampling layerZ is processed to obtain a characteristic diagram Q with the size of H multiplied by W multiplied by C ₆ ；

6. The method of claim 5, wherein the method further comprises: and performing joint training on the multi-scale cavity convolution model, the width dimension downsampling model, the spatial attention model, the width dimension upsampling model and the channel attention model.

7. A multi-scale lightweight three-dimensional point cloud segmentation apparatus based on self-attention, comprising:

the downsampling unit is used for processing the first feature map by utilizing a pre-trained width dimension downsampling model to obtain a second feature map, a third feature map and a fourth feature map;

the point cloud segmentation unit is used for processing the sixth feature map by utilizing the channel attention model which is trained in advance to obtain a point cloud segmentation result;

the first processing unit is specifically configured to:

8. An electronic device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to any of claims 1-6 when the computer program is executed.

9. A computer readable storage medium storing computer instructions which, when executed by a processor, implement the method of any one of claims 1-6.