CN110032962B - Object detection method, device, network equipment and storage medium - Google Patents

Object detection method, device, network equipment and storage medium

Info

Publication number
CN110032962B
Authority
CN
China
Prior art keywords
information
network
region
feature
point
Prior art date
Legal status
Active
Application number
CN201910267019.5A
Other languages
Chinese (zh)
Other versions
CN110032962A (en)
Inventor
杨泽同
孙亚楠
賈佳亞
戴宇榮
沈小勇
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910267019.5A priority Critical patent/CN110032962B/en
Publication of CN110032962A publication Critical patent/CN110032962A/en
Priority to PCT/CN2020/077721 priority patent/WO2020199834A1/en
Application granted granted Critical
Publication of CN110032962B publication Critical patent/CN110032962B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/255 - Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/60 - Type of objects
    • G06V20/64 - Three-dimensional objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30248 - Vehicle exterior or interior
    • G06T2207/30252 - Vehicle exterior; Vicinity of vehicle
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/06 - Recognition of objects for industrial automation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses an object detection method, an object detection device, network equipment and a storage medium; according to the embodiment of the invention, the foreground points can be detected from the point cloud of the scene; constructing an object region corresponding to the foreground point based on the foreground point and the preset size to obtain initial positioning information of the candidate object region; extracting features of all points in the point cloud based on the point cloud network to obtain a feature set corresponding to the point cloud; constructing regional characteristic information of the candidate object region based on the characteristic set; predicting the type and the positioning information of the candidate object region based on the region prediction network and the region characteristic information to obtain the predicted type and the predicted positioning information of the candidate object region; and optimizing the candidate object region based on the initial positioning information of the candidate region, the prediction type of the candidate object region and the predicted positioning information to obtain an optimized object detection region and positioning information of the optimized object detection region. The scheme can improve the accuracy of object detection.

Description

Object detection method, device, network equipment and storage medium
Technical Field
The invention relates to the technical field of images, in particular to an object detection method, an object detection device, network equipment and a storage medium.
Background
Object detection refers to determining the location, type, etc. of an object in a scene. At present, the object detection technology is widely applied to various scenes, such as automatic driving, unmanned aerial vehicles and the like.
Current object detection schemes all collect scene images, extract features from the scene images, and then determine the locations and categories of objects in the scene based on the extracted features. However, such schemes suffer in practice from low object detection accuracy, especially in 3D object detection scenarios.
Disclosure of Invention
The embodiment of the invention provides an object detection method, an object detection device, network equipment and a storage medium, which can improve the accuracy of object detection.
The embodiment of the invention provides an object detection method, which comprises the following steps:
detecting foreground points from the point cloud of the scene;
constructing an object region corresponding to the foreground point based on the foreground point and a preset size to obtain initial positioning information of the candidate object region;
extracting features of all points in the point cloud based on a point cloud network to obtain a feature set corresponding to the point cloud;
constructing regional characteristic information of the candidate object region based on the characteristic set;
predicting the type and the positioning information of the candidate object area based on an area prediction network and the area characteristic information to obtain the predicted type and the predicted positioning information of the candidate object area;
and optimizing the candidate object region based on the initial positioning information of the candidate region, the prediction type of the candidate object region and the predicted positioning information to obtain an optimized object detection region and positioning information of the optimized object detection region.
Correspondingly, an embodiment of the present invention further provides an object detection apparatus, including:
the detection unit is used for detecting foreground points from the point cloud of the scene;
the area construction unit is used for constructing an object area corresponding to the foreground point based on the foreground point and a preset size to obtain initial positioning information of a candidate object area;
the characteristic extraction unit is used for extracting the characteristics of all points in the point cloud based on a point cloud network to obtain a characteristic set corresponding to the point cloud;
the feature construction unit is used for constructing regional feature information of the candidate object region based on the feature set;
the prediction unit is used for predicting the type and the positioning information of the candidate object area based on the area prediction network and the area characteristic information to obtain the predicted type and the predicted positioning information of the candidate object area;
and the optimization unit is used for optimizing the candidate object region based on the initial positioning information of the candidate region, the prediction type of the candidate object region and the prediction positioning information to obtain an optimized object detection region and positioning information of the optimized object detection region.
In an embodiment, the detection unit is configured to perform semantic segmentation on an image of a scene to obtain a foreground pixel; and determining a point corresponding to the foreground pixel in the point cloud of the scene as a foreground point.
In an embodiment, the region construction unit is specifically configured to generate an object region corresponding to the foreground point by using the foreground point as a central point and according to a predetermined size.
In an embodiment, the feature construction unit specifically includes:
a selection subunit configured to select a plurality of target points in the object candidate region;
an extracting subunit, configured to extract a feature of the target point from the feature set, to obtain first partial feature information of the candidate object region;
a construction subunit configured to construct second partial feature information of the object candidate region based on the position information of the target point;
and the fusion subunit is configured to fuse the first part of feature information and the second part of feature information to obtain the region feature information of the candidate region.
In an embodiment, the building subunit is specifically configured to:
standardizing the position information of the target point to obtain standardized position information of the target point;
fusing the first part of feature information and the standardized position information to obtain fused feature information of a target point;
carrying out spatial transformation on the fused feature information of the target point to obtain transformed position information;
and adjusting the standardized position information of the target point based on the transformed position information to obtain second part feature information of the candidate object region.
In one embodiment, the point cloud network comprises a first sampling network and a second sampling network connected with the first sampling network; the feature extraction unit may include:
the down-sampling sub-unit is used for performing feature down-sampling operation on all points in the point cloud through the first sampling network to obtain initial features of the point cloud;
and the up-sampling subunit is used for performing up-sampling operation on the initial features through the second sampling network to obtain a feature set of the point cloud.
In one embodiment, the first sampling network comprises a plurality of set abstraction layers which are connected in sequence, and the second sampling network comprises a plurality of feature propagation layers which are connected in sequence and correspond to the set abstraction layers;
a down-sampling sub-unit, specifically configured to:
sequentially carrying out local area division on points in the point cloud through the set abstraction layer, and extracting the characteristics of the central points of the local areas to obtain the initial characteristics of the point cloud;
inputting initial features of the point cloud to a second sampling network;
the upsampling subunit is specifically configured to:
determining the output characteristics of the previous layer and the output characteristics of the set abstraction layer corresponding to the current characteristic propagation layer as the current input characteristics of the current characteristic propagation layer;
and performing up-sampling operation on the current input features through the current feature propagation layer to obtain a feature set of the point cloud.
In one embodiment, the regional prediction network comprises a feature extraction network, a classification network connected to the feature extraction network, and a regression network connected to the feature extraction network;
the prediction unit specifically includes:
the global feature extraction subunit is used for performing feature extraction on the region feature information through the feature extraction network to obtain global feature information of the candidate object region;
the classification subunit is configured to classify the candidate object region based on the classification network and the global feature information to obtain a prediction type of the candidate region;
and the regression subunit is used for positioning the candidate object region based on the regression network and the global feature information to obtain the predicted positioning information of the candidate region.
In one embodiment, the feature extraction network comprises a plurality of sequentially connected set abstraction layers, the classification network comprises a plurality of sequentially connected fully connected layers, and the regression network comprises a plurality of sequentially connected fully connected layers;
and the global feature extraction subunit is used for sequentially performing feature extraction on the regional feature information through a set abstraction layer in the feature extraction network to obtain global feature information of the candidate object region.
In an embodiment, the optimization unit may specifically include:
the screening subunit is used for screening the candidate object region based on the prediction type of the candidate object region to obtain a screened object region;
and the optimizing subunit is used for optimizing and adjusting the initial positioning information of the screened object region according to the predicted positioning information of the screened object region to obtain an optimized object detection region and positioning information thereof.
The embodiment of the invention also provides network equipment, which comprises a memory and a processor; the memory stores a plurality of instructions, and the processor loads the instructions in the memory to execute the steps in any one of the object detection methods provided by the embodiments of the present invention.
In addition, the embodiment of the present invention further provides a storage medium, where the storage medium stores a plurality of instructions, and the instructions are suitable for being loaded by a processor to perform any of the steps in the object detection method provided in the embodiment of the present invention.
According to the embodiment of the invention, the foreground points can be detected from the point cloud of the scene; constructing an object region corresponding to the foreground point based on the foreground point and a preset size to obtain initial positioning information of the candidate object region; extracting features of all points in the point cloud based on a point cloud network to obtain a feature set corresponding to the point cloud; constructing regional characteristic information of the candidate object region based on the characteristic set; predicting the type and the positioning information of the candidate object region based on a region prediction network and the region characteristic information to obtain the predicted type and the predicted positioning information of the candidate object region; and optimizing the candidate object region based on the initial positioning information of the candidate region, the prediction type of the candidate object region and the predicted positioning information to obtain an optimized object detection region and positioning information of the optimized object detection region. According to the scheme, the point cloud data of the scene can be adopted for object detection, the candidate detection area can be generated for each point, and the candidate detection area is optimized based on the area characteristics of the candidate detection area; therefore, the accuracy of object detection can be greatly improved, and the method is particularly suitable for 3D object detection.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1a is a schematic view of a scene of an object detection method according to an embodiment of the present invention;
FIG. 1b is a flow chart of an object detection method provided by an embodiment of the invention;
FIG. 1c is a schematic structural diagram of a point cloud network according to an embodiment of the present invention;
fig. 1d is a schematic diagram of a PointNet + + network structure according to an embodiment of the present invention;
FIG. 1e is a schematic diagram illustrating an object detection effect in an automatic driving scene according to an embodiment of the present invention;
FIG. 2a is a schematic diagram of semantic segmentation of an image according to an embodiment of the present invention;
FIG. 2b is a schematic diagram of point cloud segmentation provided by an embodiment of the present invention;
FIG. 2c is a schematic diagram illustrating candidate region generation according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of candidate region feature construction according to an embodiment of the present invention;
FIG. 4a is a schematic structural diagram of a regional prediction network according to an embodiment of the present invention;
Fig. 4b is a schematic structural diagram of a regional prediction network according to an embodiment of the present invention;
FIG. 5a is a schematic flow chart of object detection provided by the embodiment of the present invention;
FIG. 5b is an architectural diagram of object detection provided by an embodiment of the present invention;
FIG. 5c is a schematic diagram of the results of a test experiment provided by an embodiment of the present invention;
FIG. 6a is a schematic structural diagram of an object detection apparatus according to an embodiment of the present invention;
FIG. 6b is a schematic structural diagram of an object detecting device according to an embodiment of the present invention;
FIG. 6c is a schematic structural diagram of an object detecting device according to an embodiment of the present invention;
FIG. 6d is a schematic structural diagram of an object detecting device according to an embodiment of the present invention;
FIG. 6e is a schematic structural diagram of an object detecting device according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a network device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides an object detection method, an object detection device, network equipment and a storage medium. The object detection device can be integrated in a network device, and the network device can be a server or a terminal; for example, the network device may include a vehicle-mounted device, a mini-box, and the like.
Object detection may refer to determining or identifying the location, category, and the like of an object in a certain scene, for example, identifying the categories and locations of objects such as street lamps and vehicles in a road scene.
Referring to fig. 1a, an object detection system provided by an embodiment of the present invention includes a network device, a collection device, and the like; the network device is connected with the acquisition device, for example, through a wired or wireless network connection. In one embodiment, the network device and the acquisition device may be integrated into one device.
The acquisition equipment can be used for acquiring point cloud data or image data of a scene, and the acquisition equipment can upload the acquired point cloud data to the network equipment in real time for processing.
The network equipment can be used for object detection, and specifically can detect foreground points from point clouds of a scene; construct an object region corresponding to the foreground point based on the foreground point and the preset size to obtain initial positioning information of the candidate object region; extract features of all points in the point cloud based on the point cloud network to obtain a feature set corresponding to the point cloud; construct regional characteristic information of the candidate object region based on the characteristic set; predict the type and the positioning information of the candidate object region based on the region prediction network and the region characteristic information to obtain the predicted type and the predicted positioning information of the candidate object region; and optimize the candidate object region based on the initial positioning information of the candidate region, the prediction type of the candidate object region and the predicted positioning information to obtain an optimized object detection region and positioning information thereof. In practical applications, after the positioning information of the optimized object detection region is obtained, the detected object may be identified in the scene image according to the positioning information, for example, framed in the image in a detection-frame manner; in an embodiment, the type of the detected object may also be identified in the scene image.
The following are detailed below. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.
In this embodiment, description will be made from the perspective of an object detection apparatus, which may be specifically integrated in a network device, where the network device may be a server or a terminal; the terminal may include a mobile phone, a tablet computer, a notebook computer, a Personal Computer (PC), a microprocessor terminal, and the like.
An object detection method provided in an embodiment of the present invention may be executed by a processor of a network device, and as shown in fig. 1b, a specific process of the object detection method may be as follows:
101. foreground points are detected from a point cloud of a scene.
The point cloud is a point set of surface characteristics of a scene or a target, and the points in the point cloud may include position information of the points, such as three-dimensional coordinates, and may further include color information (RGB) or reflection Intensity information (Intensity).
The point cloud may be obtained by a laser measurement principle or a photogrammetry principle; for example, the point cloud of an object may be scanned by a laser scanner or a photographic scanner. The principle of laser point cloud detection is as follows: when a beam of laser irradiates the surface of an object, the reflected laser carries information such as direction and distance. When the laser beam is scanned along a certain trajectory, the reflected laser point information is recorded while scanning; since the scanning is extremely fine, a large number of laser points can be obtained, forming a laser point cloud. Point cloud formats include *.las, *.pcd, *.txt, and the like.
In the embodiment of the invention, the point cloud data of the scene can be acquired by the network equipment, or acquired by other equipment, and the network equipment acquires the point cloud data from other equipment, or searches from a network database, and the like.
The scene can be various, for example, a road scene in automatic driving, an aviation scene in flight of the unmanned aerial vehicle, and the like.
Foreground points are defined relative to background points: a scene can be divided into a background and a foreground, the points of the background can be called background points, and the points of the foreground can be called foreground points. According to the embodiment of the invention, the foreground points in the scene point cloud can be identified by performing semantic segmentation on the point cloud of the scene.
In the embodiment of the invention, the foreground points can be detected from the point clouds in various ways, for example, the point clouds of a scene can be directly subjected to semantic segmentation to obtain the foreground points in the point clouds.
Semantic Segmentation may refer to classifying each point in a scene so as to identify points of a certain type.
The semantic segmentation mode may be various, for example, 2D semantic segmentation or 3D semantic segmentation may be adopted to perform semantic segmentation on the point cloud.
For another example, in order to detect more foreground points and improve the detection reliability and accuracy of the foreground points, in an embodiment, the image of the scene may be subjected to semantic segmentation to obtain foreground pixels, and then the foreground pixels are mapped to the point cloud to obtain the foreground points. Specifically, the step of "detecting foreground points from a point cloud of a scene" may include:
performing semantic segmentation on the image of the scene to obtain foreground pixels;
and determining a point corresponding to the foreground pixel in the point cloud of the scene as a foreground point. For example, the foreground pixel may be mapped to a point cloud of the scene to obtain a target point corresponding to the foreground pixel in the point cloud (for example, mapping may be implemented based on a mapping relationship between pixels in the image and a midpoint of the point cloud, such as a position mapping relationship), and the target point is determined as the foreground point.
In an embodiment, a point in the point cloud may be projected into an image of a scene, for example, the point may be projected into the image of the scene through a mapping relationship matrix or a transformation matrix between the point cloud and a pixel, then, a segmentation result (such as a foreground pixel, a background pixel, and the like) in the image corresponding to the point is used as a segmentation result of the point, and a foreground point is determined from the point cloud based on the segmentation result corresponding to the point, specifically, when the segmentation result of the point is a foreground pixel, the point is determined to be a foreground point.
In order to improve the accuracy of semantic segmentation, the semantic segmentation in the embodiment of the present invention may be implemented by a segmentation network based on deep learning; for example, DeepLabv3 based on Xception may be used as the segmentation network, and the image of the scene is segmented by the segmentation network to obtain foreground pixels, such as foreground pixels of vehicles, pedestrians, and cyclists in automatic driving. Then, the points in the point cloud are projected into the image of the scene, and the segmentation result at the corresponding pixel is used as the segmentation result of the point, so as to generate the foreground points in the point cloud. In this way, the foreground points in the point cloud can be detected accurately.
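As an illustration of this projection-based labeling, the following is a minimal sketch assuming a calibrated 3x4 camera projection matrix and a boolean foreground mask produced by the 2D segmentation network; the function name and array shapes are illustrative and not taken from the patent.

```python
import numpy as np

def label_foreground_points(points_xyz, proj_matrix, foreground_mask):
    """Project 3D points into the image and let each point inherit the 2D
    segmentation result of the pixel it falls on.

    points_xyz:      (N, 3) point cloud coordinates
    proj_matrix:     (3, 4) camera projection matrix (assumed calibration)
    foreground_mask: (H, W) boolean mask output by the 2D segmentation network
    Returns a boolean array marking the foreground points.
    """
    n = points_xyz.shape[0]
    pts_h = np.hstack([points_xyz, np.ones((n, 1))])   # homogeneous coordinates
    uvw = pts_h @ proj_matrix.T                        # (N, 3) image-plane coords
    depth = uvw[:, 2]
    in_front = depth > 0                               # keep points in front of the camera

    is_foreground = np.zeros(n, dtype=bool)
    u = (uvw[in_front, 0] / depth[in_front]).astype(int)
    v = (uvw[in_front, 1] / depth[in_front]).astype(int)

    h, w = foreground_mask.shape
    in_image = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    hits = np.zeros(in_front.sum(), dtype=bool)
    hits[in_image] = foreground_mask[v[in_image], u[in_image]]
    is_foreground[in_front] = hits
    return is_foreground
```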
102. And constructing an object region corresponding to the foreground point based on the foreground point and the preset size to obtain the initial positioning information of the candidate object region.
After the foreground points are obtained, the embodiment of the invention can construct the object area corresponding to each foreground point based on the foreground points and the preset size, and the object area corresponding to the foreground point is used as the candidate object area.
The object region may be a two-dimensional (2D) region or a three-dimensional (3D) region, which may be determined according to actual requirements. The predetermined size may be set according to actual requirements and may include predetermined size parameters; for example, a 2D region has length l × width w, and a 3D region has length l × width w × height h.
For example, to improve the accuracy of object detection, an object region corresponding to a foreground point may be generated by taking the foreground point as a central point and according to a predetermined size.
The positioning information of the candidate object region may include position information, size information, and the like of the candidate object region.
For example, in an embodiment, to facilitate calculation and improve object detection, the position information of the candidate object region may be represented by the position information of a reference point in the region; that is, the position information of the candidate object region may include the position information of a reference point in the candidate object region. The reference point may be set according to actual requirements, for example, as the center point of the candidate object region. Taking a three-dimensional region as an example, the position information of the candidate object region may include the 3D coordinates of the center point, such as (x, y, z).
The size information of the candidate region may include size parameters of the region; for example, a 2D candidate region includes length l × width w, and a 3D candidate region includes length l × width w × height h.
Furthermore, in some scenes the orientation of an object is important reference information, so in some embodiments the positioning information of the candidate object region may also include the orientation of the candidate object region, such as forward, backward, downward, or upward, which indicates the orientation of the object in the scene. In practice, the orientation of the region may be expressed as an angle; for example, two orientations, 0° and 90°, may be defined.
In practical applications, for convenience of object detection and user observation, the object region may be identified in the form of a frame, for example, a 2D detection frame or a 3D detection frame, where the detection frame represents the object region and the candidate detection frame represents the candidate object region.
For example, taking a driving road scene as an example, referring to fig. 2a, semantic segmentation may be performed on an image by using a 2D segmentation network to obtain an image segmentation result (including foreground pixels, etc.); then, referring to fig. 2b, the image segmentation result is mapped into the point cloud to obtain a point cloud segmentation result (including the foreground points). Then, a candidate region is generated with each foreground point as a center; the candidate region generation diagram is shown in fig. 2c. A 3D detection frame of a manually specified size is generated as a candidate region with each point as the center. The candidate region is represented by (x, y, z, l, h, w, angle), where x, y, z are the 3D coordinates of the center point, and l, h, w are the preset length, height, and width of the candidate region. In practical experiments, l = 3.8, h = 1.6, and w = 1.5; angle represents the orientation of the 3D candidate region, and when generating candidate regions, the embodiment of the present invention adopts two orientations, 0° and 90°.
Through the steps, the embodiment of the invention can generate a candidate object area, such as a 3D candidate object detection frame, for each foreground point.
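The candidate-region generation described above can be sketched as follows; the helper name is hypothetical, and the default sizes simply reuse the l = 3.8, h = 1.6, w = 1.5 values and the two orientations quoted in the text.

```python
import numpy as np

def generate_candidates(foreground_xyz, l=3.8, h=1.6, w=1.5, angles=(0.0, 90.0)):
    """Create one candidate 3D box per foreground point and per orientation.

    Each candidate is (x, y, z, l, h, w, angle) with the foreground point
    used as the box center; sizes follow the values quoted in the text.
    """
    candidates = []
    for x, y, z in foreground_xyz:
        for angle in angles:
            candidates.append((x, y, z, l, h, w, angle))
    return np.array(candidates, dtype=np.float32)   # (num_points * len(angles), 7)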
103. And extracting the features of all the points in the point cloud based on the point cloud network to obtain a feature set corresponding to the point cloud.
The point cloud network may be a deep learning-based network, for example, a point cloud network such as PointNet and PointNet + +. In the embodiment of the present invention, the timing sequence between step 103 and step 102 is not limited by the sequence number, and step 103 may be executed before step 102, or may be executed simultaneously.
Specifically, all points in the point cloud may be input to a point cloud network, and the point cloud network performs feature extraction on the input points to obtain a feature set of the point cloud.
Taking PointNet++ as an example to describe the point cloud network, as shown in fig. 1c, the point cloud network may include a first sampling network and a second sampling network; the first sampling network is connected with the second sampling network. In practice, the first sampling network may be referred to as an encoder and the second sampling network as a decoder. Specifically, a feature downsampling operation is performed on all points in the point cloud through the first sampling network to obtain initial features of the point cloud; and an upsampling operation is performed on the initial features through the second sampling network to obtain a feature set of the point cloud.
Referring to fig. 1d, the first sampling network includes a plurality of sequentially connected set abstraction (SA) layers, and the second sampling network includes a plurality of feature propagation (FP) layers connected in sequence and corresponding to the set abstraction layers. The number of SA and FP layers can be set according to actual requirements; for example, three SA layers and three FP layers may be included.
Referring to fig. 1d, the first sampling network may comprise three downsampling operations (i.e. the encoding phase comprises three downsampling operations), with the numbers of points being 1024, 256, and 64; the second sampling network may comprise three upsampling operations (i.e. the decoding phase comprises three upsampling operations), with the numbers of points in the three steps being 256, 1024, and N. The point cloud network feature extraction process is as follows:
inputting all points of the point cloud into the first sampling network, sequentially carrying out local region division on the points in the point cloud through the set abstraction (SA) layers in the first sampling network, and extracting the features of the central points of the local regions to obtain the initial features of the point cloud; for example, referring to fig. 1d, after the input point cloud N × 4 passes through three layers of SA downsampling operations, the output point cloud feature is 64 × 1024.
In the embodiment of the invention, PointNet++ uses the idea of hierarchical feature extraction, and each level is called set abstraction. It comprises three parts: a sampling layer, a grouping layer, and a feature extraction layer. First, the sampling layer: in order to extract some relatively important central points from the dense point cloud, a farthest point sampling (FPS) method is adopted; these points do not necessarily have semantic information, and random sampling is also possible. Then the grouping layer searches, within a certain range of each central point extracted by the previous layer, for the nearest k neighboring points to form a patch. The feature extraction layer obtains the features of these k points through convolution and pooling by a small PointNet network and takes them as the feature of the central point, which is then sent to the next layer. Thus, the central points obtained by each layer are a subset of the central points of the previous layer; as the number of layers increases, the number of central points decreases, but each central point contains more and more information.
In light of the above description, the first sampling network in the embodiment of the present invention is composed of a plurality of SA layers; at each level, a set of points is processed and abstracted to generate a new set with fewer elements. The set abstraction layer consists of three key layers: a sampling layer, a grouping layer, and a point cloud network (PointNet) layer. The sampling layer selects, from the input points, a set of points that define the centroids of local regions. The grouping layer constructs sets of local regions by finding "neighboring" points around each centroid. The PointNet layer uses a mini-PointNet to encode local region patterns into feature vectors.
In an embodiment, considering that an actual point cloud is rarely evenly distributed, during sampling, small-scale sampling should be used in dense areas to capture fine details, while large-scale sampling should be used in sparse areas, because too small a scale would result in insufficient sampling where the points are sparse. Accordingly, embodiments of the present invention provide an improved SA layer. Specifically, the grouping layer in the SA layer may use multi-scale grouping (MSG), in which local features at each radius are extracted and then combined together; the idea is to sample multi-scale features in the grouping layer and concatenate (concat) them. For example, referring to fig. 1d, MSG grouping is used in the first and second SA layers.
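A minimal sketch of multi-scale grouping for a single centroid is given below; the radii and neighbor count are assumed for illustration, and max pooling stands in for the mini-PointNet that would normally extract each scale's feature.

```python
import numpy as np

def multi_scale_group(xyz, feats, center, radii=(0.2, 0.4, 0.8), k=16):
    """Gather neighbours of one centroid at several radii, pool each scale,
    and concatenate the pooled features (the MSG idea)."""
    d = np.linalg.norm(xyz - center, axis=1)
    scales = []
    for r in radii:
        idx = np.where(d < r)[0][:k]
        if idx.size == 0:                      # sparse region: fall back to the nearest point
            idx = np.array([d.argmin()])
        scales.append(feats[idx].max(axis=0))  # max-pool stands in for the mini-PointNet
    return np.concatenate(scales)              # multi-scale feature of this centroid
```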
Furthermore, in an embodiment, in order to improve robustness to variations in sampling density, single-scale grouping (SSG) may be used in the SA layer, for example in the SA layer used for the output.
After the first sampling network outputs the initial features of the point cloud, the initial features may be input to the second sampling network, through which they are upsampled, for example by a residual upsampling operation. For example, referring to fig. 1d, after the 64 × 1024 features are upsampled through the three FP layers of the second sampling network, N × 128 features are output.
In one implementation, in order to prevent gradient degradation or loss of features, the output features of each SA layer in the first sampling network may also be considered when the second sampling network performs the upsampling operation. Specifically, the step of performing the upsampling operation on the initial features through the second sampling network to obtain the feature set of the point cloud includes:
determining the output characteristics of the previous layer and the output characteristics of the set abstraction layer corresponding to the current characteristic propagation layer as the current input characteristics of the current characteristic propagation layer;
and performing up-sampling operation on the current input features through the current feature propagation layer to obtain a feature set of the point cloud.
For example, referring to fig. 1d, when the 64 × 1024 point cloud features are input to the first FP layer, the first FP layer takes the 64 × 1024 point cloud features and the 256 × 256 features output by the corresponding SA layer as the current input features, performs an upsampling operation on them, and outputs the obtained features to the second FP layer. The second FP layer takes the 256 × 128 features output by the previous FP layer and the 1024 × 128 features output by the first SA layer as the input features of the current layer, performs an upsampling operation, and obtains 1024 × 128 features that are input to the third FP layer. The third FP layer takes the 1024 × 128 features output by the second FP layer and the originally input N × 4 features as the input features of the current layer, performs an upsampling operation, and outputs the final features of the point cloud.
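The following sketch only models the tensor shapes and the skip-connection wiring of the encoder-decoder described above (SA downsampling to 1024, 256, and 64 points, then FP upsampling back to N points); the placeholder layers use random subsampling and random projections instead of real set abstraction and feature propagation, so all values are illustrative.

```python
import numpy as np

def sa_layer(xyz, feats, n_out, c_out):
    """Placeholder set-abstraction layer: subsample points and fake features.
    (A real SA layer groups neighbours and runs a mini-PointNet; here only
    the tensor shapes are modelled so the skip-connection wiring is visible.)"""
    idx = np.random.choice(xyz.shape[0], n_out, replace=False)
    return xyz[idx], np.random.randn(n_out, c_out).astype(np.float32)

def fp_layer(xyz_dst, xyz_src, feats_src, skip_feats, c_out):
    """Placeholder feature-propagation layer: copy each destination point's
    nearest source feature, concatenate the skip features, and project."""
    d = ((xyz_dst[:, None, :] - xyz_src[None, :, :]) ** 2).sum(-1)
    nearest = d.argmin(axis=1)
    upsampled = feats_src[nearest]
    fused = np.concatenate([upsampled, skip_feats], axis=1)
    # stand-in for a shared MLP mapping the fused features to c_out channels
    return fused @ np.random.randn(fused.shape[1], c_out).astype(np.float32)

# Wiring that mirrors fig. 1d: N x 4 input, SA: 1024 -> 256 -> 64, FP back to N x 128
N = 4096
xyz0 = np.random.randn(N, 3).astype(np.float32)
feats0 = np.random.randn(N, 4).astype(np.float32)

xyz1, f1 = sa_layer(xyz0, feats0, 1024, 128)
xyz2, f2 = sa_layer(xyz1, f1, 256, 256)
xyz3, f3 = sa_layer(xyz2, f2, 64, 1024)

u2 = fp_layer(xyz2, xyz3, f3, f2, 256)       # FP1: previous output + SA skip features
u1 = fp_layer(xyz1, xyz2, u2, f1, 128)       # FP2: FP1 output + first SA layer skip
u0 = fp_layer(xyz0, xyz1, u1, feats0, 128)   # FP3: FP2 output + raw input skip
print(u0.shape)                              # (N, 128) per-point feature set
```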
Through the steps, feature extraction can be performed on all points in the point cloud to obtain the feature set of the point cloud, information loss is prevented, and accuracy of object detection is improved.
104. And constructing regional characteristic information of the candidate object region based on the characteristic set.
The method for constructing the feature information of the candidate object region based on the point cloud feature set can be various, for example, the features of some points can be selected from the feature set to serve as the feature information of the region to which the points belong; for another example, the position information of some points may be selected from the feature set as the feature information of the region.
For another example, in order to improve the extraction accuracy of the region features, the features and position information of some points may be combined to construct the region feature information. Specifically, the step of "constructing the regional characteristic information of the candidate object region based on the feature set" may include:
selecting a plurality of target points in the candidate object area;
extracting the features of the target point from the feature set to obtain first part feature information of the candidate object region;
constructing a second partial feature of the candidate object region based on the position information of the target point;
and fusing the first part of feature information and the second part of feature information to obtain the region features of the candidate region.
The number of the target points and the selection mode may be set according to actual requirements, for example, a certain number of, for example, 512 points may be selected randomly in the candidate object region or according to a certain selection mode (e.g., selection based on a distance from the central point, etc.).
After the target points are selected from the candidate object region, the features of the target points may be extracted from the feature set of the point cloud, and the extracted features serve as the first partial feature information of the candidate object region (which may be denoted by F1). For example, after 512 points are randomly selected, the features of these 512 points may be extracted from the feature map (i.e., the feature set) of the point cloud to compose the first partial feature information F1.
For example, referring to fig. 3, the features F1 (B, M, C) of, for example, 512 target points in the candidate region are cropped from the feature set (B, N, C) of the point cloud, where M is the number of target points (e.g., M = 512) and N is the number of points in the point cloud.
There are various ways to construct the second partial feature of the area based on the position information of the target point, for example, the position information of the target point may be directly used as the second partial feature information of the area (which may be denoted by F2). For another example, in order to improve the accuracy of extracting the position feature, the position information may be transformed to construct the second partial feature of the region. For example, the step of "constructing the second partial feature information of the object candidate region based on the position information of the target point" may include:
(1) and standardizing the position information of the target point to obtain standardized position information of the target point.
The position information of the target point may include coordinate information of the target point, such as a 3D coordinate xyz, and the normalization process (normalization) of the position information may be set according to an actual requirement, for example, the position information of the target point may be adjusted based on the position information of the center point of the area. Such as subtracting the 3D coordinates of the center of the area from the 3D coordinates of the target point, etc.
(2) And fusing the first part of feature information and the standardized position information to obtain fused feature information of the target point.
For example, referring to fig. 3, normalized position information (e.g., 3D coordinates xyz) of 512 points may be fused with the first partial feature F1, and specifically, the normalized position information and the first partial feature may be fused in a Concat manner to obtain a fused feature (B, N, C + 3).
(3) And carrying out spatial transformation on the fused feature information of the target point to obtain transformed position information of the target point.
In order to further improve the extraction accuracy of the second partial features, spatial transformation can be performed on the fused features.
For example, in one embodiment, a Spatial Transform Network (STN) may be used for the transformation, such as a supervised spatial transform network such as T-Net. Referring to fig. 3, the fused feature (B, N, C +3) can be spatially transformed by T-Net to obtain transformed coordinates (B, 3).
(4) And adjusting the standardized position information of the target point based on the transformed position information to obtain second part characteristic information of the candidate object region.
For example, the transformed position value may be subtracted from the normalized position value of the target point to obtain the second partial feature F2 of the candidate object region. Referring to fig. 3, the transformed 3D coordinates (B, 3) may be subtracted from the normalized target point 3D coordinates (B, N, 3) to obtain the second partial feature F2.
Because the feature is subjected to spatial transformation, after the transformed position is subtracted from the position feature, the geometric stability or spatial invariance of the position feature can be improved, and the accuracy of feature extraction is improved.
The first part of feature information and the second part of feature information of each candidate object region can be obtained through the method, and then the two parts of features are fused to obtain the region feature information of each candidate object region. For example, referring to fig. 3, F1 and F2 may be connected (Concat) to obtain a connected feature (B, N, C +3) of the object candidate region, and the feature may be used as the region feature of the object candidate region.
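A compact sketch of this region-feature construction is shown below; the t_net argument stands in for the spatial transform network, and all names and shapes are assumptions for illustration.

```python
import numpy as np

def build_region_feature(point_feats, point_xyz, region_point_idx, box_center, t_net):
    """Build the region feature of one candidate box.

    point_feats:       (N, C) per-point features from the point cloud network
    point_xyz:         (N, 3) point coordinates
    region_point_idx:  indices of the M sampled target points (e.g. M = 512)
    box_center:        (3,) center of the candidate box, used for normalization
    t_net:             callable mapping an (M, C + 3) array to a (3,) offset
                       (stands in for the T-Net spatial transform network)
    """
    f1 = point_feats[region_point_idx]                     # (M, C) first partial feature
    xyz_norm = point_xyz[region_point_idx] - box_center    # normalized position information
    fused = np.concatenate([f1, xyz_norm], axis=1)         # (M, C + 3) fused feature
    offset = t_net(fused)                                  # transformed position information
    f2 = xyz_norm - offset                                 # (M, 3) second partial feature
    return np.concatenate([f1, f2], axis=1)                # (M, C + 3) region feature
```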
105. And predicting the type and the positioning information of the candidate object region based on the region prediction network and the region characteristic information to obtain the predicted type and the predicted positioning information of the candidate object region.
The area prediction network may be a deep learning-based area prediction network, and may be trained from point clouds or images of sample objects.
The predicted positioning information may include predicted position information such as 2D or 3D coordinates, dimensions such as length, width, etc., and in an embodiment, predicted orientation information such as 0 ° or 90 °.
Referring to fig. 4a, the regional prediction network may include a feature extraction network, a classification network, and a regression network, and the classification network and the regression network are respectively connected to the feature extraction network. The following were used:
the feature extraction network is configured to perform feature information on the input information, for example, perform feature extraction on regional feature information of the candidate object region to obtain global feature information of the candidate object region.
The classification network is configured to classify the regions, for example, the candidate object regions may be classified based on the global feature information of the candidate object regions, so as to obtain the prediction types of the candidate object regions.
The regression network is used to locate the region, for example, to locate the candidate object region to obtain the predicted location information of the candidate object region. Since the location is predicted using a regression network, the output predicted location information may also be referred to as regression information, such as predicted regression information.
For example, the step of "predicting the type and the positioning information of the candidate object region based on the region prediction network and the region feature information to obtain the predicted type and the predicted positioning information of the candidate object region" may include:
extracting the characteristics of the regional characteristic information through a characteristic extraction network to obtain the global characteristic information of the candidate object region;
classifying the candidate object region based on the classification network and the global feature information to obtain the prediction type of the candidate region;
and positioning the candidate object region based on the regression network and the global characteristic information to obtain the predicted positioning information of the candidate region.
In order to improve the accuracy of prediction, referring to fig. 4b, the feature extraction network in the embodiment of the present invention may include: a plurality of sequentially connected collection abstraction layers (SA layers); the classification network may comprise a plurality of fully connected layers (fc) connected in sequence, as shown in fig. 4b, comprising a plurality of fcs for classification, such as cls-fc1, cls-fc2, cls-pred. Wherein the regression network comprises a plurality of fully connected layers connected in sequence, as shown in fig. 4b, and comprises a plurality of fcs for regression, such as reg-fc1, reg-fc2, and reg-pred. In the embodiment of the present invention, the number of SA layers and fc layers can be set according to actual requirements.
In this embodiment of the present invention, the process of extracting global feature information of a region may include: and sequentially carrying out feature extraction on the regional feature information through a set abstraction layer in the feature extraction network to obtain the global feature information of the candidate object region.
In an embodiment, the packets in the SA layer may be grouped in a single scale manner, that is, SSG grouping is adopted, so as to improve accuracy and efficiency of global feature extraction.
Referring to fig. 4b, the area prediction network may perform feature extraction on the region feature information sequentially through three SA layers; for example, when the input is an M × 131 feature, the three SA layers perform feature extraction to obtain features of 128 × 128, 32 × 256, and the like, respectively. After the SA-layer features are extracted, global features are obtained; at this time, the global features may be input to the classification network and the regression network, respectively.
The classification network carries out dimensionality reduction on the features through the first two cls-fc1 and cls-fc2, carries out classification prediction on the last cls-pred layer, and outputs the prediction type of the region.
The regression network carries out dimensionality reduction on the features through the first two reg-fc1 and reg-fc2, and regression prediction is carried out on the last reg-pred layer to obtain the prediction positioning information of the region.
The types of regions can be set according to actual requirements; for example, regions may be classified by whether they contain an object or not, or divided by quality into high, medium, and low.
The type and positioning information of each candidate object region can be predicted through the steps.
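The two prediction branches can be sketched as follows; as an assumption for illustration, the three SA layers of the feature extraction network are approximated by a shared point-wise MLP with max pooling, and the layer widths are not values from the patent.

```python
import torch
import torch.nn as nn

class RegionPredictionHead(nn.Module):
    """Sketch of the classification / regression branches on a global region feature."""

    def __init__(self, in_channels=131, num_classes=2, box_params=7):
        super().__init__()
        # stand-in for the SA-layer feature extraction network
        self.point_mlp = nn.Sequential(
            nn.Linear(in_channels, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
        )
        # cls-fc1, cls-fc2, cls-pred: dimensionality reduction then classification
        self.cls_head = nn.Sequential(
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )
        # reg-fc1, reg-fc2, reg-pred: predicts (x, y, z, l, h, w, angle) localization
        self.reg_head = nn.Sequential(
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, box_params),
        )

    def forward(self, region_feats):                 # (B, M, C) per-point region features
        point_feats = self.point_mlp(region_feats)
        global_feat = point_feats.max(dim=1).values  # (B, 512) global feature
        return self.cls_head(global_feat), self.reg_head(global_feat)
```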
106. And optimizing the candidate object region based on the initial positioning information of the candidate region, the prediction type of the candidate object region and the predicted positioning information to obtain an optimized object detection region and positioning information of the optimized object detection region.
The optimization method may be various; for example, the positioning information of the candidate object region may be adjusted based on the predicted positioning information, and then the candidate object region may be screened based on the prediction type. For another example, in one embodiment, the candidate regions may be screened based on the prediction type first, and then the positioning information may be adjusted.
For example, the step of "performing optimization processing on the candidate object region based on the initial positioning information of the candidate region, the prediction type of the candidate object region, and the predicted positioning information to obtain the optimized object detection region and the positioning information thereof" may include:
screening the candidate object region based on the prediction type of the candidate object region to obtain a screened object region;
and optimizing and adjusting the initial positioning information of the screened object area according to the predicted positioning information of the screened object area to obtain an optimized object detection area and positioning information thereof.
For example, when the prediction types include object region and empty region, the empty regions, i.e., candidate object regions that do not contain an object, may be filtered out, and then the positioning information of the remaining regions may be optimally adjusted based on the predicted positioning information.
Specifically, the positioning information optimization adjustment manner may be adjusted based on difference information between the predicted positioning information and the initial positioning information, such as a difference value of 3D coordinates of the area, a size difference value, and the like.
For another example, optimal positioning information may be determined based on the predicted positioning information and the initial positioning information, and then the positioning information of the candidate object region may be adjusted to the optimal positioning information; for example, optimal 3D coordinates, length, and width of the region are determined.
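A hedged sketch of this screening-and-adjustment step is given below, assuming the classification branch outputs an objectness score and the regression branch outputs additive corrections to the seven box parameters; both assumptions are for illustration only.

```python
import numpy as np

def refine_candidates(init_boxes, pred_scores, pred_offsets, score_thresh=0.5):
    """Drop candidates predicted as empty, then correct the surviving boxes
    with the predicted localization offsets.

    init_boxes:   (K, 7) initial (x, y, z, l, h, w, angle) candidates
    pred_scores:  (K,) probability that a candidate contains an object
    pred_offsets: (K, 7) predicted corrections to the initial localization
    """
    keep = pred_scores > score_thresh                  # screen by prediction type
    refined = init_boxes[keep] + pred_offsets[keep]    # adjust the localization
    return refined, pred_scores[keep]
```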
In practical application, the object detection area may be identified in the scene image based on the optimized positioning information of the object detection area. For example, referring to fig. 1e, the object detection method provided by the embodiment of the present invention can accurately detect the position, size, and direction of objects on the current road in an automatic driving scene, which facilitates decision-making and judgment in automatic driving.
The object detection provided by the embodiment of the present invention is applicable to various scenarios, such as automatic driving, unmanned aerial vehicles, security monitoring, and the like.
As can be seen from the above, the embodiment of the invention can detect the foreground point from the point cloud of the scene; constructing an object region corresponding to the foreground point based on the foreground point and the preset size to obtain initial positioning information of the candidate object region; extracting features of all points in the point cloud based on the point cloud network to obtain a feature set corresponding to the point cloud; constructing regional characteristic information of the candidate object region based on the characteristic set; predicting the type and the positioning information of the candidate object region based on the region prediction network and the region characteristic information to obtain the predicted type and the predicted positioning information of the candidate object region; the candidate object area is optimized based on the initial positioning information of the candidate area, the prediction type of the candidate object area and the prediction positioning information, and the optimized object detection area and the positioning information of the optimized object detection area are obtained.
In addition, this scheme can generate a candidate detection region for each point, which avoids information loss; moreover, since a candidate region is generated for every foreground point, a corresponding candidate region can be generated for any object, so the influence of object scale variation and severe occlusion can be avoided, improving the effectiveness and success rate of object detection.
In addition, the scheme can also optimize the candidate detection area based on the area characteristics of the candidate detection area; therefore, the accuracy and quality of object detection can be further improved.
The method described in the above examples is further illustrated in detail below by way of example.
In this embodiment, an example in which the object detection apparatus is specifically integrated in a network device will be described.
Firstly, a semantic segmentation network, a point cloud network, and a region prediction network are trained respectively; the training method specifically includes the following steps:
1. Training the semantic segmentation network.
First, a network device may obtain a training set of a semantic segmentation network, where the training set includes sample images labeled with pixel types (e.g., foreground pixels, background pixels, etc.).
The network device may train the semantic segmentation network based on the training set and a loss function. Specifically, semantic segmentation may be performed on the sample images through the semantic segmentation network to obtain the foreground pixels of the sample images, and then the segmented pixel types and the labeled pixel types are converged based on the loss function to obtain the trained semantic segmentation network.
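For instance, the convergence of the segmented pixel types with the labeled pixel types can be driven by a standard cross-entropy loss. The following is a hedged sketch, assuming a generic PyTorch segmentation model and data loader rather than the exact network and optimizer of this embodiment.

```python
import torch
import torch.nn as nn

def train_segmentation(model, loader, epochs=10, lr=1e-3, device="cuda"):
    """Sketch: train a 2D semantic segmentation network on sample images
    labeled with per-pixel types (foreground, background, ...)."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for _ in range(epochs):
        for images, pixel_labels in loader:    # images: (B,3,H,W), labels: (B,H,W)
            logits = model(images.to(device))  # (B, num_types, H, W)
            loss = criterion(logits, pixel_labels.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```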
2. Training the point cloud network.
A network device obtains a training set of a point cloud network, the training set including a sample point cloud of a sample object or scene. The network device may train the point cloud network based on a sample point cloud training set.
3. Training the region prediction network.
The network device acquires a training set of the region prediction network, where the training set may include sample point clouds labeled with object region types and positioning information; the region prediction network is then trained with this training set. Specifically, the object region types and positioning information of the sample point clouds are predicted, the predicted types are converged with the real types, and the predicted positioning information is converged with the real positioning information, so as to obtain the trained region prediction network.
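A sketch of the joint convergence of the predicted type with the real type and of the predicted positioning information with the real positioning information; the loss weighting and the smooth-L1 choice for the regression term are assumptions, not mandated by this embodiment.

```python
import torch
import torch.nn as nn

cls_criterion = nn.CrossEntropyLoss()
reg_criterion = nn.SmoothL1Loss()

def region_prediction_loss(cls_logits, gt_types, pred_boxes, gt_boxes, reg_weight=1.0):
    """cls_logits: (M, num_types), gt_types: (M,) labeled region types,
    pred_boxes / gt_boxes: (M, 7) positioning information, e.g. (x, y, z, l, h, w, angle)."""
    cls_loss = cls_criterion(cls_logits, gt_types)
    # regression is only supervised on regions that actually contain an object
    pos = gt_types > 0
    if pos.any():
        reg_loss = reg_criterion(pred_boxes[pos], gt_boxes[pos])
    else:
        reg_loss = pred_boxes.sum() * 0.0      # keeps the graph valid with no positives
    return cls_loss + reg_weight * reg_loss
```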
The network training may be performed by the network device itself, or the networks may be trained by other devices and then acquired by the network device for use. It should be understood that the networks applied in the embodiment of the present invention are not limited to being trained in the above manner and may be trained in other ways.
Secondly, object detection can be carried out based on the point cloud through the trained semantic segmentation network, point cloud network, and region prediction network; for details, reference may be made to fig. 5a and 5b.
As shown in fig. 5a, a specific flow of an object detection method may be as follows:
501. The network device obtains an image and a point cloud of a scene.
For example, the network device may acquire the image and the point cloud of the scene from an image acquisition device and a point cloud acquisition device, respectively.
502. And the network equipment performs semantic segmentation on the image of the scene by adopting a semantic segmentation network to obtain foreground pixels.
Referring to fig. 5b, taking an automatic driving scene as an example, a road scene image may be collected first, and a 2D semantic segmentation network may be adopted to segment the image of the scene to obtain segmentation results, including foreground pixels, background pixels, and the like.
503. And the network equipment maps the foreground pixel points to the point cloud of the scene to obtain the foreground points in the point cloud.
For example, DeeplabV3 based on Xception may be used as the segmentation network, and the image of the scene is segmented by this network to obtain foreground pixels, such as foreground pixels of cars, pedestrians, and cyclists in automatic driving. Then, the points in the point cloud are projected into the image of the scene, and the segmentation result of the corresponding pixel is taken as the segmentation result of each point, so as to obtain the foreground points in the point cloud. In this way, the foreground points in the point cloud can be detected accurately.
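The projection of point-cloud points into the image and the transfer of the per-pixel segmentation result to each point can be sketched as follows; the camera projection matrix P and the coordinate conventions are assumptions (e.g., a KITTI-style calibration), not part of the original description.

```python
import numpy as np

def label_points_from_segmentation(points_xyz, seg_mask, P):
    """points_xyz : (N, 3) 3D points in the camera coordinate system
    seg_mask   : (H, W) per-pixel segmentation result (1 = foreground, 0 = background)
    P          : (3, 4) camera projection matrix
    Returns a boolean mask marking the foreground points."""
    N = points_xyz.shape[0]
    pts_h = np.hstack([points_xyz, np.ones((N, 1))])       # homogeneous coordinates
    uvw = pts_h @ P.T                                       # project into the image plane
    z = uvw[:, 2]
    u = np.round(uvw[:, 0] / np.maximum(z, 1e-6)).astype(np.int64)
    v = np.round(uvw[:, 1] / np.maximum(z, 1e-6)).astype(np.int64)
    H, W = seg_mask.shape
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    fg = np.zeros(N, dtype=bool)
    fg[valid] = seg_mask[v[valid], u[valid]] == 1           # take the pixel's segmentation result
    return fg
```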
504. The network equipment constructs a three-dimensional object area corresponding to each foreground point based on each foreground point and the preset size to obtain initial positioning information of the candidate object area.
For example, a three-dimensional object region corresponding to the foreground point is generated by taking the foreground point as a central point and according to a preset size.
The positioning information of the candidate object region may include position information, size information, and the like of the candidate object region.
For example, referring to fig. 5b, after the foreground points are obtained, point-based candidate object regions (Point-based Proposal Generation) may be generated by taking each foreground point as a central point and generating an object region of the predetermined size corresponding to that foreground point.
The detailed object region candidates can be referred to fig. 2a to 2b and the related description above.
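A minimal sketch of this point-based proposal generation step: each foreground point is taken as the centre of a 3D region of a predetermined size. The per-class sizes below are illustrative assumptions, not values from this embodiment.

```python
import numpy as np

# predetermined (length, height, width) per object class -- illustrative values only
PRESET_SIZES = {"car": (3.9, 1.6, 1.6), "pedestrian": (0.8, 1.7, 0.6), "cyclist": (1.8, 1.7, 0.6)}

def generate_candidates(foreground_xyz, class_name="car", angle=0.0):
    """foreground_xyz: (M, 3) foreground points.
    Returns (M, 7) initial positioning information (x, y, z, l, h, w, angle)
    of the candidate object regions, one region centred on each foreground point."""
    l, h, w = PRESET_SIZES[class_name]
    M = foreground_xyz.shape[0]
    sizes = np.tile([l, h, w, angle], (M, 1))
    return np.hstack([foreground_xyz, sizes])
```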
505. And the network equipment extracts the features of all the points in the point cloud through the point cloud network to obtain a feature set corresponding to the point cloud.
Referring to fig. 5b, all the points in the point cloud, with shape (B, N, 4), may be input into PointNet++, and the features of the point cloud are extracted by PointNet++ to obtain features of shape (B, N, C), where B is the batch size, N is the number of points, and C is the feature dimension.
The specific point cloud network structure and the feature extraction process may refer to the description of the above embodiments.
506. The network equipment constructs the regional characteristic information of the candidate object region based on the characteristic set.
Referring to fig. 5b, after obtaining the positioning information of the candidate object regions and the features of the point cloud, the network device may generate the region feature information of the candidate object regions based on the point cloud features (i.e., Proposal Feature Generation).
For example, the network device selects a plurality of target points in the candidate object region; extracts the features of the target points from the feature set to obtain the first part of feature information of the candidate object region; standardizes the position information of the target points to obtain standardized position information of the target points; fuses the first part of feature information and the standardized position information to obtain fused feature information of the target points; performs spatial transformation on the fused feature information of the target points to obtain transformed position information of the target points; adjusts the standardized position information of the target points based on the transformed position information to obtain the second part of feature information of the candidate object region; and fuses the first part of feature information and the second part of feature information to obtain the region feature information of the candidate object region.
Specifically, the region feature generation may refer to the description of the above embodiment and fig. 3.
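A hedged sketch of the proposal feature generation described above. The sampling strategy, the feature sizes, and the small transformation network are assumptions used only to make the data flow concrete; they are not the exact operators of this embodiment.

```python
import torch
import torch.nn as nn

class ProposalFeature(nn.Module):
    """Sketch of region feature construction for one candidate object region."""

    def __init__(self, feat_dim=128, num_target_points=512):
        super().__init__()
        self.num_target_points = num_target_points
        # small network producing a spatial transformation from the fused feature
        self.transform = nn.Sequential(
            nn.Linear(feat_dim + 3, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, region_point_xyz, region_point_feat, region_center):
        """region_point_xyz : (K, 3) points falling inside the candidate region
        region_point_feat: (K, C) their features taken from the point-cloud feature set
        region_center    : (3,) centre of the candidate region"""
        # 1. select target points inside the region (random choice as a stand-in)
        idx = torch.randperm(region_point_xyz.shape[0])[: self.num_target_points]
        part1 = region_point_feat[idx]                 # first part feature information
        # 2. standardize the target point positions relative to the region centre
        norm_xyz = region_point_xyz[idx] - region_center
        # 3. fuse first-part features with the standardized positions
        fused = torch.cat([part1, norm_xyz], dim=-1)
        # 4. spatial transformation of the fused features -> transformed position info
        delta = self.transform(fused)
        # 5. adjust the standardized positions -> second part feature information
        part2 = norm_xyz + delta
        # 6. fuse the two parts into the region feature information
        return torch.cat([part1, part2], dim=-1)
```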
507. The network equipment predicts the type and the positioning information of the candidate object area based on the area prediction network and the area characteristic information to obtain the prediction type and the prediction positioning information of the candidate object area.
For example, referring to fig. 5b, the candidate regions may be classified (cls) and regressed (reg) through a bounding-box prediction network (Box Prediction Net), so as to predict the type and regression parameters of each candidate region, where the regression parameters are the predicted positioning information and include parameters such as the three-dimensional coordinates, length, width, height, and orientation, e.g., (x, y, z, l, h, w, angle).
508. And the network equipment optimizes the candidate object region based on the initial positioning information of the candidate region, the prediction type of the candidate object region and the predicted positioning information to obtain an optimized object detection region and positioning information of the optimized object detection region.
For example, the network device may screen the candidate object region based on the predicted type of the candidate object region to obtain a screened object region; and optimizing and adjusting the initial positioning information of the screened object area according to the predicted positioning information of the screened object area to obtain an optimized object detection area and positioning information thereof.
In practical application, the object detection area may be identified in the scene image based on the optimized positioning information of the object detection area. For example, referring to fig. 1e, the object detection method provided by the embodiment of the present invention can accurately detect the position, size, and direction of objects on the current road in an automatic driving scene, which facilitates decision-making and judgment in automatic driving.
In the embodiment of the present invention, the entire point cloud can be used as input, and a PointNet++ structure is used to generate a feature for each point in the point cloud. A candidate region is then generated by taking each point in the point cloud as an anchor point. Finally, with the feature of each point as input, the candidate regions are optimized to generate the final detection result.
Moreover, the capability of the algorithm provided by the embodiments of the present invention has been tested on several datasets. For example, it was evaluated on an open-source autonomous driving dataset, the KITTI dataset, which contains objects of various sizes and distances at the same time and is therefore very challenging. The algorithm of the embodiment of the present invention surpasses all existing 3D object detection algorithms on KITTI, achieves a new state of the art, and leads by a particularly large margin on the hard difficulty subset.
On the KITTI dataset, the point clouds of 7481 training images and 7518 test images covering three classes (car, pedestrian, and cyclist) were used. The widely adopted Average Precision (AP) metric was used for comparison with other methods, including MV3D (Multi-View 3D Object Detection), AVOD (Aggregate View Object Detection), VoxelNet, F-PointNet (Frustum PointNet), and AVOD-FPN (AVOD with a Feature Pyramid Network). The test results are shown in fig. 5c. It can be seen that the accuracy of the object detection method provided by the embodiment of the present invention is significantly higher than that of the other methods.
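As a reference for the metric, a sketch of 11-point interpolated Average Precision is given below; this is a common formulation of AP, and the exact KITTI evaluation protocol (IoU thresholds, difficulty splits) may differ from this simplified version.

```python
import numpy as np

def average_precision_11pt(recalls, precisions):
    """11-point interpolated Average Precision over a precision-recall curve.
    recalls, precisions: 1D arrays of matching length."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recalls >= r
        p = precisions[mask].max() if mask.any() else 0.0   # interpolated precision at recall r
        ap += p / 11.0
    return ap
```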
In order to better implement the method, correspondingly, an embodiment of the present invention further provides an object detection device, where the object detection device may be specifically integrated in a network device, and the network device may be a server, a terminal, a vehicle-mounted device, an unmanned aerial vehicle, or a micro processing box.
For example, as shown in fig. 6a, the object detection apparatus may include a detection unit 601, a region construction unit 602, a feature extraction unit 603, a feature construction unit 604, a prediction unit 605, and an optimization unit 606, as follows:
a detection unit 601, configured to detect a foreground point from a point cloud of a scene;
an area construction unit 602, configured to construct an object area corresponding to the foreground point based on the foreground point and a predetermined size, and obtain initial positioning information of a candidate object area;
a feature extraction unit 603, configured to perform feature extraction on all points in the point cloud based on a point cloud network to obtain a feature set corresponding to the point cloud;
a feature construction unit 604, configured to construct regional feature information of the candidate object region based on the feature set;
a predicting unit 605, configured to predict the type and the positioning information of the candidate object region based on a region prediction network and the region feature information, so as to obtain a predicted type and predicted positioning information of the candidate object region;
and the optimizing unit 606 is configured to perform optimization processing on the candidate object region based on the initial positioning information of the candidate region, the prediction type of the candidate object region, and the prediction positioning information, so as to obtain an optimized object detection region and positioning information thereof.
In an embodiment, the detecting unit 601 may be configured to: performing semantic segmentation on the image of the scene to obtain foreground pixels; and determining a point corresponding to the foreground pixel in the point cloud of the scene as a foreground point.
In an embodiment, the region construction unit 602 may be specifically configured to generate the object region corresponding to the foreground point by using the foreground point as a central point and according to a predetermined size.
In an embodiment, referring to fig. 6b, the feature constructing unit 604 may specifically include:
a selection subunit 6041 configured to select a plurality of target points in the object candidate region;
an extracting sub-unit 6042, configured to extract a feature of the target point from the feature set, and obtain first partial feature information of the candidate object region;
a construction subunit 6043 configured to construct second partial feature information of the candidate object region based on the position information of the target point;
a fusion subunit 6045, configured to fuse the first partial feature information and the second partial feature information to obtain region feature information of the candidate region.
In an embodiment, the building sub-unit 6043 may be specifically configured to:
standardizing the position information of the target point to obtain standardized position information of the target point;
fusing the first part of feature information and the standardized position information to obtain fused feature information of a target point;
performing spatial transformation on the fused feature information of the target point to obtain transformed position information;
and adjusting the standardized position information of the target point based on the transformed position information to obtain second part feature information of the candidate object region.
In an embodiment, referring to fig. 6c, the point cloud network comprises a first sampling network and a second sampling network connected with the first sampling network; the feature extraction unit 603 may include:
a downsampling subunit 6031, configured to perform feature downsampling on all points in the point cloud through the first sampling network to obtain initial features of the point cloud;
an upsampling sub-unit 6032, configured to perform upsampling on the initial feature through the second sampling network to obtain a feature set of the point cloud.
In one embodiment, the first sampling network comprises a plurality of sequentially connected set abstraction layers, and the second sampling network comprises a plurality of sequentially connected feature propagation layers corresponding to the set abstraction layers;
the down-sampling sub-unit 6031 may specifically be configured to:
sequentially carrying out local area division on points in the point cloud through the set abstraction layer, and extracting the characteristics of the central points of the local areas to obtain the initial characteristics of the point cloud;
inputting the initial features of the point cloud to a second sampling network;
the upsampling sub-unit 6032 may specifically be configured to:
determining the output characteristics of the previous layer and the output characteristics of the set abstraction layer corresponding to the current characteristic propagation layer as the current input characteristics of the current characteristic propagation layer;
and performing up-sampling operation on the current input features through the current feature propagation layer to obtain a feature set of the point cloud.
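As a rough sketch of this encoder–decoder structure (set abstraction layers that down-sample, feature propagation layers that up-sample, each propagation layer also receiving the output of its corresponding set abstraction layer as a skip connection), the data flow can be written as follows. The layer implementations are deliberately simplified stand-ins, not the PointNet++ operators themselves.

```python
import torch
import torch.nn as nn

class TinyPointBackbone(nn.Module):
    """Simplified stand-in for the first/second sampling networks: one set
    abstraction (SA) stage and one feature propagation (FP) stage with a skip link."""

    def __init__(self, in_dim=4, mid_dim=64, out_dim=128):
        super().__init__()
        self.sa_mlp = nn.Sequential(nn.Linear(in_dim, mid_dim), nn.ReLU())
        self.fp_mlp = nn.Sequential(nn.Linear(mid_dim + mid_dim, out_dim), nn.ReLU())

    def forward(self, points):                      # points: (B, N, 4)
        B, N, _ = points.shape
        sa_full = self.sa_mlp(points)               # per-point features before down-sampling
        keep = torch.randperm(N)[: N // 4]          # stand-in for farthest point sampling
        sa_down = sa_full[:, keep, :]               # (B, N/4, mid_dim): "initial features"
        # FP stage: up-sample back to N points (naive pooled broadcast as a stand-in)
        up = sa_down.mean(dim=1, keepdim=True).expand(B, N, sa_down.shape[-1])
        # skip connection: concatenate with the corresponding SA-layer output
        feat = self.fp_mlp(torch.cat([up, sa_full], dim=-1))
        return feat                                 # (B, N, out_dim) point-cloud feature set
```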
In one embodiment, the region prediction network comprises a feature extraction network, a classification network connected to the feature extraction network, and a regression network connected to the feature extraction network; referring to fig. 6d, the prediction unit 605 may specifically include:
a global feature extraction subunit 6051, configured to perform feature extraction on the region feature information through the feature extraction network to obtain global feature information of the candidate object region;
a classification subunit 6052, configured to classify the candidate object region based on the classification network and the global feature information, so as to obtain a prediction type of the candidate region;
and a regression subunit 6053, configured to perform positioning on the candidate object region based on the regression network and the global feature information, to obtain predicted positioning information of the candidate region.
In one embodiment, the feature extraction network comprises: the system comprises a plurality of sequentially connected set abstraction layers, a classification network and a regression network, wherein the classification network comprises a plurality of sequentially connected full-connection layers, and the regression network comprises a plurality of sequentially connected full-connection layers; the global feature extraction subunit 6051 is configured to perform feature extraction on the region feature information sequentially through a set abstraction layer in the feature extraction network, to obtain global feature information of the candidate object region.
In an embodiment, referring to fig. 6e, the optimizing unit 606 may specifically include:
a screening subunit 6061, configured to screen the candidate object region based on the prediction type of the candidate object region, to obtain a screened object region;
and an optimization subunit 6062, configured to perform optimization and adjustment on the initial positioning information of the screened object region according to the predicted positioning information of the screened object region, to obtain an optimized object detection region and positioning information thereof.
In a specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and the specific implementation of the above units may refer to the foregoing method embodiments, which are not described herein again.
As can be seen from the above, the object detection apparatus of the present embodiment can detect the foreground point from the point cloud of the scene through the detection unit 601; then, an object region corresponding to the foreground point is constructed by a region construction unit 602 based on the foreground point and a predetermined size, so as to obtain initial positioning information of the candidate object region; extracting the features of all points in the point cloud by a feature extraction unit 603 based on a point cloud network to obtain a feature set corresponding to the point cloud; constructing, by the feature construction unit 604, regional feature information of the candidate object region based on the feature set; predicting the type and the positioning information of the candidate object region by a predicting unit 605 based on a region prediction network and the region feature information to obtain the predicted type and the predicted positioning information of the candidate object region; the optimization unit 606 performs optimization processing on the candidate object region based on the initial positioning information of the candidate region, the prediction type of the candidate object region, and the prediction positioning information, to obtain an optimized object detection region and positioning information thereof. According to the scheme, the point cloud data of the scene can be adopted for object detection, the candidate detection area can be generated for each point, and the candidate detection area is optimized based on the area characteristics of the candidate detection area; therefore, the accuracy of object detection can be greatly improved, and the method is particularly suitable for 3D object detection.
In addition, an embodiment of the present invention further provides a network device, as shown in fig. 7, which shows a schematic structural diagram of the network device according to the embodiment of the present invention, specifically:
the network device may include components such as a processor 701 of one or more processing cores, memory 702 of one or more computer-readable storage media, a power supply 703, and an input unit 704. Those skilled in the art will appreciate that the network device architecture shown in fig. 7 does not constitute a limitation of network devices and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 701 is a control center of the network device, connects various parts of the entire network device by using various interfaces and lines, and performs various functions of the network device and processes data by running or executing software programs and/or modules stored in the memory 702 and calling data stored in the memory 702, thereby performing overall monitoring of the network device. Optionally, processor 701 may include one or more processing cores; preferably, the processor 701 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 701.
The memory 702 may be used to store software programs and modules, and the processor 701 executes various functional applications and data processing by operating the software programs and modules stored in the memory 702. The memory 702 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to use of the network device, and the like. Further, the memory 702 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 702 may also include a memory controller to provide the processor 701 with access to the memory 702.
The network device further includes a power source 703 for supplying power to each component, and preferably, the power source 703 may be logically connected to the processor 701 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system. The power supply 703 may also include any component including one or more of a dc or ac power source, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The network device may also include an input unit 704, the input unit 704 being operable to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the network device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 701 in the network device loads the executable file corresponding to the process of one or more application programs into the memory 702 according to the following instructions, and the processor 701 runs the application program stored in the memory 702, thereby implementing various functions as follows:
detecting a foreground point from a point cloud of a scene; constructing an object region corresponding to the foreground point based on the foreground point and a preset size to obtain initial positioning information of the candidate object region; extracting features of all points in the point cloud based on a point cloud network to obtain a feature set corresponding to the point cloud; constructing regional characteristic information of the candidate object region based on the characteristic set; predicting the type and the positioning information of the candidate object region based on a region prediction network and the region characteristic information to obtain the predicted type and the predicted positioning information of the candidate object region; and optimizing the candidate object region based on the initial positioning information of the candidate region, the prediction type of the candidate object region and the prediction positioning information to obtain an optimized object detection region and positioning information thereof.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
As can be seen from the above, the network device of this embodiment detects a foreground point from the point cloud of the scene; constructing an object region corresponding to the foreground point based on the foreground point and a preset size to obtain initial positioning information of a candidate object region; extracting features of all points in the point cloud based on a point cloud network to obtain a feature set corresponding to the point cloud; constructing regional characteristic information of the candidate object region based on the characteristic set; predicting the type and the positioning information of the candidate object region based on a region prediction network and the region characteristic information to obtain the predicted type and the predicted positioning information of the candidate object region; and optimizing the candidate object region based on the initial positioning information of the candidate region, the prediction type of the candidate object region and the predicted positioning information to obtain an optimized object detection region and positioning information thereof. According to the scheme, the point cloud data of the scene can be adopted for object detection, the candidate detection area can be generated for each point, and the candidate detection area is optimized based on the area characteristics of the candidate detection area; therefore, the accuracy of object detection can be greatly improved, and the method is particularly suitable for 3D object detection.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present invention further provide a storage medium, in which a plurality of instructions are stored, where the instructions can be loaded by a processor to execute the steps in any one of the object detection methods provided by the embodiments of the present invention. For example, the instructions may perform the steps of:
detecting foreground points from the point cloud of the scene; constructing an object region corresponding to the foreground point based on the foreground point and a preset size to obtain initial positioning information of the candidate object region; extracting features of all points in the point cloud based on a point cloud network to obtain a feature set corresponding to the point cloud; constructing regional characteristic information of the candidate object region based on the characteristic set; predicting the type and the positioning information of the candidate object region based on a region prediction network and the region characteristic information to obtain the predicted type and the predicted positioning information of the candidate object region; and optimizing the candidate object region based on the initial positioning information of the candidate region, the prediction type of the candidate object region and the predicted positioning information to obtain an optimized object detection region and positioning information thereof.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute the steps in any object detection method provided in the embodiments of the present invention, the beneficial effects that can be achieved by any object detection method provided in the embodiments of the present invention can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The object detection method, apparatus, network device and storage medium provided by the embodiments of the present invention are described in detail above, and a specific example is applied in the present disclosure to explain the principle and the implementation of the present invention, and the description of the above embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for those skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (7)

1. An object detection method, comprising:
detecting a foreground point from a point cloud of a scene, wherein the point cloud comprises a point set of scene or target surface characteristics;
constructing an object region corresponding to the foreground point based on the foreground point and a preset size to obtain initial positioning information of the candidate object region, wherein the initial positioning information of the object comprises position information and size information of the candidate object region;
selecting target points for all points in the point cloud through a sampling layer of a first sampling network, and defining the centroid of the candidate object area through the target points;
constructing, by a multi-scale grouping layer of the first sampling network, a set of local regions based on the centroid;
encoding the set of local regions into initial features by a point cloud network layer of the first sampling network;
performing up-sampling operation on the initial features through a second sampling network to obtain a feature set of the point cloud;
selecting a plurality of target points in the candidate object region;
extracting the features of the target point from the feature set to obtain first part feature information of the candidate object region;
standardizing the position information of the target point to obtain standardized position information of the target point;
fusing the first part of feature information and the standardized position information to obtain fused feature information of a target point;
performing spatial transformation on the fused feature information of the target point to obtain transformed position information;
based on the transformed position information, adjusting the standardized position information of the target point to obtain second part feature information of the candidate object region;
fusing the first part of feature information and the second part of feature information to obtain region feature information of the candidate object region;
carrying out feature extraction on the regional feature information through a feature extraction network of a regional prediction network to obtain global feature information of a candidate object region, wherein the feature extraction network comprises a plurality of sequentially connected set abstraction layers;
classifying the candidate object region based on a classification network of a regional prediction network and the global feature information to obtain a prediction type of the candidate object region, wherein the classification network comprises a plurality of fully-connected layers which are connected in sequence;
positioning the candidate object region based on a regression network of a regional prediction network and the global feature information to obtain predicted positioning information of the candidate object region, wherein the regression network comprises a plurality of fully-connected layers for regression, and the predicted positioning information comprises predicted position information and predicted size information;
screening the candidate object region based on the prediction type of the candidate object region to obtain a screened object region;
optimizing and adjusting initial positioning information of the screened object area according to the predicted positioning information of the screened object area to obtain an optimized object detection area and positioning information thereof, wherein optimizing and adjusting the initial positioning information of the screened object area according to the predicted positioning information of the screened object area to obtain the optimized object detection area and the positioning information thereof comprises the following steps:
and adjusting based on difference information between the predicted positioning information and the initial positioning information to obtain an optimized object detection area and positioning information thereof, wherein the difference information comprises position difference information and size difference information, and the position difference information comprises three-dimensional coordinate difference information.
2. The object detection method of claim 1, wherein detecting a foreground point from a point cloud of a scene comprises:
performing semantic segmentation on the image of the scene to obtain foreground pixels;
and determining points corresponding to the foreground pixels in the point cloud of the scene as foreground points.
3. The object detection method of claim 1, wherein constructing the object region corresponding to the foreground point based on the foreground point and a predetermined size comprises: and generating an object area corresponding to the foreground point by taking the foreground point as a central point according to a preset size.
4. The object detection method of claim 1, wherein the feature extraction network comprises a plurality of sequentially connected set abstraction layers, the classification network comprises a plurality of sequentially connected fully-connected layers, and the regression network comprises a plurality of sequentially connected fully-connected layers;
performing feature extraction on the region feature information through the feature extraction network to obtain global feature information of the candidate object region, wherein the feature extraction network comprises the following steps: and sequentially carrying out feature extraction on the regional feature information through a set abstraction layer in the feature extraction network to obtain the global feature information of the candidate object region.
5. An object detecting device, comprising:
the device comprises a detection unit, a processing unit and a processing unit, wherein the detection unit is used for detecting a foreground point from a point cloud of a scene, and the point cloud comprises a point set of the surface characteristics of the scene or a target;
the area construction unit is used for constructing an object area corresponding to the foreground point based on the foreground point and a preset size to obtain initial positioning information of the candidate object area, wherein the initial positioning information of the object comprises position information and size information of the candidate object area;
the target point selection unit is used for selecting target points for all points in the point cloud through a sampling layer of a first sampling network, and defining the centroid of the candidate object area through the target points;
a construction unit for constructing a set of local regions based on the centroid through a multi-scale grouping layer of the first sampling network;
an encoding unit, configured to encode the local region set into an initial feature through a point cloud network layer of the first sampling network;
the up-sampling unit is used for performing up-sampling operation on the initial features through a second sampling network to obtain a feature set of the point cloud;
a selection unit configured to select a plurality of target points in the object candidate region;
an extracting unit, configured to extract a feature of the target point from the feature set, to obtain first partial feature information of the candidate object region;
the standardization unit is used for standardizing the position information of the target point to obtain the standardized position information of the target point;
the fusion unit is used for fusing the first part of feature information and the standardized position information to obtain fused feature information of a target point;
the spatial transformation unit is used for carrying out spatial transformation on the fused feature information of the target point to obtain transformed position information;
the information adjusting unit is used for adjusting the standardized position information of the target point based on the transformed position information to obtain second part characteristic information of the candidate object area;
an information fusion unit, configured to fuse the first part of feature information and the second part of feature information to obtain region feature information of the candidate object region;
the information feature extraction unit is used for extracting the features of the region feature information through a feature extraction network of a regional prediction network to obtain global feature information of a candidate object region, wherein the feature extraction network comprises a plurality of sequentially connected set abstraction layers;
the classification unit is used for classifying the candidate object regions based on a classification network of a regional prediction network and the global feature information to obtain the prediction types of the candidate object regions, wherein the classification network comprises a plurality of sequentially connected fully-connected layers;
the positioning unit is used for positioning the candidate object region based on a regression network of a regional prediction network and the global feature information to obtain predicted positioning information of the candidate object region, wherein the regression network comprises a plurality of fully-connected layers for regression, and the predicted positioning information comprises predicted position information and predicted size information;
the screening unit is used for screening the candidate object region based on the prediction type of the candidate object region to obtain a screened object region;
the optimizing unit is used for optimizing and adjusting the initial positioning information of the screened object region according to the predicted positioning information of the screened object region to obtain an optimized object detection region and positioning information thereof, wherein the optimizing and adjusting the initial positioning information of the screened object region according to the predicted positioning information of the screened object region to obtain the optimized object detection region and positioning information thereof, and comprises:
and adjusting based on difference information between the predicted positioning information and the initial positioning information to obtain an optimized object detection area and positioning information thereof, wherein the difference information comprises position difference information and size difference information, and the position difference information comprises three-dimensional coordinate difference information.
6. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the object detection method according to any one of claims 1 to 4.
7. A network device comprising a memory and a processor; the memory stores a plurality of instructions, and the processor loads the instructions in the memory to perform the steps of the object detection method according to any one of claims 1 to 4.
CN201910267019.5A 2019-04-03 2019-04-03 Object detection method, device, network equipment and storage medium Active CN110032962B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910267019.5A CN110032962B (en) 2019-04-03 2019-04-03 Object detection method, device, network equipment and storage medium
PCT/CN2020/077721 WO2020199834A1 (en) 2019-04-03 2020-03-04 Object detection method and apparatus, and network device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910267019.5A CN110032962B (en) 2019-04-03 2019-04-03 Object detection method, device, network equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110032962A CN110032962A (en) 2019-07-19
CN110032962B true CN110032962B (en) 2022-07-08

Family

ID=67237387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910267019.5A Active CN110032962B (en) 2019-04-03 2019-04-03 Object detection method, device, network equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110032962B (en)
WO (1) WO2020199834A1 (en)

Families Citing this family (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032962B (en) * 2019-04-03 2022-07-08 腾讯科技(深圳)有限公司 Object detection method, device, network equipment and storage medium
CN110400304B (en) * 2019-07-25 2023-12-12 腾讯科技(深圳)有限公司 Object detection method, device, equipment and storage medium based on deep learning
JPWO2021024805A1 (en) * 2019-08-06 2021-02-11
CN110837789B (en) * 2019-10-31 2023-01-20 北京奇艺世纪科技有限公司 Method and device for detecting object, electronic equipment and medium
EP4073688A4 (en) * 2019-12-12 2023-01-25 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Target detection method, device, terminal device, and medium
CN111144304A (en) * 2019-12-26 2020-05-12 上海眼控科技股份有限公司 Vehicle target detection model generation method, vehicle target detection method and device
CN111209840B (en) * 2019-12-31 2022-02-18 浙江大学 3D target detection method based on multi-sensor data fusion
CN111145174B (en) * 2020-01-02 2022-08-09 南京邮电大学 3D target detection method for point cloud screening based on image semantic features
CN110807461B (en) * 2020-01-08 2020-06-02 深圳市越疆科技有限公司 Target position detection method
CN111260773B (en) * 2020-01-20 2023-10-13 深圳市普渡科技有限公司 Three-dimensional reconstruction method, detection method and detection system for small obstacle
CN111340766B (en) * 2020-02-21 2024-06-11 北京市商汤科技开发有限公司 Target object detection method, device, equipment and storage medium
CN113496160B (en) * 2020-03-20 2023-07-11 百度在线网络技术(北京)有限公司 Three-dimensional object detection method, three-dimensional object detection device, electronic equipment and storage medium
CN111444839B (en) * 2020-03-26 2023-09-08 北京经纬恒润科技股份有限公司 Target detection method and system based on laser radar
CN111578951B (en) * 2020-04-30 2022-11-08 阿波罗智能技术(北京)有限公司 Method and device for generating information in automatic driving
CN112215861A (en) * 2020-09-27 2021-01-12 深圳市优必选科技股份有限公司 Football detection method and device, computer readable storage medium and robot
CN112183330B (en) * 2020-09-28 2022-06-28 北京航空航天大学 Target detection method based on point cloud
WO2022126523A1 (en) * 2020-12-17 2022-06-23 深圳市大疆创新科技有限公司 Object detection method, device, movable platform, and computer-readable storage medium
CN112598635B (en) * 2020-12-18 2024-03-12 武汉大学 Point cloud 3D target detection method based on symmetric point generation
CN112633376A (en) * 2020-12-24 2021-04-09 南京信息工程大学 Point cloud data ground feature classification method and system based on deep learning and storage medium
CN112734931B (en) * 2020-12-31 2021-12-07 罗普特科技集团股份有限公司 Method and system for assisting point cloud target detection
CN112766170B (en) * 2021-01-21 2024-04-16 广西财经学院 Self-adaptive segmentation detection method and device based on cluster unmanned aerial vehicle image
CN112862017B (en) * 2021-04-01 2023-08-01 北京百度网讯科技有限公司 Point cloud data labeling method, device, equipment and medium
CN113205531B (en) * 2021-04-30 2024-03-08 北京云圣智能科技有限责任公司 Three-dimensional point cloud segmentation method, device and server
CN113312983B (en) * 2021-05-08 2023-09-05 华南理工大学 Semantic segmentation method, system, device and medium based on multi-mode data fusion
CN113240656B (en) * 2021-05-24 2023-04-07 浙江商汤科技开发有限公司 Visual positioning method and related device and equipment
CN113674348B (en) * 2021-05-28 2024-03-15 中国科学院自动化研究所 Object grabbing method, device and system
CN113256793A (en) * 2021-05-31 2021-08-13 浙江科技学院 Three-dimensional data processing method and system
CN113807350A (en) * 2021-09-13 2021-12-17 上海芯物科技有限公司 Target detection method, device, equipment and storage medium
CN114372944B (en) * 2021-12-30 2024-05-17 深圳大学 Multi-mode and multi-scale fused candidate region generation method and related device
CN114359561A (en) * 2022-01-10 2022-04-15 北京百度网讯科技有限公司 Target detection method and training method and device of target detection model
CN114092478B (en) * 2022-01-21 2022-04-29 合肥中科类脑智能技术有限公司 Anomaly detection method
CN114549958B (en) * 2022-02-24 2023-08-04 四川大学 Night and camouflage target detection method based on context information perception mechanism
CN114820465B (en) * 2022-04-06 2024-04-26 合众新能源汽车股份有限公司 Point cloud detection model training method and device, electronic equipment and storage medium
CN115937644B (en) * 2022-12-15 2024-01-02 清华大学 Point cloud feature extraction method and device based on global and local fusion
CN116229388B (en) * 2023-03-27 2023-09-12 哈尔滨市科佳通用机电股份有限公司 Method, system and equipment for detecting motor car foreign matters based on target detection network
CN116912488B (en) * 2023-06-14 2024-02-13 中国科学院自动化研究所 Three-dimensional panorama segmentation method and device based on multi-view camera
CN116912238B (en) * 2023-09-11 2023-11-28 湖北工业大学 Weld joint pipeline identification method and system based on multidimensional identification network cascade fusion
CN117475397B (en) * 2023-12-26 2024-03-22 安徽蔚来智驾科技有限公司 Target annotation data acquisition method, medium and device based on multi-mode sensor

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2526342A (en) * 2014-05-22 2015-11-25 Nokia Technologies Oy Point cloud matching method
WO2017155970A1 (en) * 2016-03-11 2017-09-14 Kaarta, Inc. Laser scanner with real-time, online ego-motion estimation
CN108010036B (en) * 2017-11-21 2020-01-21 江南大学 Object symmetry axis detection method based on RGB-D camera
CN109242951A (en) * 2018-08-06 2019-01-18 宁波盈芯信息科技有限公司 A kind of face's real-time three-dimensional method for reconstructing
CN109345510A (en) * 2018-09-07 2019-02-15 百度在线网络技术(北京)有限公司 Object detecting method, device, equipment, storage medium and vehicle
CN109410238B (en) * 2018-09-20 2021-10-26 中国科学院合肥物质科学研究院 Wolfberry identification and counting method based on PointNet + + network
CN109410307B (en) * 2018-10-16 2022-09-20 大连理工大学 Scene point cloud semantic segmentation method
CN109523552B (en) * 2018-10-24 2021-11-02 青岛智能产业技术研究院 Three-dimensional object detection method based on viewing cone point cloud
CN109543601A (en) * 2018-11-21 2019-03-29 电子科技大学 A kind of unmanned vehicle object detection method based on multi-modal deep learning
CN110032962B (en) * 2019-04-03 2022-07-08 腾讯科技(深圳)有限公司 Object detection method, device, network equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Charles R. Qi et al. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. Computer Vision and Pattern Recognition, 2017. *

Also Published As

Publication number Publication date
CN110032962A (en) 2019-07-19
WO2020199834A1 (en) 2020-10-08

Similar Documents

Publication Publication Date Title
CN110032962B (en) Object detection method, device, network equipment and storage medium
JP7179186B2 (en) OBJECT DETECTION METHOD, APPARATUS, ELECTRONIC DEVICE, AND COMPUTER PROGRAM
JP7430277B2 (en) Obstacle detection method and apparatus, computer device, and computer program
US10078790B2 (en) Systems for generating parking maps and methods thereof
US10984659B2 (en) Vehicle parking availability map systems and methods
US9142011B2 (en) Shadow detection method and device
CN113378686B (en) Two-stage remote sensing target detection method based on target center point estimation
CN113706480B (en) Point cloud 3D target detection method based on key point multi-scale feature fusion
Zhong et al. Multi-scale feature fusion network for pixel-level pavement distress detection
Paz et al. Probabilistic semantic mapping for urban autonomous driving applications
EP4174792A1 (en) Method for scene understanding and semantic analysis of objects
CN113033516A (en) Object identification statistical method and device, electronic equipment and storage medium
KR20200129314A (en) Object detection in very high-resolution aerial images feature pyramid network
Tutzauer et al. Semantic urban mesh enhancement utilizing a hybrid model
Laupheimer et al. The importance of radiometric feature quality for semantic mesh segmentation
Rahman et al. LVLane: deep learning for lane detection and classification in challenging conditions
EP3764335A1 (en) Vehicle parking availability map systems and methods
CN116662930A (en) Road identification generation method and system based on ground mobile laser radar
CN116664851A (en) Automatic driving data extraction method based on artificial intelligence
CN114882490B (en) Unlimited scene license plate detection and classification method based on point-guided positioning
CN113514053B (en) Method and device for generating sample image pair and method for updating high-precision map
WO2022205018A1 (en) License plate character recognition method and apparatus, and device and storage medium
Mori et al. Classification of pole-like objects using point clouds and images captured by mobile mapping systems
CN113505834A (en) Method for training detection model, determining image updating information and updating high-precision map
CN113093798B (en) Unmanned aerial vehicle inspection risk avoiding method and device based on electric field distribution and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant