WO2020199834A1 - Object detection method, apparatus, network device, and storage medium - Google Patents

Object detection method, apparatus, network device, and storage medium

Info

Publication number
WO2020199834A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
area
network
information
candidate object
Prior art date
Application number
PCT/CN2020/077721
Other languages
English (en)
French (fr)
Inventor
杨泽同
孙亚楠
贾佳亚
戴宇荣
沈小勇
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Publication of WO2020199834A1 publication Critical patent/WO2020199834A1/zh

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/255 - Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/60 - Type of objects
    • G06V20/64 - Three-dimensional objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30248 - Vehicle exterior or interior
    • G06T2207/30252 - Vehicle exterior; Vicinity of vehicle
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/06 - Recognition of objects for industrial automation

Definitions

  • This application relates to the field of artificial intelligence technology, specifically to object detection technology.
  • Object detection refers to determining the location and category of objects in a scene.
  • object detection technology has been widely used in various scenarios, such as autonomous driving and drones.
  • the current object detection scheme generally collects scene images, extracts features from the scene images, and then determines the position and category of the object in the scene image based on the extracted features.
  • However, the current object detection scheme suffers from problems such as low object detection accuracy, especially in 3D object detection scenarios.
  • the embodiments of the present application provide an object detection method, device, network device, and storage medium, which can improve the accuracy of object detection.
  • the embodiment of the present application provides an object detection method, which is executed by a network device, and includes:
  • the candidate object area is optimized to obtain the target object detection area and the positioning information of the target object detection area.
  • an embodiment of the present application also provides an object detection device, including:
  • the detection unit is used to detect foreground points from the point cloud of the scene;
  • an area construction unit configured to construct a candidate object area corresponding to each foreground point based on the foreground point and a predetermined size, to obtain the initial positioning information of the candidate object area;
  • a feature extraction unit configured to perform feature extraction on all points in the point cloud based on a point cloud network to obtain a feature set corresponding to the point cloud;
  • a feature construction unit configured to construct the area feature information of the candidate object area based on the feature set
  • the prediction unit is configured to predict the type and location information of the candidate object area based on the area prediction network and the area feature information, and obtain the prediction type and predicted location information of the candidate object area;
  • the optimization unit is used to optimize the candidate object area based on the initial positioning information of the candidate object area, the prediction type of the candidate object area, and the predicted positioning information to obtain the target object detection area and the positioning information of the target object detection area.
  • An embodiment of the present application also provides a network device, including a memory and a processor; the memory stores multiple instructions, and the processor loads the instructions in the memory to execute the steps in any object detection method provided in the embodiments of the present application.
  • an embodiment of the present application further provides a storage medium that stores a plurality of instructions, and the instructions are suitable for loading by a processor to execute the steps in any object detection method provided in the embodiments of the present application.
  • embodiments of the present application also provide a computer program product, including instructions, which when run on a computer, cause the computer to execute the steps in any object detection method provided in the embodiments of the present application.
  • the embodiment of the present application can detect foreground points from the point cloud of the scene; construct the candidate object area corresponding to each foreground point based on the foreground point and a predetermined size, and determine the initial positioning information of the candidate object area; perform feature extraction on all points in the point cloud to obtain the feature set corresponding to the point cloud; construct the region feature information of the candidate object area based on the feature set; predict the type and positioning information of the candidate object area based on the region prediction network and the region feature information to obtain the predicted type and predicted positioning information of the candidate object area; and optimize the candidate object area based on its initial positioning information, predicted type, and predicted positioning information to obtain the target object detection area and the positioning information of the target object detection area.
  • this solution can use the point cloud data of the scene for object detection, generate a candidate object region for each foreground point in the point cloud, and optimize the candidate object regions based on their region features; therefore, it can greatly improve the accuracy of object detection, and the detection effect is significantly improved, especially for 3D object detection.
  • FIG. 1a is a schematic diagram of a scene of an object detection method provided by an embodiment of the present application.
  • Figure 1b is a flowchart of an object detection method provided by an embodiment of the present application.
  • Figure 1c is a schematic structural diagram of a point cloud network provided by an embodiment of the present application.
  • Figure 1d is a schematic diagram of the PointNet++ network structure provided by an embodiment of the present application.
  • Figure 1e is a schematic diagram of an object detection effect in an automatic driving scene provided by an embodiment of the present application.
  • Figure 2a is a schematic diagram of image semantic segmentation provided by an embodiment of the present application.
  • FIG. 2b is a schematic diagram of point cloud segmentation provided by an embodiment of the present application.
  • Figure 2c is a schematic diagram of candidate region generation provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of feature construction of candidate regions provided by an embodiment of the present application.
  • Figure 4a is a schematic structural diagram of a regional prediction network provided by an embodiment of the present application.
  • FIG. 4b is another schematic diagram of the structure of the regional prediction network provided by an embodiment of the present application.
  • FIG. 5a is a schematic diagram of another process of object detection provided by an embodiment of the present application.
  • FIG. 5b is an architecture diagram of object detection provided by an embodiment of the present application.
  • FIG. 5c is a schematic diagram of test experiment results provided by an embodiment of the present application.
  • Figure 6a is a schematic structural diagram of an object detection device provided by an embodiment of the present application.
  • Figure 6b is another schematic diagram of the structure of the object detection device provided by the embodiment of the present application.
  • FIG. 6c is another schematic diagram of the structure of the object detection device provided by the embodiment of the present application.
  • FIG. 6d is another schematic structural diagram of the object detection device provided by the embodiment of the present application.
  • Figure 6e is another schematic diagram of the structure of the object detection device provided by the embodiment of the present application.
  • Fig. 7 is a schematic structural diagram of a network device provided by an embodiment of the present application.
  • the embodiments of the present application provide an object detection method, device, network device, and storage medium.
  • the object detection device may be integrated in a network device, and the network device may be a server, a terminal, or another device; for example, the network device may include equipment such as a vehicle-mounted device or a micro-processing box.
  • the so-called object detection refers to determining or recognizing the location and category of objects in a scene, for example, recognizing the category and location of objects in a road scene, such as street lights, vehicles and their locations.
  • an embodiment of the present application provides an object detection system including a network device and a collection device; the network device and the collection device are communicatively connected, for example, through a wired or wireless network.
  • the network device and the collection device may be integrated into one device.
  • the collection device can be used to collect point cloud data or image data of the scene.
  • the collection device can upload the collected point cloud data to a network device for processing.
  • the network device can be used for object detection; specifically, it can detect foreground points from the point cloud of the scene; construct the candidate object area corresponding to each foreground point based on the foreground point and a predetermined size to obtain the initial positioning information of the candidate object area; perform feature extraction on all points in the point cloud based on the point cloud network to obtain the feature set corresponding to the point cloud; construct the region feature information of the candidate object area based on the feature set; predict the type and positioning information of the candidate object area based on the region prediction network and region feature information to obtain the predicted type and predicted positioning information of the candidate object area; and optimize the candidate object area based on its initial positioning information, predicted type, and predicted positioning information to obtain the target object detection area and its positioning information.
  • the detected objects can be identified in the scene image according to the location information.
  • the detected objects can be selected in the scene image in the form of a detection frame.
  • the type of the detected object may also be identified in the scene image.
  • the object detection device can be integrated in a network device.
  • the network device can be a server or a terminal; the terminal can include devices such as a mobile phone, a tablet, a notebook computer, a personal computer (PC), or a micro-processing terminal.
  • An object detection method provided by an embodiment of the present application may be executed by a processor of a network device. As shown in FIG. 1b, the specific process of the object detection method may be as follows:
  • a point cloud is a collection of points characterizing the surface of the scene or of the target.
  • the points in the point cloud may include the position information of the points, such as three-dimensional coordinates, and may also include color information (RGB) or reflection intensity information (Intensity).
  • the point cloud can be detected by the principle of laser measurement or photogrammetry, for example, the point cloud of the object can be obtained by scanning with a laser scanner or a photographic scanner.
  • the principle of laser-based point cloud acquisition is as follows: when a laser beam irradiates the surface of an object, the reflected laser carries information such as position and distance. If the laser beam is scanned along a certain trajectory, the reflected laser point information is recorded during scanning. Because the scanning is extremely fine, a large number of laser points can be obtained, forming a laser point cloud. Common point cloud formats include *.las, *.pcd, *.txt, etc.
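  • As an illustration only, the following minimal sketch shows how such a point cloud might be loaded from an ASCII *.txt file; the assumed column layout (x, y, z, intensity) and the file name are hypothetical and not specified by this application.

```python
import numpy as np

def load_point_cloud_txt(path):
    """Load an ASCII point cloud where each line is assumed to be: x y z intensity.

    Returns an (N, 4) float32 array: columns 0-2 are the 3D coordinates,
    column 3 is the reflection intensity.
    """
    points = np.loadtxt(path, dtype=np.float32)
    if points.ndim == 1:  # file containing a single point
        points = points[None, :]
    return points

# Example (hypothetical file name):
# points = load_point_cloud_txt("scene_000001.txt")
# points[:, :3] -> 3D coordinates, points[:, 3] -> intensity
```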
  • the point cloud data of the scene can be collected by the network device itself, collected by other devices and then obtained by the network device from those devices, or retrieved from a network database, etc.
  • there can be various scenes, for example, a road scene in autonomous driving, an aerial scene in drone flight, and so on.
  • a foreground point is defined relative to a background point.
  • a scene can be divided into a background and a foreground.
  • the points in the background can be called background points, and the points in the foreground can be called foreground points.
  • the point cloud of the scene can be semantically segmented to identify the foreground points in the point cloud of the scene.
  • the point cloud of the scene can be semantically segmented directly to obtain the foreground points in the point cloud.
  • Semantic Segmentation refers to classifying each point in the point cloud of a scene so as to identify points belonging to a certain type.
  • For example, 2D semantic segmentation or 3D semantic segmentation can be used to perform semantic segmentation on the point cloud.
  • Alternatively, the image of the scene may be segmented semantically to obtain foreground pixels, and then the foreground pixels are mapped to the point cloud to obtain the foreground points.
  • the step of "detecting foreground points from the point cloud of the scene" may include:
  • the point corresponding to a foreground pixel in the point cloud of the scene is determined as a foreground point.
  • the foreground pixels can be mapped to the point cloud of the scene to obtain the target points in the point cloud corresponding to the foreground pixels. For example, the mapping can be realized based on the mapping relationship between pixels in the image and points in the point cloud (such as a position mapping relationship), and the target points having a mapping relationship with foreground pixels are determined as foreground points.
  • the points in the point cloud can be projected into the image of the scene.
  • the points in the point cloud can be projected into the image of the scene through the mapping relationship matrix or transformation matrix between the point cloud and the pixels.
  • the segmentation result (such as foreground pixel or background pixel) corresponding to a point in the image is used as the segmentation result of the point, and based on the segmentation result of the point, it is determined whether the point is a foreground point, so that each foreground point is determined from the point cloud. Specifically, when the segmentation result of a point is a foreground pixel, the point is determined to be a foreground point.
  • the semantic segmentation in the embodiments of the present application can be implemented by a segmentation network based on deep learning.
  • a DeepLabV3 segmentation network based on Xception can be used, and the image of the scene can be segmented through this segmentation network to obtain foreground pixels, such as the foreground pixels of cars, pedestrians, and cyclists in autonomous driving. Then, the points in the point cloud are projected into the image of the scene, and the segmentation result corresponding to each point in the image is used as the segmentation result of that point, thereby determining the foreground points in the point cloud. This method can accurately detect the foreground points in the point cloud.
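  • As a hedged illustration of the projection-based labeling described above, the sketch below marks a 3D point as a foreground point when it projects onto a foreground pixel of the 2D segmentation mask; the 3×4 projection matrix stands in for the mapping/transformation matrix between the point cloud and the image and is assumed to come from sensor calibration.

```python
import numpy as np

def label_foreground_points(points_xyz, seg_mask, proj_matrix):
    """Mark each 3D point as foreground if it projects onto a foreground pixel.

    points_xyz : (N, 3) point coordinates.
    seg_mask   : (H, W) boolean mask from 2D semantic segmentation
                 (True = foreground pixel such as car / pedestrian / cyclist).
    proj_matrix: (3, 4) matrix projecting homogeneous 3D points to pixels
                 (assumed to be available from calibration).
    Returns a boolean array of length N: True = foreground point.
    """
    n = points_xyz.shape[0]
    homo = np.hstack([points_xyz, np.ones((n, 1))])   # (N, 4) homogeneous coords
    uvw = homo @ proj_matrix.T                         # (N, 3)
    u = uvw[:, 0] / uvw[:, 2]
    v = uvw[:, 1] / uvw[:, 2]
    h, w = seg_mask.shape
    # Keep only points that project inside the image and lie in front of the camera.
    valid = (uvw[:, 2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    is_fg = np.zeros(n, dtype=bool)
    idx = np.where(valid)[0]
    is_fg[idx] = seg_mask[v[idx].astype(int), u[idx].astype(int)]
    return is_fg
```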
  • the embodiment of the present application may construct the object area corresponding to each foreground point based on the foreground point and the predetermined size, and use the object area corresponding to the foreground point as the candidate object area.
  • the candidate object area may be a two-dimensional area, that is, a 2D area, or a three-dimensional area, that is, a 3D area, which may be determined according to actual requirements.
  • the predetermined size can be set according to actual needs, and the predetermined size can include predetermined size parameters, for example, length l*width w in the 2D area, and length l*width w*height h in the 3D area.
  • each foreground point can be taken as the center point, and the candidate object area corresponding to the foreground point can be generated according to a predetermined size.
  • the location information of the candidate object area may include position information, size information, and so on of the candidate object area.
  • the position information of the candidate object area may be represented by the position information of a reference point in the candidate object area, and the reference point may be set according to actual requirements; for example, the center point of the candidate object area can be used as the reference point.
  • the position information of the candidate object area may include the 3D coordinates of the center point such as (x, y, z).
  • the size information of the candidate object area may include the size parameter of the area.
  • for a 2D area, the size information of the candidate object area may include length l * width w;
  • for a 3D area, the size information of the candidate object area may include length l * width w * height h, and so on.
  • the orientation of the object is also important reference information. Therefore, in some embodiments, the positioning information of the candidate object region may also include the orientation of the candidate object region, such as forward, backward, downward, upward, and so on; the orientation of the candidate object area can indicate the orientation of the object in the scene. In practical applications, the orientation of the candidate object area can be expressed as an angle. For example, two orientations can be defined, 0° and 90° respectively.
  • the candidate object area may be identified in the form of a detection frame, for example, a 2D detection frame or a 3D detection frame.
  • a 2D segmentation network can be used to semantically segment the image to obtain the image segmentation result (including foreground pixels, etc.); then, referring to Figure 2b, the image segmentation result is mapped to the point cloud to obtain the point cloud segmentation result (including the foreground points). Then, with each foreground point as the center, a candidate object area is generated.
  • the schematic diagram of candidate object region generation is shown in Figure 2c. With each foreground point as the center, a 3D detection frame of a manually specified size is generated as a candidate object area.
  • the embodiment of the present application uses two orientations, 0° and 90° respectively.
  • this embodiment of the application can generate a candidate object area, such as a 3D candidate object detection frame, for each foreground point.
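  • The point-based proposal generation described above can be sketched as follows; the box size values, the column layout (x, y, z, l, w, h, angle), and the helper name are illustrative assumptions rather than parameters fixed by this application.

```python
import numpy as np

def generate_candidate_boxes(foreground_xyz, size_lwh=(3.9, 1.6, 1.56),
                             angles=(0.0, np.pi / 2)):
    """Generate fixed-size 3D candidate boxes centered at each foreground point.

    foreground_xyz : (M, 3) coordinates of the foreground points.
    size_lwh       : predetermined (length, width, height); the values here are
                     only an example (roughly a car-sized box).
    angles         : candidate orientations, here 0° and 90° as in the text.
    Returns an (M * len(angles), 7) array of boxes (x, y, z, l, w, h, angle).
    """
    l, w, h = size_lwh
    m = foreground_xyz.shape[0]
    boxes = []
    for angle in angles:
        size_cols = np.tile([l, w, h, angle], (m, 1))
        boxes.append(np.hstack([foreground_xyz, size_cols]))
    return np.concatenate(boxes, axis=0)
```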
  • the point cloud network may be a network based on deep learning, for example, it may be a point cloud network such as PointNet and PointNet++.
  • the time sequence between step 103 and step 102 is not limited by the sequence number, and step 102 may be executed before step 103, or step 103 may be executed before step 102, or simultaneously.
  • all points in the point cloud can be input to the point cloud network, and the point cloud network performs feature extraction on the input points to obtain a feature set corresponding to the point cloud.
  • the point cloud network may include a first sampling network and a second sampling network; wherein, the first sampling network is connected to the second sampling network.
  • the first sampling network can be called an encoder
  • the second sampling network can be a decoder.
  • the feature downsampling process is performed on all points in the point cloud through the first sampling network to obtain the initial feature of the point cloud; the initial feature is upsampled through the second sampling network to obtain the feature set of the point cloud.
  • the first sampling network includes a plurality of set abstraction layers (SA) connected in sequence
  • the second sampling network includes a plurality of feature propagation layers (FP) connected in sequence, which correspond one-to-one with the set abstraction layers (SA) in the first sampling network.
  • the SA in the first sampling network corresponds to the FP in the second sampling network, and the number can be set according to actual needs.
  • the first sampling network and the second sampling network include three layers of SA and FP respectively.
  • the first sampling network can include three downsampling steps (that is, the encoding stage includes three downsampling steps), with 1024, 256, and 64 points respectively;
  • the second sampling network can include three upsampling steps (that is, the decoding stage includes three upsampling steps), with 256, 1024, and N points respectively.
  • the feature extraction process of the point cloud network is as follows:
  • Input all the points of the point cloud to the first sampling network, then divide the points in the point cloud into local areas through each set abstraction layer (SA) in the first sampling network, and extract the features of the center point of each local area to obtain the initial features of the point cloud.
  • for example, the output point cloud feature is 64 × 1024.
  • pointnet++ uses the idea of hierarchical feature extraction, and each stage is called a set abstraction. It is divided into three parts: a sampling layer, a grouping layer, and a feature extraction layer. First, the sampling layer: in order to extract some relatively important center points from the dense point cloud, the farthest point sampling (FPS) method is adopted (random sampling is also possible). Next, the grouping layer searches for the k nearest neighbors within a certain range of each center point extracted by the previous layer to form a patch. The feature extraction layer performs convolution and pooling on these k points through a small pointnet network, and the obtained feature is used as the feature of this center point, which is then sent to the next layer. In this way, the center points obtained at each layer are a subset of the center points of the previous layer; as the number of layers deepens, the number of center points decreases, but each center point contains more and more information.
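  • A minimal sketch of the farthest point sampling (FPS) mentioned above is given below; it is a generic implementation of the sampling idea, not code taken from this application.

```python
import numpy as np

def farthest_point_sampling(points_xyz, num_samples):
    """Select `num_samples` center points that are maximally spread out.

    points_xyz : (N, 3) input coordinates, with N >= num_samples.
    Returns the indices of the sampled center points.
    """
    n = points_xyz.shape[0]
    selected = np.zeros(num_samples, dtype=np.int64)
    dist = np.full(n, np.inf)   # distance to the closest already-selected center
    current = 0                 # start from an arbitrary point
    for i in range(num_samples):
        selected[i] = current
        diff = points_xyz - points_xyz[current]
        dist = np.minimum(dist, np.einsum('ij,ij->i', diff, diff))
        current = int(np.argmax(dist))   # next center = farthest remaining point
    return selected
```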
  • the first sampling network in the embodiment of the present application is composed of multiple SA layers. At each level, a set of points is processed and abstracted to generate a new set with fewer elements.
  • the set abstraction layer consists of three key layers: a sampling layer, a grouping layer, and a point cloud network layer (PointNet layer).
  • the sampling layer selects a set of points from the input points, which define the centroids of the local regions.
  • the grouping layer constructs local region sets by finding the "neighboring" points around each centroid.
  • the point cloud network layer uses a mini-PointNet to encode each local region set into a feature vector.
  • the embodiment of the present application proposes an improved SA layer.
  • the grouping layer in the SA layer can use multi-scale grouping (MSG).
  • that is, the local features under each radius are extracted during grouping, and then combined together.
  • the idea is to sample multi-scale features in the grouping layer and concatenate (concat) them.
  • MSG grouping is used in the first and second SA layers.
  • single-scale grouping may also be used in an SA layer; for example, single-scale grouping (SSG) is used in the SA layer serving as the output.
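  • The multi-scale grouping idea can be illustrated with the simplified sketch below: for each centroid, neighbors within several radii are gathered and a pooled feature per radius is concatenated. The max over relative coordinates is used here only as a stand-in for the mini-PointNet that the real SA layer applies to each neighborhood, and the radii are illustrative assumptions.

```python
import numpy as np

def multi_scale_grouping(points_xyz, centroids_xyz, radii=(0.2, 0.4, 0.8),
                         max_neighbors=32):
    """Concatenate pooled neighborhood features computed at several radii.

    points_xyz    : (N, 3) all point coordinates.
    centroids_xyz : (M, 3) centroid coordinates chosen by the sampling layer.
    Returns an (M, 3 * len(radii)) multi-scale feature array.
    """
    feats = []
    for radius in radii:
        per_centroid = []
        for c in centroids_xyz:
            rel = points_xyz - c
            idx = np.where(np.linalg.norm(rel, axis=1) < radius)[0][:max_neighbors]
            if idx.size == 0:
                per_centroid.append(np.zeros(3))            # empty neighborhood
            else:
                per_centroid.append(rel[idx].max(axis=0))   # pooled local feature
        feats.append(np.asarray(per_centroid))
    return np.concatenate(feats, axis=1)                    # concat across scales
```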
  • the initial features of the point cloud can be input to the second sampling network, which performs upsampling processing, such as residual upsampling, on the initial features.
  • the three FP layers of the second sampling network perform upsampling on the 64 × 1024 features, and then output N × 128 features.
  • the step of "upsampling the initial features through the second sampling network to obtain the feature set of the point cloud” includes:
  • the current input feature is up-sampled through the current feature propagation layer to obtain the feature set of the point cloud.
  • the previous layer of the current FP layer can be an SA layer or an FP layer, and its output feature forms part of the current input feature.
  • for the first FP layer: after the 64*1024 point cloud feature is input to the first FP layer, the first FP layer determines the 64*1024 point cloud feature and the 256*256 feature that was input to the third SA layer as the current input features, upsamples them, and outputs the obtained feature to the second FP layer.
  • the second FP layer takes the 256*128 feature output by the previous FP layer and the 1024*128 feature that was input to the second SA layer as the input features of the current layer, upsamples them to obtain a 1024*128 feature, and inputs it to the third FP layer.
  • the third FP layer uses the 1024*128 feature output by the second FP layer and the N*4 feature that was input to the first SA layer as the input features of the current layer, and performs upsampling processing to output the final feature of the point cloud.
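  • One feature propagation (upsampling) step can be sketched as below: features of the sparser point set are interpolated onto the denser point set by inverse-distance weighting over the nearest neighbors and concatenated with the skip-connected features. This follows the common PointNet++-style FP operation and is an illustrative assumption about the upsampling details, not an exact reproduction of this application's layers.

```python
import numpy as np

def feature_propagation(dense_xyz, sparse_xyz, sparse_feat, skip_feat=None, k=3):
    """Interpolate features of a sparse point set onto a denser point set.

    dense_xyz  : (N, 3) coordinates of the denser point set (FP output points).
    sparse_xyz : (M, 3) coordinates of the sparser point set (previous layer).
    sparse_feat: (M, C) features of the sparser point set.
    skip_feat  : optional (N, C2) features of the corresponding SA input,
                 concatenated as a skip connection.
    Returns an (N, C) or (N, C + C2) feature array.
    """
    # Pairwise distances between dense and sparse points.
    d = np.linalg.norm(dense_xyz[:, None, :] - sparse_xyz[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]                      # k nearest sparse points
    w = 1.0 / (np.take_along_axis(d, nn, axis=1) + 1e-8)   # inverse-distance weights
    w = w / w.sum(axis=1, keepdims=True)
    interpolated = (sparse_feat[nn] * w[..., None]).sum(axis=1)   # (N, C)
    if skip_feat is not None:
        interpolated = np.concatenate([interpolated, skip_feat], axis=1)
    return interpolated
```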
  • feature extraction can be performed on all points in the point cloud to obtain a feature set of the point cloud, which prevents information loss and improves the accuracy of object detection.
  • the feature of some points can be selected from the feature set as the feature information of the candidate object area to which it belongs;
  • the position information of some points can be selected from the feature set as the feature information of the candidate object region to which they belong.
  • the feature and location information of some points can also be assembled to construct regional feature information.
  • the step of "constructing the region feature information of the candidate object region based on the feature set" may include:
  • the first part of feature information and the second part of feature information are fused to obtain the regional features of the candidate object region.
  • the number of target points and the selection method can be set according to actual needs. For example, a certain number of points can be selected in the candidate object area randomly or according to a certain selection method (such as selection based on the distance from the center point), for example, selecting 512 points.
  • the feature of the target point can be extracted from the feature set of the point cloud, and the extracted feature of the target point is used as the first part of the feature information of the candidate object area (which can be represented by F1).
  • the features of these 512 points can be extracted from the feature set of the point cloud to form the first part of feature information F1.
  • the position information of the target points can be directly used as the second part of the feature information of the candidate object area (which can be represented by F2).
  • the step of "constructing the second part of the feature information of the candidate object region based on the position information of the target point" may include:
  • the position information of the target point may include the coordinate information of the target point, such as the 3D coordinates xyz, and the normalization of the position information can be set according to actual needs.
  • for example, the position information of the target point can be adjusted based on the position information of the center point of the candidate object area. For example, the 3D coordinates of the center of the candidate object area are subtracted from the 3D coordinates of the target point.
  • the first part of feature information and standardized location information are fused to obtain the fused feature information of the target point.
  • the two can be fused using Concat (concatenation) to obtain the fused features (B, N, C+3).
  • the fusion feature can also be spatially transformed.
  • a spatial transformation network may be used for transformation, for example, a supervised spatial transformation network such as T-Net may be used.
  • the merged features (B, N, C+3) can be spatially transformed through T-Net to obtain the transformed coordinates (B, 3).
  • the normalized position value of the target point can be subtracted from the transformed position value to obtain the second partial feature F2 of the candidate object region.
  • the normalized 3D coordinates (B, N, 3) of the target point can be subtracted from the transformed 3D coordinates (B, 3) to obtain the second partial feature F2.
  • the geometric stability or spatial invariance of the position feature can be improved, thereby improving the accuracy of feature extraction.
  • the first part feature information and the second part feature information of each candidate object area can be obtained by the above method, and then the two parts of features are fused to obtain the area feature information of each candidate object area.
  • F1 and F2 can be concatenated (Concat) to obtain the connected features (B, N, C+3) of the candidate object region, and this feature is used as the regional feature of the candidate object region.
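  • The construction of the region feature for a single candidate box can be sketched as follows: target points inside the box are sampled, their point-cloud features form F1, and their coordinates normalized with respect to the box center form F2; the learned T-Net refinement of the coordinates is omitted in this simplified sketch.

```python
import numpy as np

def build_region_feature(points_xyz, point_feats, inside_mask, box_center,
                         num_target=512):
    """Build the feature of one candidate object region (F1 concatenated with F2).

    points_xyz : (N, 3) all point coordinates.
    point_feats: (N, C) per-point features from the point cloud network.
    inside_mask: (N,) boolean mask of points falling inside the candidate box
                 (the box is assumed to contain at least one point).
    box_center : (3,) center of the candidate box.
    Returns a (num_target, C + 3) region feature.
    """
    idx = np.where(inside_mask)[0]
    # Randomly sample a fixed number of target points (with replacement if needed).
    choice = np.random.choice(idx, num_target, replace=idx.size < num_target)
    f1 = point_feats[choice]                    # first part: per-point features
    f2 = points_xyz[choice] - box_center        # second part: normalized coordinates
    return np.concatenate([f1, f2], axis=1)
```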
  • the regional prediction network can be used to predict the type and location information of the candidate object area. For example, it can classify and locate the candidate object area to obtain the prediction type and predicted location information of the candidate prediction area.
  • the network can be based on deep learning.
  • the region prediction network can be trained from the point cloud or image of the sample object.
  • the predicted positioning information may include predicted position information such as 2D or 3D coordinates, dimensions such as length, width, and height.
  • it may also include predicted orientation information such as 0° or 90°.
  • the regional prediction network may include a feature extraction network, a classification network, and a regression network.
  • the classification network and the regression network are respectively connected to the feature extraction network, as follows:
  • the feature extraction network is used to perform feature extraction on input information, for example, perform feature extraction on the area feature information of the candidate object area to obtain the global feature information of the candidate object area.
  • the classification network is used to classify the area.
  • the candidate object area can be classified based on the global feature information of the candidate object area to obtain the prediction type of the candidate object area.
  • the regression network is used to locate the area, for example, to locate the candidate object area to obtain the predicted location information of the candidate object area. Because the regression network is used to predict the positioning, the output predicted positioning information can also be called regression information, such as predicted regression information.
  • the step of "predicting the type and location information of the candidate object area based on the area prediction network and area feature information to obtain the prediction type and predicted location information of the candidate object area” may include:
  • based on the classification network and the global feature information, classify the candidate object area to obtain the predicted type of the candidate object area;
  • based on the regression network and the global feature information, locate the candidate object area to obtain the predicted positioning information of the candidate object area.
  • the feature extraction network in the embodiment of the present application may include a plurality of sequentially connected set abstraction layers, namely SA layers; the classification network may include a plurality of fully connected layers (fc) connected in sequence. As shown in Figure 4b, multiple fcs for classification are included, such as cls-fc1, cls-fc2, and cls-pred. The regression network likewise includes multiple fully connected layers connected in sequence; as shown in Figure 4b, it includes multiple fcs for regression, such as reg-fc1, reg-fc2, and reg-pred. In the embodiment of the present application, the number of SA layers and fc layers can be set according to actual requirements.
  • the process of extracting the global feature information of the region may include: sequentially performing feature extraction on the region feature information through each set abstraction layer in the feature extraction network to obtain the global feature information of the candidate object region.
  • the structure of the set abstraction layer can refer to the above introduction.
  • the grouping in these SA layers can use single-scale grouping, that is, SSG, to improve the accuracy and efficiency of global feature extraction.
  • the region prediction network can perform feature extraction on the region feature information through the three SA layers in turn. For example, when the input is an M × 131 feature, features such as 128 × 128 and 32 × 256 are obtained after feature extraction by the three SA layers. After the SA layer feature extraction, the global feature information is obtained; at this time, the global feature information can be input to the classification network and the regression network respectively.
  • the classification network uses the first two cls-fc1 and cls-fc2 to perform dimensionality reduction processing on the global feature information, and performs classification prediction through the last cls-pred layer, and outputs the prediction type of the candidate object region.
  • the regression network uses the first two reg-fc1 and reg-fc2 to perform dimensionality reduction processing on the global feature information, and performs regression prediction through the last reg-pred layer to obtain the predicted location information of the candidate object region.
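  • A minimal sketch of this two-branch head is shown below. The region feature is pooled into a global vector (standing in for the SA-based feature extraction), and stacked fully connected layers produce class scores and regression parameters; the layer widths and the randomly initialized weights are illustrative assumptions, not trained parameters of this application.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fc(x, out_dim, rng):
    """A fully connected layer with randomly initialized weights (illustration only)."""
    weight = rng.standard_normal((x.shape[-1], out_dim)) * 0.01
    return x @ weight

def prediction_head(region_feature, num_classes=2, seed=0):
    """Two-branch head: classification (cls) and regression (reg).

    region_feature : (num_points, C) feature of one candidate region.
    Returns class scores (num_classes,) and 7 regression parameters
    corresponding to (x, y, z, l, h, w, angle).
    """
    rng = np.random.default_rng(seed)
    # Stand-in for the SA-based feature extraction: max-pool to a global vector.
    global_feat = region_feature.max(axis=0)
    # Classification branch: cls-fc1 -> cls-fc2 -> cls-pred.
    c = relu(fc(global_feat, 256, rng))
    c = relu(fc(c, 128, rng))
    cls_scores = fc(c, num_classes, rng)
    # Regression branch: reg-fc1 -> reg-fc2 -> reg-pred.
    r = relu(fc(global_feat, 256, rng))
    r = relu(fc(r, 128, rng))
    reg_params = fc(r, 7, rng)
    return cls_scores, reg_params
```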
  • the type of the candidate object area can be set according to actual needs; for example, according to whether there are objects in the area, it can be divided into areas with objects and areas without objects; or, according to quality, it can be divided into high, medium, and low quality.
  • the type and positioning information of each candidate object area can be predicted.
  • the positioning information of the candidate object area may be adjusted based on the predicted positioning information first, and then the candidate object area may be filtered based on the prediction type.
  • the candidate object regions may be screened based on the prediction type first, and then the positioning information may be adjusted.
  • the step of "optimizing the candidate object area based on the initial positioning information, prediction type, and predicted positioning information to obtain the target object detection area and the positioning information of the target object detection area" may include:
  • the initial location information of the filtered object area is optimized and adjusted to obtain the target object detection area and the location information of the target object detection area.
  • for example, the candidate object regions whose prediction type is an empty region can be filtered out, and then, based on the predicted positioning information of the remaining candidate object regions after filtering, their initial positioning information is optimized and adjusted.
  • the positioning information can be optimized and adjusted based on the difference information between the predicted positioning information and the initial positioning information, for example, the difference in the 3D coordinates of the area, the difference in size, and so on.
  • alternatively, an optimal positioning can be determined based on the predicted positioning information and the initial positioning information, and the positioning information of the candidate object area is then adjusted to this optimal positioning; for example, the 3D coordinates, length, width, and height of an optimal area are determined.
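  • The optimization step can be sketched as below: candidate regions whose predicted type is "empty" are dropped, and each remaining box is moved from its initial positioning toward the predicted positioning. Taking the predicted positioning directly as the refined box is just one of the adjustment strategies described above, chosen here for simplicity.

```python
import numpy as np

def optimize_candidates(init_boxes, pred_types, pred_boxes, empty_label=0):
    """Filter out empty candidate regions and refine the remaining boxes.

    init_boxes : (M, 7) initial boxes (x, y, z, l, h, w, angle).
    pred_types : (M,) predicted type of each candidate region (empty_label = no object).
    pred_boxes : (M, 7) predicted positioning for each candidate region.
    Returns the refined boxes and the indices of the kept candidates.
    """
    keep = pred_types != empty_label                 # drop regions predicted as empty
    # Adjust the initial positioning by its difference from the predicted positioning
    # (equivalent here to adopting the predicted positioning as the refined box).
    refined = init_boxes[keep] + (pred_boxes[keep] - init_boxes[keep])
    return refined, np.where(keep)[0]
```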
  • the object detection area can also be identified in the scene image based on the location information of the target object detection area.
  • the object detection method provided by the embodiment of the application can accurately detect the position, size, and orientation of objects on the current road in the autonomous driving scene, which is conducive to the decision-making and judgment of autonomous driving.
  • the object detection provided by the embodiments of the present application may be applicable to various scenarios, such as scenarios such as autonomous driving, drones, and security monitoring.
  • the embodiment of the present application can detect foreground points from the point cloud of the scene; construct the object area corresponding to each foreground point based on the foreground point and the predetermined size to obtain the initial positioning information of the candidate object area; perform feature extraction on all points of the point cloud based on the point cloud network to obtain the feature set corresponding to the point cloud; construct the region feature information of the candidate object area based on the feature set; predict the type and positioning information of the candidate object area based on the region prediction network and region feature information to obtain the predicted type and predicted positioning information of the candidate object area; and, based on the initial positioning information, predicted type, and predicted positioning information of the candidate object area, optimize the candidate object area to obtain the target object detection area and the positioning information of the target object detection area.
  • Using the point cloud data of the scene for object detection can improve the accuracy of object detection.
  • this solution can also generate candidate object regions for each foreground point in the point cloud, which can avoid information loss.
  • candidate object regions are generated for each foreground point, that is, a corresponding candidate region will be generated for any object. Therefore, detection is not affected by object scale changes or severe occlusion, which improves the effectiveness and success rate of object detection.
  • this solution can also optimize the candidate object region based on the region characteristics of the candidate object region; therefore, the accuracy and quality of object detection can be further improved.
  • the object detection device is specifically integrated in a network device as an example for description.
  • the network device can obtain a training set of the semantic segmentation network, which includes sample images labeled with pixel types (such as foreground pixels, background pixels, etc.).
  • the network device can train the semantic segmentation network based on the training set and a loss function.
  • the sample image can be semantically segmented through the semantic segmentation network to obtain foreground pixels of the sample image, and then the segmented pixel type and the labeled pixel type are converged based on the loss function to obtain the trained semantic segmentation network.
  • the network device obtains a training set of the point cloud network, and the training set includes sample point clouds of sample objects or scenes.
  • the network device can train the point cloud network based on the sample point cloud training set.
  • the network device obtains the training set of the region prediction network, which may include sample point clouds labeled with object area types and positioning information; the region prediction network is trained through this training set. Specifically, the object area type and positioning information of the sample point cloud are predicted, the predicted type is converged toward the real type, and the predicted positioning information is converged toward the real positioning information, to obtain the trained region prediction network.
  • the foregoing network training may be performed by the network device itself, or it may be obtained by the network device after the training of other devices is completed. It should be understood that the network applied in the embodiment of the present application is not limited to training in the foregoing manner, and may also be trained in other manners.
  • for an object detection method, the specific process can be as follows:
  • the network device acquires an image and a point cloud of the scene.
  • network equipment can obtain scene images and point clouds from image acquisition equipment and point cloud acquisition equipment respectively
  • the network device uses a semantic segmentation network to perform semantic segmentation on the image of the scene to obtain foreground pixels.
  • a road scene image can be collected first, and a 2D semantic segmentation network can be used to segment the scene image to obtain a segmentation result, including foreground pixels, background pixels, and so on.
  • the network device maps the foreground pixels to the point cloud of the scene to obtain the foreground points in the point cloud.
  • Xception-based DeepLabV3 can be used as the segmentation network, and the image of the scene can be segmented through this network to obtain foreground pixels, such as the foreground pixels of cars, pedestrians, and cyclists in autonomous driving. Then, the points in the point cloud are projected into the image of the scene, and the segmentation result corresponding to each point in the image is used as the segmentation result of that point, thereby obtaining the foreground points in the point cloud. This method can accurately detect the foreground points in the point cloud.
  • the network device constructs a three-dimensional candidate object area corresponding to each foreground point based on the foreground point and a predetermined size, and obtains the initial positioning information of the candidate object area.
  • each foreground point is taken as the center point, and the three-dimensional candidate object area corresponding to the foreground point is generated according to a predetermined size.
  • the location information of the candidate object area may include position information, size information, and so on of the candidate object area.
  • the candidate object area corresponding to each foreground point can be generated according to a predetermined size by using the foreground point as the center point, that is, point-based proposal generation (Point-Based Proposal Generation).
  • the network device performs feature extraction on all points in the point cloud through the point cloud network to obtain a feature set corresponding to the point cloud.
  • all points in the point cloud (B, N, 4) can be input to PointNet++, and the feature of the point cloud can be extracted through PointNet++ to obtain (B, N, C).
  • the network device constructs regional feature information of the candidate object region based on the feature set.
  • the network device can generate the area feature information of the candidate object area based on the feature set of the point cloud (ie, Proposal Feature Generation).
  • the network device selects multiple target points in the candidate object area; extracts the features of the target points from the feature set to obtain the first part of the feature information of the candidate object area; standardizes the position information of the target points to obtain the standardized position information of the target points; fuses the first part of the feature information and the standardized position information to obtain the fused feature information of the target points; spatially transforms the fused feature information of the target points to obtain the transformed position information of the target points; adjusts the standardized position information of the target points based on the transformed position information to obtain the second part of the feature information of the candidate object area; and fuses the first part of the feature information and the second part of the feature information to obtain the region feature of the candidate area.
  • the region feature generation can refer to the above-mentioned embodiment and the description of FIG. 3.
  • the network device predicts the type and location information of the candidate object area based on the area prediction network and the area feature information, and obtains the prediction type and predicted location information of the candidate object area.
  • the candidate region can be classified (cls) and regressed (reg) through the box prediction network (Box Prediction Net), so as to predict the type and regression parameters of the candidate object region.
  • the regression parameters are the predicted positioning information, including three-dimensional coordinates, length, width, height, orientation, and other parameters, such as (x, y, z, l, h, w, angle).
  • the network device optimizes the candidate object area based on the initial positioning information of the candidate object area, the prediction type of the candidate object area, and the predicted positioning information to obtain the target object detection area and the positioning information of the target object detection area.
  • the network device can filter the candidate object regions based on the prediction type of the candidate object regions to obtain the filtered object regions; according to the predicted positioning information of the filtered object regions, the initial positioning information of the filtered object regions can be optimized and adjusted to obtain the optimized object detection area and its positioning information.
  • the object detection area can also be identified in the scene image based on the location information of the target object detection area.
  • the object detection method provided by the embodiment of the application can accurately detect the position, size, and orientation of objects on the current road in the autonomous driving scene, which is conducive to the decision-making and judgment of autonomous driving.
  • the embodiment of the present application may use all the point clouds as input, and then use a PointNet++ structure to generate features for each point in the point cloud. Then use each point in the point cloud as an anchor point to generate a candidate area. After that, the feature of each point is used as input to optimize the candidate area to generate the final detection result.
  • the algorithm capabilities provided by the embodiments of this application have been tested on some data sets.
  • the capabilities of the algorithms provided by the embodiments of this application have been tested on an open source autonomous driving data set such as the KITTI data set.
  • the KITTI data set is an autonomous driving data set containing objects of various sizes and distances at the same time, and is very challenging.
  • the algorithm of the embodiment of this application surpasses all existing 3D object detection algorithms on KITTI, reaching a brand-new state of the art, and at the same time, on the hard difficulty set, it is far superior to the previous best algorithm.
  • the point clouds of 7481 training images and 7518 test images, covering three categories (cars, pedestrians, and cyclists), are tested.
  • the average precision (AP) is compared with that of other methods.
  • Other methods include MV3D (Multi-View 3D object detection), AVOD (Aggregate View Object Detection), VoxelNet (voxel network), F-PointNet (Frustum PointNet, frustum point cloud network), and AVOD-FPN (AVOD with a feature pyramid network).
  • Figure 5c shows the test results.
  • the accuracy of the object detection method (Ours in FIG. 5c) provided by the embodiment of the present application is significantly higher than other methods.
  • an embodiment of the present application also provides an object detection device.
  • the object detection device can be integrated in a network device.
  • the network device can be a server, a terminal, a vehicle-mounted device, a drone, or other equipment, and can also be a miniature processing box.
  • the object detection device may include a detection unit 601, a region construction unit 602, a feature extraction unit 603, a feature construction unit 604, a prediction unit 605, and an optimization unit 606, as follows:
  • the detection unit 601 is configured to detect foreground points from the point cloud of the scene;
  • the area construction unit 602 is configured to construct a candidate object area corresponding to each foreground point based on the foreground point and a predetermined size, and determine the initial positioning information of the candidate object area;
  • the feature extraction unit 603 is configured to perform feature extraction on all points in the point cloud based on the point cloud network to obtain a feature set corresponding to the point cloud;
  • the feature construction unit 604 is configured to construct the area feature information of the candidate object area based on the feature set;
  • the prediction unit 605 is configured to predict the type and location information of the candidate object area based on the area prediction network and the area feature information, and obtain the prediction type and predicted location information of the candidate object area;
  • the optimization unit 606 is configured to perform optimization processing on the candidate object area based on the initial positioning information of the candidate object area, the prediction type of the candidate object area, and the predicted positioning information to obtain the target object detection area and the positioning information of the target object detection area.
  • the detection unit 601 is specifically configured to:
  • the point corresponding to the foreground pixel in the point cloud of the scene is determined as a foreground point.
  • the area construction unit 602 is specifically configured to:
  • each foreground point is taken as the center point, and the candidate object area corresponding to the foreground point is generated according to a predetermined size.
  • the feature construction unit 604 specifically includes:
  • the selection subunit 6041 is configured to select multiple target points in the candidate object area
  • An extraction subunit 6042 configured to extract the feature of the target point from the feature set to obtain the first part of feature information of the candidate object region;
  • the constructing subunit 6043 is configured to construct the second part of the feature information of the candidate object region based on the position information of the target point;
  • the fusion subunit 6045 is configured to fuse the first partial feature information and the second partial feature information to obtain the region feature information of the candidate object region.
  • the construction subunit 6043 is specifically configured to:
  • the standardized position information of the target point is adjusted to obtain the second partial feature information of the candidate object region.
  • the point cloud network includes: a first sampling network, and a second sampling network connected to the first sampling network; the feature extraction unit 603 specifically includes:
  • a down-sampling subunit 6031 configured to perform feature down-sampling processing on all points in the point cloud through the first sampling network to obtain initial features of the point cloud;
  • the up-sampling subunit 6032 is configured to perform up-sampling processing on the initial features through the second sampling network to obtain a feature set of the point cloud.
  • the first sampling network includes a plurality of set abstraction layers connected in sequence;
  • the second sampling network includes a plurality of feature propagation layers connected in sequence and corresponding to each set abstraction layer in the first sampling network.
  • the downsampling subunit 6031 is specifically used for:
  • the points in the point cloud are sequentially divided into local areas through the set abstraction layer, and the characteristics of the central points of the local areas are extracted to obtain the initial characteristics of the point cloud;
  • the up-sampling subunit 6032 is specifically used for:
  • the current input feature is up-sampled through the current feature propagation layer to obtain the feature set of the point cloud.
  • the regional prediction network includes a feature extraction network, a classification network connected to the feature extraction network, and a regression network connected to the feature extraction network; referring to FIG. 6d, the prediction unit 605 specifically includes:
  • the global feature extraction subunit 6051 is configured to perform feature extraction on the regional feature information through the feature extraction network to obtain global feature information of the candidate object region;
  • the classification subunit 6052 is configured to classify the candidate object region based on the classification network and the global feature information to obtain the prediction type of the candidate region;
  • the regression sub-unit 6053 is configured to locate the candidate object area based on the regression network and the global feature information to obtain predicted positioning information of the candidate object area.
  • the feature extraction network includes a plurality of sequentially connected set abstraction layers;
  • the classification network includes a plurality of sequentially connected fully connected layers
  • the regression network includes a plurality of sequentially connected fully connected layers
  • the global feature extraction subunit 6051 is specifically configured to perform feature extraction on the region feature information in turn through the set abstraction layers in the feature extraction network to obtain the global feature information of the candidate object region (a simplified sketch of the prediction network follows below).
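A rough stand-in for the region prediction network is sketched below. The stacked set abstraction layers are replaced by a point-wise MLP with max pooling to produce the global feature, while the classification and regression branches keep the fc1/fc2/pred layout described above; the layer widths, the two-class output, and the 7-value box regression are assumptions for the sketch (the 131-dimensional input corresponds to 128 feature channels plus 3 coordinates mentioned in the description).

```python
import torch
import torch.nn as nn

class RegionPredictionHead(nn.Module):
    """Illustrative stand-in for the region prediction network."""
    def __init__(self, in_ch=131, num_classes=2, box_params=7):
        super().__init__()
        # Simplified replacement for the set abstraction layers.
        self.point_mlp = nn.Sequential(nn.Linear(in_ch, 128), nn.ReLU(),
                                       nn.Linear(128, 256), nn.ReLU())
        self.cls_head = nn.Sequential(nn.Linear(256, 256), nn.ReLU(),   # cls-fc1
                                      nn.Linear(256, 128), nn.ReLU(),   # cls-fc2
                                      nn.Linear(128, num_classes))      # cls-pred
        self.reg_head = nn.Sequential(nn.Linear(256, 256), nn.ReLU(),   # reg-fc1
                                      nn.Linear(256, 128), nn.ReLU(),   # reg-fc2
                                      nn.Linear(128, box_params))       # reg-pred

    def forward(self, region_feats):                                    # (B, M, in_ch)
        global_feat = self.point_mlp(region_feats).max(dim=1).values    # (B, 256)
        return self.cls_head(global_feat), self.reg_head(global_feat)

# Example: 512 target points per candidate region, 128-dim features + 3-dim coords.
logits, box_reg = RegionPredictionHead()(torch.randn(4, 512, 131))
```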
  • the optimization unit 606 specifically includes:
  • the screening subunit 6061 is used to screen candidate object regions based on the prediction type of the candidate object regions to obtain the filtered object regions;
  • the optimization subunit 6062 is configured to optimize and adjust the initial positioning information of the filtered object area according to the predicted positioning information of the filtered object area to obtain the target object detection area and the location information of the target object detection area.
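The screening and optimization performed by subunits 6061 and 6062 can be illustrated by the short sketch below: boxes whose predicted type indicates an object are kept, and their initial positioning is then adjusted with the predicted positioning values. Treating the regression output as additive residuals and using a fixed score threshold are assumptions of this sketch, not requirements of the described scheme.

```python
import numpy as np

def refine_candidates(boxes, scores, box_residuals, score_thresh=0.5):
    """Illustrative post-processing of candidate object regions.

    boxes:         (K, 7) initial (x, y, z, l, h, w, angle) of candidate regions.
    scores:        (K,) predicted probability that a region contains an object.
    box_residuals: (K, 7) predicted positioning values from the regression branch.
    Returns the refined boxes and the indices of the kept candidates.
    """
    keep = scores > score_thresh                    # screen by prediction type
    refined = boxes[keep] + box_residuals[keep]     # adjust the initial positioning
    return refined, np.where(keep)[0]
```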
  • in specific implementation, each of the above units can be implemented as an independent entity, or combined arbitrarily and implemented as the same entity or as several entities.
  • for the specific implementation of each of the above units, please refer to the previous method embodiments; details will not be repeated here.
  • the object detection device of this embodiment can detect foreground points from the point cloud of the scene through the detection unit 601; the region construction unit 602 then constructs the candidate object area corresponding to each foreground point based on the foreground point and a predetermined size, and obtains the initial positioning information of the candidate object area; the feature extraction unit 603 performs feature extraction on all points in the point cloud based on the point cloud network to obtain the feature set corresponding to the point cloud; the feature construction unit 604 constructs the region feature information of the candidate object area based on the feature set; the prediction unit 605 predicts the type and positioning information of the candidate object area based on the region prediction network and the region feature information, obtaining the prediction type and predicted positioning information of the candidate object area; the optimization unit 606 optimizes the candidate object area based on the initial positioning information of the candidate object area, the prediction type of the candidate object area, and the predicted positioning information to obtain the target object detection area and its positioning information.
  • this solution can use the point cloud data of the scene for object detection, and can also generate a candidate object region for each foreground point and optimize the candidate object regions based on their region features; therefore, it can greatly improve the accuracy of object detection, and is especially suitable for 3D object detection.
  • FIG. 7 shows a schematic structural diagram of the network device involved in the embodiment of the present application, specifically:
  • the network device may include one or more processing core processors 701, one or more computer-readable storage medium memory 702, power supply 703, input unit 704 and other components.
  • those skilled in the art can understand that the network device structure shown in FIG. 7 does not constitute a limitation on the network device; it may include more or fewer components than shown in the figure, combine certain components, or adopt a different arrangement of components. Among them:
  • the processor 701 is the control center of the network device. It uses various interfaces and lines to connect the various parts of the entire network device, runs or executes the software programs and/or modules stored in the memory 702, and calls the data stored in the memory 702 to perform the various functions of the network device and process data, thereby monitoring the network device as a whole.
  • the processor 701 may include one or more processing cores; preferably, the processor 701 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, etc., and the modem processor mainly handles wireless communication. It can be understood that the foregoing modem processor may also not be integrated into the processor 701.
  • the memory 702 may be used to store software programs and modules.
  • the processor 701 executes various functional applications and data processing by running the software programs and modules stored in the memory 702.
  • the memory 702 may mainly include a storage program area and a storage data area.
  • the storage program area may store an operating system, an application program required by at least one function (such as a sound playback function, an image playback function, etc.), and the like; the storage data area may store data created according to the use of the network device, and the like.
  • the memory 702 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices.
  • the memory 702 may further include a memory controller to provide the processor 701 with access to the memory 702.
  • the network device also includes a power supply 703 for supplying power to various components.
  • the power supply 703 may be logically connected to the processor 701 through a power management system, so that functions such as charging, discharging, and power consumption management can be managed through the power management system.
  • the power supply 703 may also include any components such as one or more DC or AC power supplies, a recharging system, a power failure detection circuit, a power converter or inverter, and a power status indicator.
  • the network device may further include an input unit 704, which can be used to receive input digital or character information and generate keyboard, mouse, joystick, optical or trackball signal input related to user settings and function control.
  • the network device may also include a display unit, etc., which will not be repeated here.
  • the processor 701 in the network device loads the executable files corresponding to the processes of one or more application programs into the memory 702 according to the following instructions, and the processor 701 runs the application programs stored in the memory 702, thereby realizing various functions as follows:
  • Detect foreground points from the point cloud of the scene; construct the candidate object area corresponding to each foreground point based on the foreground point and the predetermined size to obtain the initial positioning information of the candidate object area; perform feature extraction on all points in the point cloud based on the point cloud network to obtain the feature set corresponding to the point cloud; construct the region feature information of the candidate object area based on the feature set; predict the type and positioning information of the candidate object area based on the region prediction network and the region feature information to obtain the prediction type and predicted positioning information of the candidate object area; and optimize the candidate object area based on the initial positioning information of the candidate object area, the prediction type and the predicted positioning information to obtain the target object detection area and the positioning information of the target object detection area.
  • the network device of this embodiment detects foreground points from the point cloud of the scene; constructs the candidate object area corresponding to each foreground point based on the foreground point and a predetermined size to obtain the initial positioning information of the candidate object area; performs feature extraction on all points in the point cloud based on the point cloud network to obtain the feature set corresponding to the point cloud; constructs the region feature information of the candidate object area based on the feature set; predicts the type and positioning information of the candidate object area based on the region prediction network and the region feature information to obtain the prediction type and predicted positioning information of the candidate object area; and optimizes the candidate object area based on the initial positioning information of the candidate object area, the prediction type and the predicted positioning information, obtaining the target object detection area and its positioning information.
  • this solution can use the point cloud data of the scene for object detection, and can also generate a candidate object region for each foreground point and optimize the candidate object regions based on their region features; therefore, it can greatly improve the accuracy of object detection, and is especially suitable for 3D object detection.
  • an embodiment of the present application further provides a storage medium in which multiple instructions are stored, and the instructions can be loaded by a processor to execute the steps in any object detection method provided in the embodiments of the present application.
  • the instruction can perform the following steps:
  • Detect foreground points from the point cloud of the scene; construct the candidate object area corresponding to each foreground point based on the foreground point and the predetermined size to obtain the initial positioning information of the candidate object area; perform feature extraction on all points in the point cloud based on the point cloud network to obtain the feature set corresponding to the point cloud; construct the region feature information of the candidate object area based on the feature set; predict the type and positioning information of the candidate object area based on the region prediction network and the region feature information to obtain the prediction type and predicted positioning information of the candidate object area; and optimize the candidate object area based on the initial positioning information of the candidate object area, the prediction type and the predicted positioning information to obtain the target object detection area and the positioning information of the target object detection area.
  • the storage medium may include: read only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of this application disclose an object detection method, apparatus, network device and storage medium. The embodiments can detect foreground points from the point cloud of a scene; construct a candidate object area corresponding to each foreground point based on the foreground point and a predetermined size, obtaining the initial positioning information of the candidate object area; perform feature extraction on all points in the point cloud based on a point cloud network to obtain the feature set corresponding to the point cloud; construct the region feature information of the candidate object area based on the feature set; predict the type and positioning information of the candidate object area based on a region prediction network and the region feature information, obtaining the prediction type and predicted positioning information of the candidate object area; and optimize the candidate object area based on the initial positioning information, the prediction type and the predicted positioning information of the candidate object area, obtaining the target object detection area and its positioning information. This solution can improve the accuracy of object detection.

Description

一种物体检测方法、装置、网络设备和存储介质
本申请要求于2019年04月03日提交中国专利局、申请号为201910267019.5、申请名称为“一种物体检测方法、装置、网络设备和存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能技术领域,具体涉及物体检测技术。
背景技术
物体检测是指在某个场景中确定物体的位置、类别等。目前物体检测技术已经广泛应用到各种场景中,比如,自动驾驶、无人机等场景。
目前的物体检测方案普遍是采集场景图像,从场景图像中提取特征,然后,基于提取的特征确定出物体在该场景图像中的位置和类别。然而,经过实践发现,目前的物体检测方案存在物体检测精确度较低等问题,尤其在3D物体检测场景。
发明内容
本申请实施例提供一种物体检测方法、装置、网络设备和存储介质,可以提升物体检测的精确性。
本申请实施例提供一种物体检测方法,由网络设备执行,包括:
从场景的点云中检测出前景点;
基于前景点和预定尺寸构建所述前景点对应的候选物体区域,确定候选物体区域的初始定位信息;
基于点云网络对所述点云中的所有点进行特征提取,得到所述点云对应的特征集;
基于所述特征集构建所述候选物体区域的区域特征信息;
基于区域预测网络和所述区域特征信息,预测候选物体区域的类型和定位信息,得到候选物体区域的预测类型和预测定位信息;
基于候选物体区域的初始定位信息、候选物体区域的预测类型和预测定位信息对候选物体区域进行优化处理,得到目标物体检测区域以及目标物体检测区域的定位信息。
相应的,本申请实施例还提供一种物体检测装置,包括:
检测单元,用于从场景的点云中检测出前景点;
区域构建单元,用于基于前景点和预定尺寸构建所述前景点对应的候选物体区域,得到候选物体区域的初始定位信息;
特征提取单元,用于基于点云网络对所述点云中的所有点进行特征提取,得到所述点云对应的特征集;
特征构建单元,用于基于所述特征集构建所述候选物体区域的区域特征 信息;
预测单元,用于基于区域预测网络和所述区域特征信息,预测候选物体区域的类型和定位信息,得到候选物体区域的预测类型和预测定位信息;
优化单元,用于基于候选物体区域的初始定位信息、候选物体区域的预测类型和预测定位信息对候选物体区域进行优化处理,得到目标物体检测区域以及目标物体检测区域的定位信息。
本申请实施例还提供了一种网络设备,包括存储器和处理器;所述存储器存储有多条指令,所述处理器加载所述存储器内的指令,以执行本申请实施例提供的任一种物体检测方法中的步骤。
此外,本申请实施例还提供一种存储介质,所述存储介质存储有多条指令,所述指令适于处理器进行加载,以执行本申请实施例提供的任一种物体检测方法中的步骤。
此外,本申请实施例还提供了一种计算机程序产品,包括指令,当其在计算机上运行时,使得计算机执行本申请实施例提供的任一种物体检测方法中的步骤。
本申请实施例可以从场景的点云中检测出前景点;基于前景点和预定尺寸构建所述前景点对应的候选物体区域,并确定该候选物体区域的初始定位信息;基于点云网络对所述点云中的所有点进行特征提取,得到所述点云对应的特征集;基于所述特征集构建所述候选物体区域的区域特征信息;基于区域预测网络和所述区域特征信息,预测候选物体区域的类型和定位信息,得到候选物体区域的预测类型和预测定位信息;基于候选物体区域的初始定位信息、候选物体区域的预测类型和预测定位信息对候选物体区域进行优化处理,得到目标物体检测区域以及目标物体检测区域的定位信息。由于该方案可以采用场景的点云数据进行物体检测,并且还可以针对点云中的每个前景点生成候选物体区域,基于候选物体区域的区域特征对候选物体区域进行优化处理;因此,可以大大提升物体检测的精确性,尤其对于3D物体检测来说检测效果提升得格外明显。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1a是本申请实施例提供的物体检测方法的场景示意图;
图1b是本申请实施例提供的物体检测方法的流程图;
图1c是本申请实施例提供的点云网络的结构示意图;
图1d是本申请实施例提供的PointNet++网络结构示意图;
图1e是本申请实施例提供的自动驾驶场景中物体检测效果示意图;
图2a是本申请实施例提供的图像语义分割示意图;
图2b是本申请实施例提供的点云分割示意图;
图2c是本申请实施例提供的候选区域生成示意图;
图3是本申请实施例提供的候选区域特征构建示意图;
图4a是本申请实施例提供的区域预测网络的结构示意图
图4b是本申请实施例提供的区域预测网络的另一结构示意图;
图5a是本申请实施例提供的物体检测的另一流程示意图;
图5b是本申请实施例提供的物体检测的架构图;
图5c是本申请实施例提供的测试实验结果示意图;
图6a是本申请实施例提供的物体检测装置的结构示意图;
图6b是本申请实施例提供的物体检测装置的另一结构示意图;
图6c是本申请实施例提供的物体检测装置的另一结构示意图;
图6d是本申请实施例提供的物体检测装置的另一结构示意图;
图6e是本申请实施例提供的物体检测装置的另一结构示意图;
图7是本申请实施例提供的网络设备的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请实施例提供一种物体检测方法、装置、网络设备和存储介质。其中,该物体检测装置可以集成在网络设备中,该网络设备可以是服务器,也可以是终端等设备;比如,网络设备可以包括、车载设备、微型处理盒子等设备。
所谓物体检测,是指确定或识别某个场景中物体的位置、类别等,比如,识别某个道路场景中物体的类别和位置,如路灯、车辆及其位置等。
参考图1a,本申请实施例提供了物体检测***包括网络设备和采集设备等;网络设备与采集设备之间通讯连接,比如,通过有线或无线网络连接等。在一实施例中,网络设备与采集设备可以集成在一台设备。
其中,采集设备,可以用于采集场景的点云数据或者图像数据等,在一实施例中采集设备可以将采集到的点云数据上传给网络设备进行处理。
网络设备,可以用于物体检测,具体地,可以从场景的点云中检测出前景点;基于前景点和预定尺寸构建前景点对应的候选物体区域,得到候选物体区域的初始定位信息;基于点云网络对点云中的所有点进行特征提取,得到点云对应的特征集;基于特征集构建候选物体区域的区域特征信息;基于区域预测网络和区域特征信息,预测候选物体区域的类型和定位信息,得到候选物体区域的预测类型和预测定位信息;基于候选物体区域的初始定位信 息、候选物体区域的预测类型和预测定位信息对候选物体区域进行优化处理,得到目标物体检测区域及其定位信息。实际应用中,在得到目标物体检测区域的定位信息之后,可以根据定位信息在场景图像中标识检测到的物体,比如,以检测框的方式在场景图像中框选出检测到的物体,在一实施例中,还可以在场景图像中标识检测到的物体的类型。
以下分别进行详细说明。需说明的是,以下实施例的描述顺序不作为对实施例优选顺序的限定。
本实施例将从物体检测装置的角度进行描述,该物体检测装置具体可以集成在网络设备中,该网络设备可以是服务器,也可以是终端等设备;其中,该终端可以包括手机、平板电脑、笔记本电脑、以及个人计算(PC,Personal Computer)、微型处理终端等设备。
本申请实施例提供的一种物体检测方法,该方法可以由网络设备的处理器执行,如图1b所示,该物体检测方法的具体流程可以如下:
101、从场景的点云中检测出前景点。
其中,点云为场景或目标表面特性的点集合,点云中的点可以包含点的位置信息如三维坐标,此外,还可以包括颜色信息(RGB)或反射强度信息(Intensity)。
点云可以通过激光测量原理或者摄影测量原理检测得到,比如,可以通过激光扫描仪、或者照相式扫描仪扫描得到物体的点云。激光检测点云的原理为:当一束激光照射到物体表面时,所反射的激光会携带方位、距离等信息。若将激光束按照某种轨迹进行扫描,便会边扫描边记录到反射的激光点信息,由于扫描极为精细,则能够得到大量的激光点,因而就可形成激光点云。点云格式有*.las;*.pcd;*.txt等。
本申请实施例中,场景的点云数据可以由网络设备自己采集,也可以由其他设备采集,网络设备从其他设备获取,或者,从网络数据库中搜索等等。
其中,场景可以为多种,比如,可以自动驾驶中的道路场景、无人机飞行中的航空场景等等。
其中,前景点是相对于背景点而言的,一个场景可以划分为背景和前景,背景中的点可以称为背景点、前景中的点可以称为前景点。本申请实施例可以通过对场景的点云进行语义分割,识别场景点云中的前景点。
本申请实施例中,从点云中检测出前景点的方式有多种,比如,可以直接对场景的点云进行语义分割,得到点云中的前景点。语义分割(Semantic Segmentation)是指对一个场景的点云中的每个点进行分类,从而识别出属于某个类型的点。语义分割的方式可以有多种,比如,可以采用2D语义分割或者3D语义分割对点云进行语义分割。
又比如,为了能够检测到更多的前景点、提升前景点的检测可信度和准确性,在一实施例中,可以先对场景的图像进行语义分割,得到前景像素, 然后,将前景像素映射到点云中,得到前景点。具体地,步骤“从场景的点云中检测出前景点”,可以包括:
对场景的图像进行语义分割,得到前景像素;
将场景的点云中与前景像素对应的点确定为前景点。
在一实施例中,可以将前景像素映射到场景的点云中,得到点云中与前景像素对应的目标点,譬如,可以基于图像中像素与点云中的点之间的映射关系(如位置映射关系等)实现映射,将与前景像素具有映射关系的目标点确定为前景点。
在另一实施例中,可以将点云中的点投影到场景的图像中,如通过点云与像素之间的映射关系矩阵或变换矩阵,将点云中的点投影到场景的图像中,然后,将点在图像中对应的分割结果(如前景像素、背景像素等)作为点的分割结果,基于点的分割结果确定该点是否为前景点,由此从点云中确定各个前景点,具体地,当点的分割结果为前景像素时,确定该点为前景点。
为了提升语义分割的精确性,本申请实施例的语义分割可以通过基于深度学习的分割网络来实现,比如,可以了基于X-ception的DeepLabV3作为的分割网络,通过该分割网络对场景的图像进行分割,得到前景像素如自动驾驶中的车、行人、骑行的人的前景像素点。然后,将点云中的点投影到场景的图像中,然后将其在图片中对应的分割结果,作为这个点的分割结果,由此确定点云中的前景点。该方式可以精确地检测出点云中的前景点。
102、基于前景点和预定尺寸构建前景点对应的候选物体区域,确定候选物体区域的初始定位信息。
在得到前景点之后,本申请实施例可以基于前景点和预定尺寸构建每个前景点对应的物体区域,将前景点对应的物体区域作为候选物体区域。
其中,候选物体区域可以为二维区域即2D区域,也可以为三维区域即3D区域,具体可以根据实际需求来定。其中,预定尺寸可以根据实际需求设定,预定尺寸可以包括预定的尺寸参数,比如,在2D区域中包括长l*宽w,在3D区域中包括长l*宽w*高h。
比如,为了提升物体检测的准确性,可以以前景点为中心点,按照预定尺寸生成前景点对应的候选物体区域。
其中,候选物体区域的定位信息可以包括候选物体区域的位置信息、尺寸信息等等。
比如,在一实施例中,为了便于物体检测过程中的后续计算,候选物体区域的位置信息可以由候选物体区域中参考点的位置信息表示,该参考点可以根据实际需求设定,比如,可以将候选物体区域的中心点作为参考点。例如,以三维区域为例,候选物体区域的位置信息可以包括中心点的3D坐标如(x、y、z)。
其中,候选物体区域的尺寸信息可以包括区域的尺寸参数,比如,候选 物体区域为2D区域时,候选物体区域的尺寸信息可以包括长l*宽w,候选物体区域为3D区域时,候选物体区域的尺寸信息可以包括长l*宽w*高h等。
此外,在一些场景中,物体的朝向也是比较重要的参考信息,因此,在一些实施例中,候选物体区域的定位信息还可以包括候选物体区域的朝向,如向前、向后、向下、向上等,该候选物体区域的朝向能够表明场景中的物体的朝向。实际应用中,候选物体区域的朝向可以基于角度来表示,比如,可以定义两个朝向,分别为0°和90°。
在实际应用中,为了便于物体检测和用户观察,候选物体区域可以以检测框的形式标识,比如,2D检测框、3D检测框标识。
譬如,以行驶道路场景为例,参考图2a可以采用2D分割网络对图像进行语义分割,得到图像分割结果(包括前景像素等);然后,参考图2b,将图像分割结果映射到点云中,得到点云分割结果(包含前景点)。接着,以每个前景点为中心,产生候选物体区域。候选物体区域生成示意图如图2c。以每个前景点为中心,生成一个人为规定大小的3D检测框,作为候选物体区域。候选物体区域以(x,y,z,l,h,w,angle)作为表示,其中x,y,z表示中心点的3D坐标,而l,h,w为我们设定的候选区域的长高宽。在实际实验中l=3.8,h=1.6,w=1.5。angle表示3D候选区域的朝向,当生成候选物体区域的时候,本申请实施例采用了两个朝向,分别是0°和90°。
通过上述步骤本申请实施例可以针对每个前景点生成一个候选物体区域,如3D候选物体检测框。
103、基于点云网络对点云中的所有点进行特征提取,得到点云对应的特征集。
其中,点云网络可以为基于深度学习的网络,比如,可以为PointNet、PointNet++等点云网络。本申请实施例中步骤103与步骤102之间的时序不受序号限制,可以是步骤102执行在步骤103之前,也可以是步骤103执行在步骤102之前,也可以同时执行。
具体地,可以将点云中所有的点输入至点云网络,点云网络对输入的点进行特征提取,以得到点云对应的特征集。
下面以PointNet++为例来介绍点云网络,如图1c所示,点云网络可以包括第一采样网络和第二采样网络;其中,第一采样网络与第二采样网络连接。在实际应用中,第一采样网络可以称为编码器,第二采样网络可以成为解码器。具体地,通过第一采样网络对点云中的所有点进行特征降采样处理,得到点云的初始特征;通过第二采样网络对初始特征进行上采样处理,得到点云的特征集。
参考图1d,第一采样网络包括多个依次连接的集合抽象层(SA,set abstraction),第二采样网络包括多个依次连接、且与第一采样网络中各集合抽象层(SA)一一对应的特征传播层(FP,feature propagation)。第一采 样网络中的SA和第二采样网络中的FP相对应,数量可以根据实际需求设定,比如,第一采样网络和第二采样网络分别包括三层SA、FP。
参考图1d,第一采样网络可以包括三次降采样处理(也即编码阶段包括三步降采样处理),点的数量分别为1024,256,64;第二采样网络可以包括三次上采样处理(也即解码阶段包括三步上采样处理),三步的点数为256,1024,N。点云网络提取特征过程如下:
将点云的所有点输入至第一采样网络,通过第一采样网络中各集合抽象层(SA)依次对点云中的点进行局部区域划分,并提取局部区域中心点的特征,得到点云的初始特征;比如,参考图1d,通过输入为点云N×4经过三层SA降采样处理后,输出点云的特征为64×1024特征。
本申请实施例中,pointnet++使用了分层抽取特征的思想,把每一次叫做set abstraction。分为三部分:采样层、分组层、特征提取层。首先来看采样层,为了从稠密的点云中抽取出一些相对较为重要的中心点,采用最远点采样法(farthest point sampling,FPS),当然也可以随机采样。然后是分组层,在上一层提取出的中心点的某个范围内寻找最近的k个近邻点组成patch。特征提取层是将这k个点通过小型pointnet网络进行卷积和pooling处理,得到的特征作为此中心点的特征,再送入下一个分层继续。这样每一层得到的中心点都是上一层中心点的子集,并且随着层数加深,中心点的个数越来越少,但是每一个中心点包含的信息越来越多。
根据上述描述,本申请实施例中第一采样网络由多个SA层组成,在每个层次上,处理和抽象一组点以产生具有较少元素的新集合。集合抽象层由三个关键层组成:采样层(Sampling layer)、分组层(Grouping layer)、点云网络层(PointNet layer)。采样层从输入点选择一组点,这些点定义局部区域的质心。分组层通过找到质心周围的“相邻”点来构造局部区域集合。点云网络层使用一个微型点网将局部区域集合编码成特征向量。
在一实施例中,考虑到实际点云很少是均匀分布的,在采样的时候,对于密集的区域,应该使用小尺度采样,以得到深入细致的特征(finest details),但在稀疏区域,应该使用大尺度采样,因为过小的尺度会导致稀疏处的采样不足。因此,本申请实施例提出了改良的SA层。具体地,在SA层中的分组层(Grouping layer)可以使用Multi-scale grouping(MSG,多尺度分组),具体地,在分组时把每种半径下的局部特征都提取出来,然后组合到一起。其思想是在grouping layer中,采样多尺度的特征,concat(连接)起来。比如,参考图1d,在第一、二层SA层中使用MSG分组。
此外,在一实施例中,为了提升采样密度变化的稳健性,在SA中还可以采用单一尺度分组(SSG),比如,在作为输出的SA层使用单一尺度分组(SSG)。
在第一采样网络输出点云的初始特征之后,可以将点云的初始特征输入 至第二采样网络,通过第二采样网络对初始特征进行上采样处理如残差上采样处理。比如,参考图1d,经过第二采样网络的三层FP对64×1024特征进行上采样处理后,输出N×128的特征。
在一实施中,为了提升防止特征梯度变化、或者丢失,在第二采样网络进行上采样处理时还需要考虑到第一采样网络中各SA层输出的特征。具体地,步骤“通过第二采样网络对初始特征进行上采样处理,得到点云的特征集”,包括:
将上一层的输出特征、以及当前特征传播层对应的集合抽象层的输入特征,确定为当前特征传播层的当前输入特征;
通过当前特征传播层对当前输入特征进行上采样处理,得到点云的特征集。
其中,上一层的输出特征可以包括当前FP层上一层的SA层或FP层,比如,参考图1d,在输入64*1024点云特征至第一个FP层,第一个FP层将64*1024点云特征、以及输入第三个SA层的256*256特征确定为当前输入特征,对该特征进行上采样处理,将得到的特征输出至第二个FP层。第二个FP层将上一FP层的输出特征256*128特征、与输入第二个SA层的1024*128特征作为当前层输入特征,并对该特征进行上采样处理,得到1024*128特征输入值第三个FP层。第三个FP层将第二个FP层输出的1024*128特征、与输入第一个SA层的N*4特征作为当前层输入特征,并进行上采样处理,输出点云的最终特征。
通过上述步骤可以对点云中所有点进行特征提取,得到点云的特征集,防止信息丢失,提升了物体检测的准确性。
104、基于特征集构建候选物体区域的区域特征信息。
本申请实施例基于点云的特征集构建候选物体区域的特征信息的方式可以有多种,比如,可以从特征集中选择一些点的特征作为其所属的候选物体区域的特征信息;又比如,还可以从特征集中选择一些点的位置信息作为其所属的候选物体区域的特征信息。
又比如,为提升区域特征的提取精确性,还可以集合一些点的特征和位置信息来构建区域特征信息。具体地,步骤“基于特征集构建候选物体区域的区域特征信息”,可以包括:
在候选物体区域中选择多个目标点;
从特征集中提取目标点的特征,得到候选物体区域的第一部分特征信息;
基于目标点的位置信息,构建候选物体区域的第二部分特征;
对第一部分特征信息与第二部分特征信息进行融合,得到候选物体区域的区域特征。
其中,目标点的数量和选择方式可以根据实际需求设定,比如,可以在候选物体区域中随机或者按照一定选择方式(如基于离中心点的距离来选择等)选择一定数量的点,如选择512个点。
在从候选物体区域中选择目标点之后,可以从点云的特征集中提取目标点的特征,提取的目标点的特征作为候选物体区域的第一部分特征信息(可以用F1表示)。比如,在随机选择512个点后,可以从点云的特征集(即特征集)中提取这512个点的特征组成第一部分特征信息F1。
譬如,参考图3,可以从点云的特征集(B、N、C)中crop(裁剪)候选物体区域内512个目标点的特征组成F1(B、M、C),M为目标点数量,如M=512,其中,N为点云中点的数量。
其中,基于目标点的位置信息构建候选物体区域的第二部分特征的方式可以有多种,比如,可以将目标点的位置信息直接作为候选物体区域的第二部分特征信息(可以用F2表示)。
又比如,为了提升位置特征的提取精确性,还可以在对位置信息做一些变换后构建候选物体区域的第二部分特征。比如,步骤“基于目标点的位置信息构建候选物体区域的第二部分特征信息”,可以包括:
(1)、对目标点的位置信息进行标准化处理,得到目标点的标准化位置信息。
其中,目标点的位置信息可以包括目标点的坐标信息如3D坐标xyz,位置信息的标准化处理(Normalize)可以根据实际需求设定,比如,可以基于候选物体区域的中心点位置信息对目标点的位置信息进行调整。譬如,将目标点的3D坐标减去候选物体区域中心的3D坐标等。
(2)、对第一部分特征信息和标准化位置信息进行融合,得到目标点的融合后特征信息。
比如,参考图3,可以将M=512个点的标准化位置信息(如3D坐标xyz)与第一部分特征F1进行融合,具体地,可以采用Concat(连接)方式对二者进行融合,得到融合后特征(B、N、C+3)。
(3)对目标点的融合后特征信息进行空间变换,得到目标点的变换后位置信息。
为了进一步提升第二部分特征的提取准确性,还可以对融合后特征进行空间变换。
比如,在一实施例中,可以采用空间变换网络(STN)进行变换,譬如,可以采用受监督的空间变换网络如T-Net。参考图3,可以通过T-Net对融合后特征(B、N、C+3)进行空间变换,得到变换后坐标(B、3)。
(4)、基于变换后位置信息,对目标点的标准化位置信息进行调整,得到候选物体区域的第二部分特征信息。
比如,可以将目标点的标准化位置值减去变换位置值,得到候选物体区域的第二部分特征F2。参考图3,可以将标准化处理(Normalize)的目标点3D坐标(B、N、3)减去变换后3D坐标(B、3)得到第二部分特征F2。
由于对特征进行空间变换,将位置特征减去变换后位置后,可以提升位 置特征的几何稳定性或者空间不变性,从而提升特征提取的精确性。
通过上述方式可以得到每个候选物体区域的第一部分特征信息和第二部分特征信息,然后,将这两部分特征进行融合便可以得到每个候选物体区域的区域特征信息。比如,参考图3,可以将F1与F2连接(Concat)得到候选物体区域的连接后特征(B、N、C+3),将该特征作为候选物体区域的区域特征。
105、基于区域预测网络和区域特征信息,预测候选物体区域的类型和定位信息,得到候选物体区域的预测类型和预测定位信息。
其中,区域预测网络,可以用于预测候选物体区域的类型和定位信息,比如,可以对候选物体区域进行分类和定位,得到候选预测区域的预测类型和预测定位信息,该网络可以为基于深度学习的区域预测网络,可以由样本物体的点云或图像训练而成。
其中,预测定位信息可以包括预测的位置信息如2D或3D坐标、尺寸如长宽高等,此外在一实施例中,还可以包括预测的朝向信息如0°或90°。
下面介绍区域预测网络的结构,参考图4a,区域预测网络可以包括特征提取网络、分类网络以及回归网络,分类网络与回归网络分别与特征提取网络连接。如下:
其中,特征提取网络,用于对输入信息进行特征提取,比如,对候选物体区域的区域特征信息进行特征提取,得到候选物体区域的全局特征信息。
分类网络,用于对区域进行分类,比如,可以基于候选物体区域的全局特征信息对候选物体区域进行分类,得到候选物体区域的预测类型。
回归网络,用于对区域进行定位,比如,对候选物体区域进行定位,得到候选物体区域的预测定位信息。由于用回归网络预测定位,因此输出的预测定位信息也可以称为回归信息,如预测回归信息。
比如,步骤“基于区域预测网络和区域特征信息,预测候选物体区域的类型和定位信息,得到候选物体区域的预测类型和预测定位信息”,可以包括:
通过特征提取网络对区域特征信息进行特征提取,得到候选物体区域的全局特征信息;
基于分类网络和全局特征信息,对候选物体区域进行分类,得到候选物体区域的预测类型;
基于回归网络和全局特征信息,对候选物体区域的进行定位,得到候选物体区域的预测定位信息。
为了提升预测的准确性,参考图4b,本申请实施例中特征提取网络可以包括:多个依次连接的集合抽象层即SA层;分类网络可以包括多个依次连接的全连接层(fc),如图4b所示,包括用于分类的多个fc,如cls-fc1、cls-fc2、cls-pred。其中,回归网络包括多个依次连接的全连接层,如图4b所示,包括多个用于回归的fc,如reg-fc1、reg-fc2、reg-pred。本申请实施例中,SA层和fc层的数量可以根据实际需求设定。
本申请实施例中,区域的全局特征信息提取过程可以包括:通过特征提取网络中各个集合抽象层依次对区域特征信息进行特征提取,得到候选物体区域的全局特征信息。
其中,集合抽象层的结构可以参考上述的介绍,在一实施例中,SA层中分组可以采用单一尺度的方式分组,即采用SSG分组,提升全局特征提取的准确性和效率。
参考图4b,区域预测网络可以通过三个SA层依次对区域特征信息进行特征提取,如当输入特征input为M×131特征时,经过三个SA层特征提取,分别得到128×128、32×256等特征。在经过SA层特征提取后,得到全局特征信息,此时,可以将全局特征信息分别输入至分类网络和回归网络。
分类网络通过前两个cls-fc1、cls-fc2对全局特征信息进行降维处理,并通过最后一个cls-pred层进行分类预测,输出候选物体区域的预测类型。
回归网络通过前两个reg-fc1、reg-fc2对全局特征信息进行降维处理,并通过最后一个reg-pred层进行回归预测,得到候选物体区域的预测定位信息。
其中,候选物体区域的类型可以根据实际需求设定,比如,按区域内是否有物体可以划分为有物体、没有物体;或者按质量划分还可以划分为质量高、中、低。
通过上述步骤可以预测出每个候选物体区域的类型和定位信息。
106、基于初始定位信息、预测类型和预测定位信息对候选物体区域进行优化处理,得到目标物体检测区域、以及目标物体检测区域的定位信息。
其中,优化方式可以多种,比如,可以先基于预测定位信息对候选物体区域的定位信息进行调整,然后,再基于预测类型筛选候选物体区域。又比如,在一实施例中,可以先基于预测类型筛选候选物体区域,然后,调整定位信息。
例如,步骤“基于初始定位信息、预测类型和预测定位信息对候选物体区域进行优化处理,得到目标物体检测区域及目标物体检测区域的定位信息”,可以包括:
基于候选物体区域的预测类型对候选物体区域进行筛选,得到筛选后物体区域;
根据筛选后物体区域的预测定位信息,对筛选后物体区域的初始定位信息进行优化调整,得到目标物体检测区域及目标物体检测区域的定位信息。
例如,当预测类型包括有物体区域、空区域的情况下,可以将预测类型为空区域的候选物体区域过滤掉,然后,基于过滤处理后剩余的候选物体区域的预测定位信息,对其初始定位信息进行优化调整。
具体地,定位信息优化调整方式,比如,可以基于预测定位信息与初始定位信息之间的差异信息进行调整,譬如,区域3D坐标的差值、尺寸差值等。
又比如,还可以基于预测定位信息和初始定位信息确定一个最优的定位 信息,然后,将候选物体区域的定位信息调整为该最优的定位信息。譬如,确定一个最优区域3d坐标和长宽高等。
在实际应用中,还可以基于目标物体检测区域的定位信息在场景图像中标识出物体检测区域,比如,参考图1e,采用本申请实施例提供的物体检测方法可以在自动驾驶场景中准确地检测当前道路上的物体的位置、大小、以及方向,有利于自动驾驶的决策和判断。
本申请实施例提供的物体检测可以适用于各种场景,比如,自动驾驶、无人机、安全监控等场景。
由上可知,本申请实施例可以从场景的点云中检测出前景点;基于前景点和预定尺寸构建前景点对应的物体区域,得到候选物体区域的初始定位信息;基于点云网络对点云中的所有点进行特征提取,得到点云对应的特征集;基于特征集构建候选物体区域的区域特征信息;基于区域预测网络和区域特征信息,预测候选物体区域的类型和定位信息,得到候选物体区域的预测类型和预测定位信息;基于候选物体区域的初始定位信息、候选物体区域的预测类型和预测定位信息对候选物体区域进行优化处理,得到目标物体检测区域以及目标物体检测区域的定位信息该方案采用场景的点云数据进行物体检测,可以提升物体检测的准确性。
并且该方案还可以针对点云中的每个前景点生成候选物体区域,可以避免信息丢失,同时针对每个前景点生成候选物体区域,也即对于任意一个物体,都会产生其对应的候选区域,因此,不会受到物体尺度变化以及严重遮挡的影响,提升了物体检测的有效性和成功率。
此外,该方案还可以基于候选物体区域的区域特征对候选物体区域进行优化处理;因此,可以进一步提升物体检测的精确性和质量。
根据上面实施例所描述的方法,以下将举例作进一步详细说明。
在本实施例中,将以该物体检测装置具体集成在网络设备为例进行说明。
(一)分别对语义分割网络、点云网络以及区域预测网络进行训练,具体可以如下:
1、语义分割网络的训练。
首先,网络设备可以获取语义分割网络的训练集,该训练集包括标注了像素类型(如前景像素、背景像素等)的样本图像。
其中,网络设备可以基于该训练集、损失函数对语义分割进行训练。具体地,可以通过语义分割网络对样本图像进行语义分割,得到样本图像的前景像素,然后,基于损失函数对分割得到的像素类型与标注的像素类型进行收敛,得到训练后的语义分割网络。
2、点云网络的训练。
网络设备获取点云网络的训练集,该训练集包括样本物体或场景的样本点云。网络设备可以基于样本点云训练集对点云网络进行训练。
3、区域预测网络
网络设备获取区域预测网络的训练集,该训练集可以包括标注了物体区域类型和定位信息的样本点云;通过该训练集对区域预测网络进行训练,具体地,预测样本点云的物体区域类型和的定位信息,将预测类型与真实类型进行收敛,将预测定位信息与真实定位信息进行收敛,得到训练后的区域预测网络。
上述网络训练可以由网络设备自己执行,也可以由其他设备训练完成后,网络设备获取应用。应当理解的是本申请实施例应用的网络不仅限于上述方式来训练,还可以通过其他方式来训练。
(二)通过该训练好的语义分割网络、点云网络以及区域预测网络,便可以基于点云进行物体检测,具体可参见图5a和图5b。
如图5a所示,一种物体检测方法,具体流程可以如下:
501、网络设备获取场景的图像和点云。
比如,网络设备可以分别从图像采集设备和点云采集设备获取场景的图像和点云
502、网络设备采用语义分割网络对场景的图像进行语义分割,得到前景像素。
参考图5b,以自动驾驶场景为例,可以先采集道路场景图像,可以采用2D语义分割网络对场景的图像进行分割,得到分割结果,包括前景像素、背景像素等。
503、网络设备将前景像素点映射到场景的点云中,得到点云中的前景点。
比如,可以将基于X-ception的DeepLabV3作为的分割网络,通过该分割网络对场景的图像进行分割,得到前景像素如自动驾驶中的车、行人、骑行的人的前景像素点。然后,将点云中的点投影到场景的图像中,然后将其对应的图片中的分割结果,作为这个点的分割结果,进而产生点云中的前景点。该方式可以精确地检测出点云中的前景点。
504、网络设备基于每个前景点和预定尺寸构建每个前景点对应的三维候选物体区域,得到候选物体区域的初始定位信息。
比如,以前景点为中心点并按照预定尺寸生成前景点对应的三维候选物体区域。
其中,候选物体区域的定位信息可以包括候选物体区域的位置信息、尺寸信息等等。
比如,参考图5b,可以在得到前景点后,通过以前景点为中心点并按照预定尺寸生成前景点对应的候选物体区域,即生成基于点的候选物体区域(Piont-Based Proposal Generation)。
详细的候选物体区域可以参考图2a至图2b,以及上述的相关介绍。
505、网络设备通过点云网络对点云中的所有点进行特征提取,得到点云 对应的特征集。
参考图5b,可以将点云(B,N,4)中所有点输入到PointNet++,通过PointNet++提取点云的特征,得到(B,N,C)。
具体的点云网络结构和特征提取过程可以参考上述实施例的描述。
506、网络设备基于特征集构建候选物体区域的区域特征信息。
参考图5b,在得到候选物体区域的初始定位信息、以及点云的特征集后,网络设备可以基于点云的特征集生成候选物体区域的区域特征信息(即Proposal Feature Generation)。
比如,网络设备在候选物体区域中选择多个目标点;从特征集中提取目标点的特征,得到候选物体区域的第一部分特征信息;对目标点的位置信息进行标准化处理,得到目标点的标准化位置信息;对第一部分特征信息和标准化位置信息进行融合,得到目标点的融合后特征信息;对目标的融合后特征信息进行空间变换,得到目标点的变换后位置信息;基于变换后位置信息,对目标点的标准化位置信息进行调整,得到候选物体区域的第二部分特征信息;对第一部分特征信息与第二部分特征信息进行融合,得到候选区域的区域特征。
具体地,区域特征生成可以参考上述实施例和图3的描述。
507、网络设备基于区域预测网络和区域特征信息,预测候选物体区域的类型和定位信息,得到候选物体区域的预测类型和预测定位信息。
比如,参考图5b,可以通过边界预测网络(Box Prediction Net)对候选区域进行分类(cls)以及回归(reg),从而预测候选物体区域的类型和回归参数,该回归参数即为预测定位信息,包括三维坐标、长宽高、朝向等参数如(x,y,z,l,h,w,angle)。
508、网络设备基于候选物体区域的初始定位信息、候选物体区域的预测类型和预测定位信息对候选物体区域进行优化处理,得到目标物体检测区域以及目标物体检测区域的定位信息。
比如,网络设备可以基于候选物体区域的预测类型对候选物体区域进行筛选,得到筛选后物体区域;根据筛选后物体区域的预测定位信息,对筛选后物体区域的初始定位信息进行优化调整,得到优化后物体检测区域及其定位信息。
在实际应用中,还可以基于目标物体检测区域的定位信息在场景图像中标识出物体检测区域,比如,参考图1e,采用本申请实施例提供的物体检测方法可以在自动驾驶场景中准确地检测当前道路上的物体的位置、大小、以及方向,有利于自动驾驶的决策和判断。
本申请实施例可以将全部的点云作为输入,然后使用一个PointNet++的结构为点云中的每一个点产生特征。然后以点云中的每一个点为锚点生成候选区域。之后,以每一个点的特征作为输入,优化候选区域,从而生成最后的 检测结果。
并且,在一些数据集中测试了本申请实施例提供的算法能力,比如,在开源的自动驾驶数据集如KITTI数据集上测试了本申请实施例提供的算法的能力,其中KITTI数据集是一个自动驾驶数据集,同时拥有多种大小和距离的物体,非常具有挑战性。本申请实施例的算法在KITTI上超过了所有的现有的3D物体检测的算法,达到了一个全新的state-of-the-art,同时在其中的困难集上更是远超之前最好的算法。
在KITTI数据集上,测试了三类(汽车、行人和骑自行车)的7481训练图像的点云和7518的测试图像的点云。并采用最广泛实验的平均精度(AP)与其他方法进行度量比较,其他方法包括MV3D(Multi-View 3D object detection,多模态3D物体检测)、AVOD(Aggregate View Object Detection,多视图物体检测)、VoxelNet(3D像素网络)、F-PointNet(Frustum-PointNet,视锥点云网络)、AVOD-FPN(多视图物体检测-视锥点云网络)。如图5c所示为测试结果。从而结果来看本申请实施例提供的物体检测方法(图5c中的Ours)的精度明显高于其他方法。
为了更好地实施以上方法,相应的,本申请实施例还提供一种物体检测装置,该物体检测装置具体可以集成在网络设备中,该网络设备可以是服务器,也可以是终端、车载设备、无人机等设备,还可以为比如微型处理盒子等。
例如,如图6a所示,该物体检测装置可以包括检测单元601、区域构建单元602、特征提取单元603、特征构建单元604、预测单元605和优化单元606,如下:
检测单元601,用于从场景的点云中检测出前景点;
区域构建单元602,用于基于前景点和预定尺寸构建所述前景点对应的候选物体区域,确定候选物体区域的初始定位信息;
特征提取单元603,用于基于点云网络对所述点云中的所有点进行特征提取,得到所述点云对应的特征集;
特征构建单元604,用于基于所述特征集构建所述候选物体区域的区域特征信息;
预测单元605,用于基于区域预测网络和所述区域特征信息,预测候选物体区域的类型和定位信息,得到候选物体区域的预测类型和预测定位信息;
优化单元606,用于基于候选物体区域的初始定位信息、候选物体区域的预测类型和预测定位信息对候选物体区域进行优化处理,得到目标物体检测区域及目标物体检测区域的定位信息。
在一实施例中,检测单元601,具体用于:
对场景的图像进行语义分割,得到前景像素;
将场景的点云中与前景像素对应的点确定为前景点。
在一实施例中,区域构建单元602,具体用于:
以前景点为中心点,按照预定尺寸生成所述前景点对应的候选物体区域。
在一实施例中,参考图6b,特征构建单元604,具体包括:
选择子单元6041,用于在所述候选物体区域中选择多个目标点;
提取子单元6042,用于从所述特征集中提取所述目标点的特征,得到所述候选物体区域的第一部分特征信息;
构建子单元6043,用于基于所述目标点的位置信息构建所述候选物体区域的第二部分特征信息;
融合子单元6045,用于对所述第一部分特征信息与所述第二部分特征信息进行融合,得到所述候选物体区域的区域特征信息。
在一实施例中,构建子单元6043,具体用于:
对所述目标点的位置信息进行标准化处理,得到目标点的标准化位置信息;
对所述第一部分特征信息和所述标准化位置信息进行融合,得到目标点的融合后特征信息;
对所述目标的融合后特征信息进行空间变换,得到变换后位置信息;
基于所述变换后位置信息,对所述目标点的标准化位置信息进行调整,得到候选物体区域的第二部分特征信息。
在一实施例中,参考图6c,所述点云网络包括:第一采样网络、与所第一采样网络连接的第二采样网络;所述特征提取单元603,具体包括:
降采样子单元6031,用于通过所述第一采样网络对所述点云中的所有点进行特征降采样处理,得到点云的初始特征;
上采样子单元6032,用于通过所述第二采样网络对所述初始特征进行上采样处理,得到点云的特征集。
在一实施例中,所述第一采样网络包括多个依次连接的集合抽象层,所述第二采样网络包括多个依次连接且与所述第一采样网络中各集合抽象层一一对应的特征传播层;
降采样子单元6031,具体用于:
通过所述集合抽象层依次对点云中的点进行局部区域划分,并提取局部区域中心点的特征,得到点云的初始特征;
将所述点云的初始特征输入至第二采样网络;
上采样子单元6032,具体用于:
将上一层的输出特征、以及当前特征传播层对应的集合抽象层的输出特征,确定为当前特征传播层的当前输入特征;
通过当前特征传播层对当前输入特征进行上采样处理,得到点云的特征集。
在一实施例中,所述区域预测网络包括特征提取网络、与特征提取网络 连接的分类网络、以及与特征提取网络连接的回归网络;参考图6d,预测单元605,具体包括:
全局特征提取子单元6051,用于通过所述特征提取网络对所述区域特征信息进行特征提取,得到候选物体区域的全局特征信息;
分类子单元6052,用于基于所述分类网络和所述全局特征信息,对所述候选物体区域进行分类,得到候选区域的预测类型;
回归子单元6053,用于基于所述回归网络和所述全局特征信息,对所述候选物体区域的进行定位,得到候选物体区域的预测定位信息。
在一实施例中,所述特征提取网络包括多个依次连接的集合抽象层,所述分类网络包括多个依次连接的全连接层,所述回归网络包括多个依次连接的全连接层;
所述全局特征提取子单元6051,具体用于通过特征提取网络中集合抽象层依次对区域特征信息进行特征提取,得到候选物体区域的全局特征信息。
在一实施例中,参考图6e,优化单元606,具体包括:
筛选子单元6061,用于基于候选物体区域的预测类型对候选物体区域进行筛选,得到筛选后物体区域;
优化子单元6062,用于根据筛选后物体区域的预测定位信息,对筛选后物体区域的初始定位信息进行优化调整,得到目标物体检测区域及目标物体检测区域的定位信息。
具体实施时,以上各个单元可以作为独立的实体来实现,也可以进行任意组合,作为同一或若干个实体来实现,以上各个单元的具体实施可参见前面的方法实施例,在此不再赘述。
由上可知,本实施例的物体检测装置可以通过检测单元601从场景的点云中检测出前景点;然后由区域构建单元602基于前景点和预定尺寸构建所述前景点对应的候选物体区域,得到候选物体区域的初始定位信息;由特征提取单元603基于点云网络对所述点云中的所有点进行特征提取,得到所述点云对应的特征集;由特征构建单元604基于所述特征集构建所述候选物体区域的区域特征信息;由预测单元605基于区域预测网络和所述区域特征信息,预测候选物体区域的类型和定位信息,得到候选物体区域的预测类型和预测定位信息;由优化单元606基于候选物体区域的初始定位信息、候选物体区域的预测类型和预测定位信息对候选物体区域进行优化处理,得到目标物体检测区域及其定位信息。由于该方案可以采用场景的点云数据进行物体检测,并且还可以针对每个前景点生成候选物体区域,基于候选物体区域的区域特征对候选物体区域进行优化处理;因此,可以大大提升物体检测的精确性,尤其适用于3D物体检测。
此外,本申请实施例还提供一种网络设备,如图7所示,其示出了本申请实施例所涉及的网络设备的结构示意图,具体来讲:
该网络设备可以包括一个或者一个以上处理核心的处理器701、一个或一个以上计算机可读存储介质的存储器702、电源703和输入单元704等部件。本领域技术人员可以理解,图7中示出的网络设备结构并不构成对网络设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。其中:
处理器701是该网络设备的控制中心,利用各种接口和线路连接整个网络设备的各个部分,通过运行或执行存储在存储器702内的软件程序和/或模块,以及调用存储在存储器702内的数据,执行网络设备的各种功能和处理数据,从而对网络设备进行整体监控。可选的,处理器701可包括一个或多个处理核心;优选的,处理器701可集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作***、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到处理器701中。
存储器702可用于存储软件程序以及模块,处理器701通过运行存储在存储器702的软件程序以及模块,从而执行各种功能应用以及数据处理。存储器702可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作***、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据网络设备的使用所创建的数据等。此外,存储器702可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。相应地,存储器702还可以包括存储器控制器,以提供处理器701对存储器702的访问。
网络设备还包括给各个部件供电的电源703,优选的,电源703可以通过电源管理***与处理器701逻辑相连,从而通过电源管理***实现管理充电、放电、以及功耗管理等功能。电源703还可以包括一个或一个以上的直流或交流电源、再充电***、电源故障检测电路、电源转换器或者逆变器、电源状态指示器等任意组件。
该网络设备还可包括输入单元704,该输入单元704可用于接收输入的数字或字符信息,以及产生与用户设置以及功能控制有关的键盘、鼠标、操作杆、光学或者轨迹球信号输入。
尽管未示出,网络设备还可以包括显示单元等,在此不再赘述。具体在本实施例中,网络设备中的处理器701会按照如下的指令,将一个或一个以上的应用程序的进程对应的可执行文件加载到存储器702中,并由处理器701来运行存储在存储器702中的应用程序,从而实现各种功能,如下:
从场景的点云中检测出前景点;基于前景点和预定尺寸构建所述前景点对应的候选物体区域,得到候选物体区域的初始定位信息;基于点云网络对所述点云中的所有点进行特征提取,得到所述点云对应的特征集;基于所述 特征集构建所述候选物体区域的区域特征信息;基于区域预测网络和所述区域特征信息,预测候选物体区域的类型和定位信息,得到候选物体区域的预测类型和预测定位信息;基于候选区域的初始定位信息、候选物体区域的预测类型和预测定位信息对候选物体区域进行优化处理,得到目标物体检测区域及目标物体检测区域的定位信息。
以上各个操作的具体实施可参见前面的实施例,在此不再赘述。
由上可知,本实施例的网络设备从场景的点云中检测出前景点;基于前景点和预定尺寸构建所述前景点对应的候选物体区域,得到候选物体区域的初始定位信息;基于点云网络对所述点云中的所有点进行特征提取,得到所述点云对应的特征集;基于所述特征集构建所述候选物体区域的区域特征信息;基于区域预测网络和所述区域特征信息,预测候选物体区域的类型和定位信息,得到候选物体区域的预测类型和预测定位信息;基于候选物体区域的初始定位信息、候选物体区域的预测类型和预测定位信息对候选物体区域进行优化处理,得到目标物体检测区域及其定位信息。由于该方案可以采用场景的点云数据进行物体检测,并且还可以针对每个前景点生成候选物体区域,基于候选物体区域的区域特征对候选物体区域进行优化处理;因此,可以大大提升物体检测的精确性,尤其适用于3D物体检测。
本领域普通技术人员可以理解,上述实施例的各种方法中的全部或部分步骤可以通过指令来完成,或通过指令控制相关的硬件来完成,该指令可以存储于一计算机可读存储介质中,并由处理器进行加载和执行。
为此,本申请实施例还提供一种存储介质,其中存储有多条指令,该指令能够被处理器进行加载,以执行本申请实施例所提供的任一种物体检测方法中的步骤。例如,该指令可以执行如下步骤:
从场景的点云中检测出前景点;基于前景点和预定尺寸构建所述前景点对应的候选物体区域,得到候选物体区域的初始定位信息;基于点云网络对所述点云中的所有点进行特征提取,得到所述点云对应的特征集;基于所述特征集构建所述候选物体区域的区域特征信息;基于区域预测网络和所述区域特征信息,预测候选物体区域的类型和定位信息,得到候选物体区域的预测类型和预测定位信息;基于候选物体区域的初始定位信息、候选物体区域的预测类型和预测定位信息对候选物体区域进行优化处理,得到目标物体检测区域及目标物体检测区域定位信息。
以上各个操作的具体实施可参见前面的实施例,在此不再赘述。
其中,该存储介质可以包括:只读存储器(ROM,Read Only Memory)、随机存取记忆体(RAM,Random Access Memory)、磁盘或光盘等。
由于该存储介质中所存储的指令,可以执行本申请实施例所提供的任一种物体检测方法中的步骤,因此,可以实现本申请实施例所提供的任一种物体检测方法所能实现的有益效果,详见前面的实施例,在此不再赘述。
以上对本申请实施例所提供的一种物体检测方法、装置、网络设备和存储介质进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (23)

  1. 一种物体检测方法,由网络设备执行,所述方法包括:
    从场景的点云中检测出前景点;
    基于所述前景点和预定尺寸构建所述前景点对应的候选物体区域,确定所述候选物体区域的初始定位信息;
    基于点云网络对所述点云中的所有点进行特征提取,得到所述点云对应的特征集;
    基于所述特征集构建所述候选物体区域的区域特征信息;
    基于区域预测网络和所述区域特征信息,预测所述候选物体区域的类型和定位信息,得到所述候选物体区域的预测类型和预测定位信息;
    基于所述初始定位信息、所述预测类型和所述预测定位信息对所述候选物体区域进行优化处理,得到目标物体检测区域以及所述目标物体检测区域的定位信息。
  2. 如权利要求1所述的物体检测方法,所述从场景的点云中检测出前景点,包括:
    对所述场景的图像进行语义分割,得到前景像素;
    将所述场景的点云中与所述前景像素对应的点确定为所述前景点。
  3. 如权利要求1所述的物体检测方法,所述基于所述前景点和预定尺寸构建所述前景点对应的候选物体区域,包括:
    以所述前景点为中心点,按照所述预定尺寸生成所述前景点对应的候选物体区域。
  4. 如权利要求1所述的物体检测方法,所述基于所述特征集构建所述候选物体区域的区域特征信息,包括:
    在所述候选物体区域中选择多个目标点;
    从所述特征集中提取所述目标点的特征,得到所述候选物体区域的第一部分特征信息;
    基于所述目标点的位置信息构建所述候选物体区域的第二部分特征信息;
    对所述第一部分特征信息与所述第二部分特征信息进行融合,得到所述候选物体区域的区域特征信息。
  5. 如权利要求4所述的物体检测方法,所述基于所述目标点的位置信息构建所述候选物体区域的第二部分特征信息,包括:
    对所述目标点的位置信息进行标准化处理,得到所述目标点的标准化位置信息;
    对所述第一部分特征信息和所述标准化位置信息进行融合,得到所述目标点的融合后特征信息;
    对所述目标点的融合后特征信息进行空间变换,得到变换后位置信息;
    基于所述变换后位置信息,对所述目标点的标准化位置信息进行调整,得到所述候选物体区域的第二部分特征信息。
  6. 如权利要求1所述的物体检测方法,所述点云网络包括:第一采样网络、与所述第一采样网络连接的第二采样网络;所述基于点云网络对所述点云中的所有点进行特征提取,得到所述点云的特征集,包括:
    通过所述第一采样网络对所述点云中的所有点进行特征降采样处理,得到所述点云的初始特征;
    通过所述第二采样网络对所述初始特征进行上采样处理,得到所述点云的特征集。
  7. 如权利要求6所述的物体检测方法,所述第一采样网络包括多个依次连接的集合抽象层,所述第二采样网络包括多个依次连接且与所述第一采样网络中各集合抽象层一一对应的特征传播层;
    所述通过所述第一采样网络对所述点云中的所有点进行特征降采样处理,得到所述点云的初始特征,包括:
    通过多个所述集合抽象层依次对所述点云中的点进行局部区域划分,并提取局部区域中心点的特征,得到所述点云的初始特征;
    将所述点云的初始特征输入至所述第二采样网络;
    所述通过所述第二采样网络对所述初始特征进行上采样处理,得到所述点云的特征集,包括:
    将上一层的输出特征、以及当前特征传播层对应的集合抽象层的输入特征,确定为当前特征传播层的当前输入特征;
    通过所述当前特征传播层对所述当前输入特征进行上采样处理,得到所述点云的特征集。
  8. 如权利要求1所述的物体检测方法,所述区域预测网络包括特征提取网络、与所述特征提取网络连接的分类网络、以及与所述特征提取网络连接的回归网络;
    所述基于区域预测网络和所述区域特征信息,预测所述候选物体区域的类型和定位信息,得到所述候选物体区域的预测类型和预测定位信息,包括:
    通过所述特征提取网络对所述区域特征信息进行特征提取,得到所述候选物体区域的全局特征信息;
    基于所述分类网络和所述全局特征信息,对所述候选物体区域进行分类,得到所述候选物体区域的预测类型;
    基于所述回归网络和所述全局特征信息,对所述候选物体区域的进行定位,得到所述候选物体区域的预测定位信息。
  9. 如权利要求8所述的物体检测方法,所述特征提取网络包括多个依次连接的集合抽象层,所述分类网络包括多个依次连接的全连接层,所述回归网络包括多个依次连接的全连接层;
    所述通过所述特征提取网络对所述区域特征信息进行特征提取,得到所述候选物体区域的全局特征信息,包括:
    通过所述特征提取网络中各个集合抽象层依次对区域特征信息进行特征提取,得到所述候选物体区域的全局特征信息。
  10. 如权利要求1所述的物体检测方法,所述基于所述初始定位信息、所述预测类型和所述预测定位信息对所述候选物体区域进行优化处理,得到目标物体检测区域及所述目标物体检测区域的定位信息,包括:
    基于所述预测类型对所述候选物体区域进行筛选,得到筛选后物体区域;
    根据所述筛选后物体区域的预测定位信息,对所述筛选后物体区域的初始定位信息进行优化调整,得到所述目标物体检测区域及所述目标物体检测区域定位信息。
  11. 一种物体检测装置,包括:
    检测单元,用于从场景的点云中检测出前景点;
    区域构建单元,用于基于所述前景点和预定尺寸构建所述前景点对应的候选物体区域,确定所述候选物体区域的初始定位信息;
    特征提取单元,用于基于点云网络对所述点云中的所有点进行特征提取,得到所述点云对应的特征集;
    特征构建单元,用于基于所述特征集构建所述候选物体区域的区域特征信息;
    预测单元,用于基于区域预测网络和所述区域特征信息,预测所述候选物体区域的类型和定位信息,得到所述候选物体区域的预测类型和预测定位信息;
    优化单元,用于基于所述初始定位信息、所述预测类型和所述预测定位信息对所述候选物体区域进行优化处理,得到目标物体检测区域以及所述目标物体检测区域的定位信息。
  12. 如权利要求11所述的物体检测装置,所述检测单元,具体用于:
    对所述场景的图像进行语义分割,得到前景像素;
    将所述场景的点云中与所述前景像素对应的点确定为所述前景点。
  13. 如权利要求11所述的物体检测装置,所述区域构建单元,具体用于:
    以所述前景点为中心点,按照所述预定尺寸生成所述前景点对应的候选物体区域。
  14. 如权利要求11所述的物体检测装置,所述特征构建单元,具体包括:
    选择子单元,用于在所述候选物体区域中选择多个目标点;
    提取子单元,用于从所述特征集中提取所述目标点的特征,得到所述候选物体区域的第一部分特征信息;
    构建子单元,用于基于所述目标点的位置信息构建所述候选物体区域的第二部分特征信息;
    融合子单元,用于对所述第一部分特征信息与所述第二部分特征信息进行融合,得到所述候选物体区域的区域特征信息。
  15. 如权利要求14所述的物体检测装置,所述构建子单元,具体用于:
    对所述目标点的位置信息进行标准化处理,得到所述目标点的标准化位置信息;
    对所述第一部分特征信息和所述标准化位置信息进行融合,得到所述目标点的融合后特征信息;
    对所述目标点的融合后特征信息进行空间变换,得到变换后位置信息;
    基于所述变换后位置信息,对所述目标点的标准化位置信息进行调整,得到所述候选物体区域的第二部分特征信息。
  16. 如权利要求11所述的物体检测装置,所述点云网络包括:第一采样网络、与所述第一采样网络连接的第二采样网络;所述特征提取单元,具体包括:
    降采样子单元,用于通过所述第一采样网络对所述点云中的所有点进行特征降采样处理,得到所述点云的初始特征;
    上采样子单元,用于通过所述第二采样网络对所述初始特征进行上采样处理,得到所述点云的特征集。
  17. 如权利要求16所述的物体检测装置,所述第一采样网络包括多个依次连接的集合抽象层,所述第二采样网络包括多个依次连接且与所述第一采样网络中各集合抽象层一一对应的特征传播层;
    所述降采样子单元,具体用于:
    通过所述集合抽象层依次对所述点云中的点进行局部区域划分,并提取局部区域中心点的特征,得到所述点云的初始特征;
    将所述点云的初始特征输入至所述第二采样网络;
    所述上采样子单元,具体用于:
    将上一层的输出特征、以及当前特征传播层对应的集合抽象层的输入特征,确定为当前特征传播层的当前输入特征;
    通过所述当前特征传播层对所述当前输入特征进行上采样处理,得到所述点云的特征集。
  18. 如权利要求11所述的物体检测装置,所述区域预测网络包括特征提取网络、与所述特征提取网络网络连接的分类网络、以及与所述特征提取网络连接的回归网络;所述预测单元,具体包括:
    全局特征提取子单元,用于通过所述特征提取网络对所述区域特征信息进行特征提取,得到所述候选物体区域的全局特征信息;
    分类子单元,用于基于所述分类网络和所述全局特征信息,对所述候选物体区域进行分类,得到所述候选物体区域的预测类型;
    回归子单元,用于基于所述回归网络和所述全局特征信息,对所述候选 物体区域的进行定位,得到所述候选物体区域的预测定位信息。
  19. 如权利要求18所述的物体检测装置,所述特征提取网络包括多个依次连接的集合抽象层,所述分类网络包括多个依次连接的全连接层,所述回归网络包括多个依次连接的全连接层;
    所述全局特征提取子单元,具体用于通过所述特征提取网络中各个集合抽象层依次对所述区域特征信息进行特征提取,得到所述候选物体区域的全局特征信息。
  20. 如权利要求11所述的物体检测装置,所述优化单元,具体包括:
    筛选子单元,用于基于候选物体区域的预测类型对所述候选物体区域进行筛选,得到筛选后物体区域;
    优化子单元,用于根据所述筛选后物体区域的预测定位信息,对所述筛选后物体区域的初始定位信息进行优化调整,得到所述目标物体检测区域及所述目标物体检测区域的定位信息。
  21. 一种存储介质,其特征在于,所述存储介质存储有多条指令,所述指令适于处理器进行加载,以执行权利要求1至10任一项所述的物体检测方法中的步骤。
  22. 一种网络设备,其特征在于,包括存储器和处理器;所述存储器存储有多条指令,所述处理器加载所述存储器内的指令,以执行权利要求1至10任一项所述的物体检测方法中的步骤。
  23. 一种计算机程序产品,包括指令,当其在计算机上运行时,使得计算机执行权利要求1至10任一项中所述的物体检测方法的步骤。
PCT/CN2020/077721 2019-04-03 2020-03-04 一种物体检测方法、装置、网络设备和存储介质 WO2020199834A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910267019.5A CN110032962B (zh) 2019-04-03 2019-04-03 一种物体检测方法、装置、网络设备和存储介质
CN201910267019.5 2019-04-03

Publications (1)

Publication Number Publication Date
WO2020199834A1 true WO2020199834A1 (zh) 2020-10-08

Family

ID=67237387

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/077721 WO2020199834A1 (zh) 2019-04-03 2020-03-04 一种物体检测方法、装置、网络设备和存储介质

Country Status (2)

Country Link
CN (1) CN110032962B (zh)
WO (1) WO2020199834A1 (zh)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633376A (zh) * 2020-12-24 2021-04-09 南京信息工程大学 基于深度学习的点云数据地物分类方法、***与存储介质
CN112766170A (zh) * 2021-01-21 2021-05-07 广西财经学院 基于簇类无人机图像的自适应分割检测方法及装置
CN112862017A (zh) * 2021-04-01 2021-05-28 北京百度网讯科技有限公司 点云数据的标注方法、装置、设备和介质
CN113205531A (zh) * 2021-04-30 2021-08-03 北京云圣智能科技有限责任公司 三维点云分割方法、装置及服务器
CN113240656A (zh) * 2021-05-24 2021-08-10 浙江商汤科技开发有限公司 视觉定位方法及相关装置、设备
CN113256793A (zh) * 2021-05-31 2021-08-13 浙江科技学院 一种三维数据处理方法及***
CN113674348A (zh) * 2021-05-28 2021-11-19 中国科学院自动化研究所 物体抓取方法、装置和***
CN114092478A (zh) * 2022-01-21 2022-02-25 合肥中科类脑智能技术有限公司 一种异常检测方法
CN114359561A (zh) * 2022-01-10 2022-04-15 北京百度网讯科技有限公司 一种目标检测方法及目标检测模型的训练方法、装置
CN114372944A (zh) * 2021-12-30 2022-04-19 深圳大学 一种多模态和多尺度融合的候选区域生成方法及相关装置
CN114549958A (zh) * 2022-02-24 2022-05-27 四川大学 基于上下文信息感知机理的夜间和伪装目标检测方法
CN114820465A (zh) * 2022-04-06 2022-07-29 合众新能源汽车有限公司 点云检测模型训练方法、装置、电子设备及存储介质
WO2023035822A1 (zh) * 2021-09-13 2023-03-16 上海芯物科技有限公司 一种目标检测方法、装置、设备及存储介质
CN115937644A (zh) * 2022-12-15 2023-04-07 清华大学 一种基于全局及局部融合的点云特征提取方法及装置
CN116229388A (zh) * 2023-03-27 2023-06-06 哈尔滨市科佳通用机电股份有限公司 基于目标检测网络的动车异物检测方法、***及设备
CN116912488A (zh) * 2023-06-14 2023-10-20 中国科学院自动化研究所 基于多目相机的三维全景分割方法及装置
CN116912238A (zh) * 2023-09-11 2023-10-20 湖北工业大学 基于多维识别网络级联融合的焊缝管道识别方法及***
CN117475397A (zh) * 2023-12-26 2024-01-30 安徽蔚来智驾科技有限公司 基于多模态传感器的目标标注数据获取方法、介质及设备

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032962B (zh) * 2019-04-03 2022-07-08 腾讯科技(深圳)有限公司 一种物体检测方法、装置、网络设备和存储介质
CN110400304B (zh) * 2019-07-25 2023-12-12 腾讯科技(深圳)有限公司 基于深度学习的物体检测方法、装置、设备及存储介质
JPWO2021024805A1 (zh) * 2019-08-06 2021-02-11
CN110837789B (zh) * 2019-10-31 2023-01-20 北京奇艺世纪科技有限公司 一种检测物体的方法、装置、电子设备及介质
EP4073688A4 (en) * 2019-12-12 2023-01-25 Guangdong Oppo Mobile Telecommunications Corp., Ltd. TARGET DETECTION METHOD, DEVICE, TERMINAL EQUIPMENT, AND MEDIA
CN111144304A (zh) * 2019-12-26 2020-05-12 上海眼控科技股份有限公司 车辆目标检测模型的生成方法、车辆目标检测方法及装置
CN111209840B (zh) * 2019-12-31 2022-02-18 浙江大学 一种基于多传感器数据融合的3d目标检测方法
CN111145174B (zh) * 2020-01-02 2022-08-09 南京邮电大学 基于图像语义特征进行点云筛选的3d目标检测方法
CN110807461B (zh) * 2020-01-08 2020-06-02 深圳市越疆科技有限公司 一种目标位置检测方法
CN111260773B (zh) * 2020-01-20 2023-10-13 深圳市普渡科技有限公司 小障碍物的三维重建方法、检测方法及检测***
CN111340766B (zh) * 2020-02-21 2024-06-11 北京市商汤科技开发有限公司 目标对象的检测方法、装置、设备和存储介质
CN113496160B (zh) * 2020-03-20 2023-07-11 百度在线网络技术(北京)有限公司 三维物体检测方法、装置、电子设备和存储介质
CN111444839B (zh) * 2020-03-26 2023-09-08 北京经纬恒润科技股份有限公司 一种基于激光雷达的目标检测方法及***
CN111578951B (zh) * 2020-04-30 2022-11-08 阿波罗智能技术(北京)有限公司 一种自动驾驶中用于生成信息的方法和装置
CN112215861A (zh) * 2020-09-27 2021-01-12 深圳市优必选科技股份有限公司 一种足球检测方法、装置、计算机可读存储介质及机器人
CN112183330B (zh) * 2020-09-28 2022-06-28 北京航空航天大学 基于点云的目标检测方法
WO2022126523A1 (zh) * 2020-12-17 2022-06-23 深圳市大疆创新科技有限公司 物体检测方法、设备、可移动平台及计算机可读存储介质
CN112598635B (zh) * 2020-12-18 2024-03-12 武汉大学 一种基于对称点生成的点云3d目标检测方法
CN112734931B (zh) * 2020-12-31 2021-12-07 罗普特科技集团股份有限公司 一种辅助点云目标检测的方法及***
CN113312983B (zh) * 2021-05-08 2023-09-05 华南理工大学 基于多模态数据融合的语义分割方法、***、装置及介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150339541A1 (en) * 2014-05-22 2015-11-26 Nokia Technologies Oy Point cloud matching method
CN108010036A (zh) * 2017-11-21 2018-05-08 江南大学 一种基于rgb-d相机的物体对称轴检测方法
CN109242951A (zh) * 2018-08-06 2019-01-18 宁波盈芯信息科技有限公司 一种脸部实时三维重建方法
CN109345510A (zh) * 2018-09-07 2019-02-15 百度在线网络技术(北京)有限公司 物体检测方法、装置、设备、存储介质及车辆
CN109543601A (zh) * 2018-11-21 2019-03-29 电子科技大学 一种基于多模态深度学习的无人车目标检测方法
CN110032962A (zh) * 2019-04-03 2019-07-19 腾讯科技(深圳)有限公司 一种物体检测方法、装置、网络设备和存储介质

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017155970A1 (en) * 2016-03-11 2017-09-14 Kaarta, Inc. Laser scanner with real-time, online ego-motion estimation
CN109410238B (zh) * 2018-09-20 2021-10-26 中国科学院合肥物质科学研究院 一种基于PointNet++网络的枸杞识别计数方法
CN109410307B (zh) * 2018-10-16 2022-09-20 大连理工大学 一种场景点云语义分割方法
CN109523552B (zh) * 2018-10-24 2021-11-02 青岛智能产业技术研究院 基于视锥点云的三维物体检测方法

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150339541A1 (en) * 2014-05-22 2015-11-26 Nokia Technologies Oy Point cloud matching method
CN108010036A (zh) * 2017-11-21 2018-05-08 江南大学 一种基于rgb-d相机的物体对称轴检测方法
CN109242951A (zh) * 2018-08-06 2019-01-18 宁波盈芯信息科技有限公司 一种脸部实时三维重建方法
CN109345510A (zh) * 2018-09-07 2019-02-15 百度在线网络技术(北京)有限公司 物体检测方法、装置、设备、存储介质及车辆
CN109543601A (zh) * 2018-11-21 2019-03-29 电子科技大学 一种基于多模态深度学习的无人车目标检测方法
CN110032962A (zh) * 2019-04-03 2019-07-19 腾讯科技(深圳)有限公司 一种物体检测方法、装置、网络设备和存储介质

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633376A (zh) * 2020-12-24 2021-04-09 南京信息工程大学 基于深度学习的点云数据地物分类方法、***与存储介质
CN112766170A (zh) * 2021-01-21 2021-05-07 广西财经学院 基于簇类无人机图像的自适应分割检测方法及装置
CN112766170B (zh) * 2021-01-21 2024-04-16 广西财经学院 基于簇类无人机图像的自适应分割检测方法及装置
CN112862017B (zh) * 2021-04-01 2023-08-01 北京百度网讯科技有限公司 点云数据的标注方法、装置、设备和介质
CN112862017A (zh) * 2021-04-01 2021-05-28 北京百度网讯科技有限公司 点云数据的标注方法、装置、设备和介质
CN113205531A (zh) * 2021-04-30 2021-08-03 北京云圣智能科技有限责任公司 三维点云分割方法、装置及服务器
CN113205531B (zh) * 2021-04-30 2024-03-08 北京云圣智能科技有限责任公司 三维点云分割方法、装置及服务器
CN113240656A (zh) * 2021-05-24 2021-08-10 浙江商汤科技开发有限公司 视觉定位方法及相关装置、设备
CN113674348A (zh) * 2021-05-28 2021-11-19 中国科学院自动化研究所 物体抓取方法、装置和***
CN113674348B (zh) * 2021-05-28 2024-03-15 中国科学院自动化研究所 物体抓取方法、装置和***
CN113256793A (zh) * 2021-05-31 2021-08-13 浙江科技学院 一种三维数据处理方法及***
WO2023035822A1 (zh) * 2021-09-13 2023-03-16 上海芯物科技有限公司 一种目标检测方法、装置、设备及存储介质
CN114372944B (zh) * 2021-12-30 2024-05-17 深圳大学 一种多模态和多尺度融合的候选区域生成方法及相关装置
CN114372944A (zh) * 2021-12-30 2022-04-19 深圳大学 一种多模态和多尺度融合的候选区域生成方法及相关装置
CN114359561A (zh) * 2022-01-10 2022-04-15 北京百度网讯科技有限公司 一种目标检测方法及目标检测模型的训练方法、装置
CN114092478A (zh) * 2022-01-21 2022-02-25 合肥中科类脑智能技术有限公司 一种异常检测方法
CN114549958A (zh) * 2022-02-24 2022-05-27 四川大学 基于上下文信息感知机理的夜间和伪装目标检测方法
CN114549958B (zh) * 2022-02-24 2023-08-04 四川大学 基于上下文信息感知机理的夜间和伪装目标检测方法
CN114820465A (zh) * 2022-04-06 2022-07-29 合众新能源汽车有限公司 点云检测模型训练方法、装置、电子设备及存储介质
CN114820465B (zh) * 2022-04-06 2024-04-26 合众新能源汽车股份有限公司 点云检测模型训练方法、装置、电子设备及存储介质
CN115937644B (zh) * 2022-12-15 2024-01-02 清华大学 一种基于全局及局部融合的点云特征提取方法及装置
CN115937644A (zh) * 2022-12-15 2023-04-07 清华大学 一种基于全局及局部融合的点云特征提取方法及装置
CN116229388B (zh) * 2023-03-27 2023-09-12 哈尔滨市科佳通用机电股份有限公司 基于目标检测网络的动车异物检测方法、***及设备
CN116229388A (zh) * 2023-03-27 2023-06-06 哈尔滨市科佳通用机电股份有限公司 基于目标检测网络的动车异物检测方法、***及设备
CN116912488B (zh) * 2023-06-14 2024-02-13 中国科学院自动化研究所 基于多目相机的三维全景分割方法及装置
CN116912488A (zh) * 2023-06-14 2023-10-20 中国科学院自动化研究所 基于多目相机的三维全景分割方法及装置
CN116912238B (zh) * 2023-09-11 2023-11-28 湖北工业大学 基于多维识别网络级联融合的焊缝管道识别方法及***
CN116912238A (zh) * 2023-09-11 2023-10-20 湖北工业大学 基于多维识别网络级联融合的焊缝管道识别方法及***
CN117475397A (zh) * 2023-12-26 2024-01-30 安徽蔚来智驾科技有限公司 基于多模态传感器的目标标注数据获取方法、介质及设备
CN117475397B (zh) * 2023-12-26 2024-03-22 安徽蔚来智驾科技有限公司 基于多模态传感器的目标标注数据获取方法、介质及设备

Also Published As

Publication number Publication date
CN110032962B (zh) 2022-07-08
CN110032962A (zh) 2019-07-19

Similar Documents

Publication Publication Date Title
WO2020199834A1 (zh) 一种物体检测方法、装置、网络设备和存储介质
WO2020207166A1 (zh) 一种物体检测方法、装置、电子设备和存储介质
US10078790B2 (en) Systems for generating parking maps and methods thereof
Du et al. Car detection for autonomous vehicle: LIDAR and vision fusion approach through deep learning framework
US9142011B2 (en) Shadow detection method and device
JP6514192B2 (ja) ライダに基づいたオブジェクト移動の分類
CN111951212A (zh) 对铁路的接触网图像进行缺陷识别的方法
CN114708585A (zh) 一种基于注意力机制的毫米波雷达与视觉融合的三维目标检测方法
CN113378686B (zh) 一种基于目标中心点估计的两阶段遥感目标检测方法
CN110222686B (zh) 物体检测方法、装置、计算机设备和存储介质
CN111368600A (zh) 遥感图像目标检测识别方法、装置、可读存储介质及设备
Zhong et al. Multi-scale feature fusion network for pixel-level pavement distress detection
US20230102467A1 (en) Method of detecting image, electronic device, and storage medium
KR101907883B1 (ko) 객체 검출 및 분류 방법
CN113706480A (zh) 一种基于关键点多尺度特征融合的点云3d目标检测方法
CN110807362A (zh) 一种图像检测方法、装置和计算机可读存储介质
CN115731355B (zh) 一种基于SuperPoint-NeRF的三维建筑物重建方法
CN112733815B (zh) 一种基于rgb室外道路场景图像的红绿灯识别方法
CN113033516A (zh) 对象识别统计方法及装置、电子设备、存储介质
CN115147745A (zh) 一种基于城市无人机图像的小目标检测方法
Pellis et al. Assembling an image and point cloud dataset for heritage building semantic segmentation
Drobnitzky et al. Survey and systematization of 3D object detection models and methods
CN113281780A (zh) 对图像数据进行标注的方法、装置及电子设备
Chaturvedi et al. Small object detection using retinanet with hybrid anchor box hyper tuning using interface of Bayesian mathematics
Zhao et al. DHA: Lidar and vision data fusion-based on road object classifier

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20783617

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20783617

Country of ref document: EP

Kind code of ref document: A1