CN116958927A - Method and device for identifying short columns based on BEV (bird's-eye-view) maps - Google Patents

Method and device for identifying short columns based on BEV (bird's-eye-view) maps

Info

Publication number
CN116958927A
Authority
CN
China
Prior art keywords
bev
image
mask
original image
short
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311089079.5A
Other languages
Chinese (zh)
Inventor
高三元
李敬
卢绪鹏
李博文
闫卫坡
王志斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Kuandeng Zhiyun Technology Co ltd
China United Network Communications Corp Ltd Ningbo Branch
Original Assignee
Guizhou Kuandeng Zhiyun Technology Co ltd
China United Network Communications Corp Ltd Ningbo Branch
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Kuandeng Zhiyun Technology Co ltd and China United Network Communications Corp Ltd Ningbo Branch
Priority to CN202311089079.5A
Publication of CN116958927A
Legal status: Pending

Classifications

    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 20/588 Recognition of the road, e.g. of lane markings; Recognition of the vehicle driving pattern in relation to the road
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/0499 Feedforward networks
    • G06N 3/08 Learning methods
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/40 Extraction of image or video features
    • G06V 10/764 Image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/82 Image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the technical field of short-column recognition and solves the technical problem in the prior art that short columns on both sides of a road are difficult to identify and detect. It relates in particular to a method and a device for recognizing short columns based on a BEV (bird's-eye-view) map, wherein the method comprises the following steps: S1, acquiring an image of any section of road photographed by a camera as an original image; S2, determining a target spatial range on the original image, further determining the ground expression of the target spatial range, and generating a BEV map of the target space. The application addresses the detection difficulty caused by columns that are short, or that appear short in the original image: after the BEV map is generated they are stretched and lengthened, so the difficulty of detection on the BEV map is greatly reduced and the recognition accuracy of short columns is improved.

Description

Method and device for identifying short columns based on BEV (bird's-eye-view) maps
Technical Field
The application relates to the technical field of recognition of short columns, in particular to a method and a device for recognizing short columns based on BEV images.
Background
In the map updating process, it is often necessary to update road information. In the field of automatic driving as well, real-time acquisition of road condition information is also required. The two sides of the road are usually provided with a plurality of posts, milestones and the like, which are important constituent elements of the road information and need to be accurately identified. The identification of such pillars is currently based mainly on convolutional neural networks CNN.
In the prior art, the original image is directly input into the network to recognize the columnar bodies on both sides of the road. This approach can have difficulty recognizing small objects: when an object is small, the AP (average precision) of its recognition is low. The main reason is that a small object occupies a small visual range, so with limited training data the recognition effect may be poor. A perspective image also exhibits the near-large, far-small effect: in the physical world, a short column far from the camera appears even shorter in the image, which increases the detection difficulty.
Disclosure of Invention
Aiming at the defects of the prior art, the application provides a method and a device for identifying short columns based on BEV (bird's-eye-view) maps, which solve the technical problem that short columns on both sides of a road are difficult to identify and detect in the prior art.
In order to solve the technical problems, the application provides the following technical scheme: a method of identifying short columns based on BEV maps, the method comprising the steps of:
S1, acquiring an image of any section of road photographed by a camera as an original image;
S2, determining a target spatial range on the original image, further determining the ground expression of the target spatial range, and generating a BEV image of the target space, wherein the BEV image is a bird's-eye-view image;
S3, identifying the BEV image by adopting a recognition model, and projecting the top points and bottom points of the short columns in the BEV image and an accurate boundary Mask onto the original image to obtain the recognition result of the short columns in the original image, wherein the recognition model comprises a Mask-R-CNN network and a DETR model;
the Mask-R-CNN network is used for extracting the short columns in the BEV image and the corresponding accurate boundary Mask;
the DETR model is used for identifying the top points and bottom points representing the short columns in the BEV image;
S4, outputting the recognition result of the short columns in the image.
Further, in step S2, the specific process includes the following steps:
S21, determining outward-expanded boundaries from the two side boundaries of the road contained in the original image, typically expanding 1 m to 2 m outward with the road boundary as the starting point;
S22, taking the distance between the two expanded boundaries as the width range w of the BEV image, taking the distance h extending forward from the current camera position as the height range of the BEV image, and denoting the Z coordinate of all pixel points in the target spatial range as the ground expression -H, wherein the resolution of the BEV image is determined by the width range and the height range combined with the distance represented by each pixel;
S23, defining the BEV grid, wherein the width and height of the BEV grid area are width and height respectively;
S24, calculating the pixel coordinates in the original image of each point in the BEV grid according to the correspondence between world-coordinate-system coordinates and pixel coordinates, sampling with bilinear interpolation to obtain the RGB color value of each point, and generating the BEV image.
Further, in step S22, determining the ground expression includes two ways, respectively:
assuming the road surface is a plane, the Z coordinate of all points in the target spatial range is -H, i.e., the ground is expressed as Z = -H, where H is the camera height;
obtaining the accurate ground by using a ground-extraction algorithm, wherein the Z coordinates of all points in the target spatial range are the true values.
Further, in step S3, the Mask-R-CNN network is used to extract the short columns in the BEV graph and the corresponding accurate boundary Mask, and the specific process includes the following steps:
S311, inputting the BEV image into a pre-trained neural network to obtain the corresponding feature map, wherein the pre-trained neural network includes, but is not limited to, ResNet, RegNet and HRNet;
S312, setting a preset region of interest (ROI), namely a columnar body, for each point in the feature map, so as to obtain a plurality of candidate ROIs;
S313, sending the candidate ROIs into an RPN (Region Proposal Network) for binary classification and bounding-box (BB) regression, and filtering out part of the candidate ROIs;
S314, performing the RoIAlign operation on the ROIs remaining after filtering with their bounding-box regression results;
S315, performing N-class classification, BB regression and accurate boundary mask generation on the candidate ROIs subjected to the RoIAlign operation.
Further, the loss function L of the Mask-R-CNN network is expressed as:
L = L_cls + L_box + L_mask
where L_cls is the classification loss, L_box is the bounding-box loss, and L_mask is a binary cross-entropy loss.
Further, in step S3, the DETR model is used to identify the top points and bottom points representing the short columns in the BEV map, and the specific process includes the following steps:
S321, inputting the short columns and the corresponding accurate boundary mask into the DETR model to obtain the feature matrix of the BEV image;
S322, flattening the BEV feature matrix and adding position codes;
S323, inputting the sequence obtained in step S322 into the Transformer encoder to learn the correlation information among the features;
S324, taking the output of the encoder and the object queries as the input of the decoder to obtain the decoded information;
S325, feeding the decoded information into the feedforward neural network FFN to obtain the prediction information;
S326, judging whether the prediction information output by the feedforward neural network FFN contains a columnar target object;
if so, outputting the top points and bottom points corresponding to all columnar objects;
if not, outputting the 'no object' class.
By means of the above technical scheme, the application provides a method and a device for identifying short columns based on BEV (bird's-eye-view) maps, which have at least the following beneficial effects:
1. Unlike the common approach of performing target detection directly on the original image, the application first deforms the original image to generate a BEV map. A short column, or a column that appears short in the original image, is hard to detect there; after the BEV map is generated it is stretched and lengthened, so the difficulty of detection on the BEV map is greatly reduced and the recognition accuracy of short columns is improved.
2. The application defines the target spatial range on the original image and expands both sides of the road boundary outward, which effectively contains all target columns on both sides of the road and gives full coverage of columns such as road piles and milestones contained in an actual road scene, so that short columns in the scene are recognized effectively.
3. The application extracts the short columns in the BEV map and the corresponding accurate boundary Mask as features through the Mask-R-CNN network, and then adopts the DETR model to identify the top and bottom points in those features. Since a column has a simple structure, representing it by its top and bottom points reduces redundant information; the DETR-based algorithm outputs the top and bottom points of a column directly, giving a convenient and concise representation for column recognition.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flow chart of the method of identifying short columns according to the present application;
FIG. 2 is a schematic illustration of defining a target spatial range in an original image according to the present application;
FIG. 3 is a width and height schematic of a BEV plot of the present application;
FIG. 4 is a BEV plot generated from a target spatial range in an original image in accordance with the present application;
FIG. 5 is a network architecture diagram of the Mask-R-CNN network of the present application;
FIG. 6 is a diagram of a network architecture of the DETR model of the present application;
FIG. 7 is a graph of the recognition results of the top and bottom points of the short columns in the BEV image according to the present application;
FIG. 8 is a graph of the recognition results of the top and bottom points of the short columns in the original image according to the present application;
FIG. 9 is a block diagram showing the structure of the recognition device for a short column body according to the present application.
In the figure: 10. an original image acquisition module; 20. a BEV diagram generation module; 30. an identification module; 40. and an output module.
Detailed Description
In order that the above-recited objects, features and advantages of the present application may become more readily apparent, a more particular description of the application is given below with reference to the accompanying drawings and the detailed embodiments, so that the process of applying the technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented.
In the map updating process, it is often necessary to update road information. In the field of automatic driving as well, real-time acquisition of road condition information is also required. The two sides of the road are usually provided with a plurality of posts, milestones and the like, which are important constituent elements of the road information and need to be accurately identified. At present, the recognition of the columnar body is mainly based on a convolutional neural network CNN, and the following schemes are generally adopted:
1. Using a single-stage or multi-stage neural network, such as YOLO or the Faster-R-CNN series of models, to output the bounding box of the object.
2. On the basis of 1, adding the capability of segmenting object instances, so that object bounding boxes are identified and an accurate boundary mask, i.e., a Mask, is generated for each object at the same time.
3. Abstracting the expression of the object as a polygon or point-line representation, and directly outputting the expression of the object in a DETR-like manner.
All the schemes above directly input the original image (data collected by outdoor autonomous-driving vehicles and map-collection vehicles) into a network to identify objects, and may have difficulty identifying small objects: when an object is small, the AP (average precision) of its recognition is low. The main reasons are that a small object occupies a small visual range, so the recognition effect may be poor when training data is limited, and that a perspective image exhibits the near-large, far-small effect: in the physical world, a short column far from the camera appears even shorter in the image, which increases the detection difficulty.
The method addresses the problem that a column far from the camera images small on the original image: the original image is deformed so that such columns are effectively stretched and amplified, which facilitates network learning and improves the precision and recall of object recognition.
Referring to figs. 1-9, a specific implementation of this embodiment is shown. In this embodiment, a target spatial range is defined on the original image, and a corresponding BEV image is generated by deforming the area covered by that range. This stretches objects far from the camera so that columns in the image can be identified more easily; at the same time, objects of the same physical size stay the same size in the BEV image, without the near-large, far-small perspective effect. After the recognition result of the columns in the BEV image is obtained, the recognition result of the short columns in the original image is obtained through the correspondence between the BEV image and the original image.
Referring to fig. 1, the present embodiment provides a method for identifying a dwarf column based on a BEV chart, which includes the following steps:
s1, acquiring an image of any road section taken by a camera as an original image, wherein the image of the original image can be taken by the camera or other shooting equipment, the taken image needs to be clearly visible, and the image comprises scenes outside two sides of the road, such as the image shown in fig. 2.
S2, determining a target spatial range on the original image, further determining the ground expression of the target spatial range, and generating a BEV image of the target space; BEV stands for Bird's-Eye View, i.e., a bird's-eye perspective, as shown in FIG. 4.
As shown in fig. 2, the target spatial range determined in the original image is a rectangular area; in fig. 2, the boundary of the target spatial range is obtained by extending the boundaries on both sides of the road outward by 1 m to 2 m. Defining the target spatial range by expanding outward from both road boundaries effectively contains all target columns on both sides of the road and gives full coverage of columns such as road piles and milestones contained in an actual road scene, so that short columns in the scene can be recognized effectively. In step S2, the specific process comprises the following steps:
S21, determining outward-expanded boundaries from the two side boundaries of the road contained in the original image, typically expanding 1 m to 2 m outward with the road boundary as the starting point.
S22, taking the distance between the two expanded boundaries as the width range w of the BEV image, and the distance h extending forward from the current camera position as its height range. The Z coordinate of every point within the target spatial range is denoted -H (the ground expression). The resolution of the BEV image is determined by the width range and the height range combined with the distance represented by each pixel.
For example, if r denotes the physical length covered by each pixel, then width = w/r and height = h/r for the BEV image. For each pixel (u, v) on the BEV image, the corresponding spatial coordinate is (min_x + u×r, min_y + v×r, -H), where min_x and min_y are the minima of the x and y coordinates of the target spatial range. Determining the ground expression can be done in two ways:
assuming the road surface is a plane, the Z coordinate of all points in the target spatial range is -H, i.e., the ground is expressed as Z = -H, where H is the camera height;
alternatively, a ground-extraction algorithm is used to obtain the accurate ground, and the Z coordinate of each point in the target spatial range is then the true value Z = f(x, y), i.e., each ground point is (x, y, f(x, y)). These coordinates come from the ground-extraction algorithm and are no longer a fixed value, but are still centered on the camera. When this condition can be satisfied, using true values expresses the ground more accurately, so either ground expression can be chosen according to the actual situation.
S23, defining BEV grids, wherein the width and height of the BEV grid area are width and height respectively, as shown in FIG. 3.
S24, calculating pixel coordinates of each point in the BEV image in the original image according to the corresponding relation between the world coordinate system coordinates and the pixel coordinates, sampling by using a bilinear interpolation method to obtain RGB color values of each point, and generating the BEV image.
In step S24, the relationship between a three-dimensional spatial point and a two-dimensional image pixel can be expressed as:
z·[u, v, 1]^T = K·(R·X + T)   (1)
where X denotes the world-coordinate-system coordinates, R denotes the rotation matrix of the camera coordinate system relative to the world coordinate system, T denotes the translation vector of the camera coordinate system relative to the world coordinate system, K denotes the camera intrinsics, u and v denote pixel coordinates, and z is the z coordinate of the three-dimensional point in the camera coordinate system.
The camera coordinate system takes the optical center of the camera as the origin and the forward direction of the optical axis as the Z axis, with the X and Y axes parallel to the image plane, X pointing right and Y pointing down, following the right-hand rule.
The world coordinate system is defined in three-dimensional space to describe the positions of the camera and other objects; a right-hand coordinate system is usually used, and the relationship between the camera coordinate system and the world coordinate system can be described by a rotation matrix and a translation vector.
Since the pixel physical coordinate system of the original image has its x axis pointing right, y axis down and z axis forward, while the pixel physical coordinate system of the BEV image has x right, y forward and z up, and the two coordinate systems share the same origin, there is only a rotation R and no translation T, i.e., T = 0. The corresponding R and T are:
R = [[1, 0, 0], [0, 0, -1], [0, 1, 0]], T = (0, 0, 0)^T
Here the "camera coordinate system" of formula (1) is the pixel physical coordinate system of the original image, and the "world coordinate system" is the pixel physical coordinate system of the BEV image.
The camera intrinsic matrix K is:
K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]
where f_x and f_y are the camera focal lengths and c_x and c_y are the principal-point coordinates, in pixels.
According to the correspondence between world coordinates and pixel coordinates, the pixel coordinates in the original image of each point in the BEV grid are calculated through formula (1), and bilinear interpolation is used for sampling to obtain the RGB color value of each grid point, generating the BEV image. Fig. 4 shows the BEV image generated by the above method for the target spatial range in fig. 2. In this embodiment, unlike the common approach of performing target detection directly on the original image, the original image is deformed first to generate the BEV image: a column far from the camera is short in the original image and difficult to detect, while in the BEV image it is stretched and lengthened, so the detection difficulty on the BEV image is reduced and the recognition accuracy of short columns is improved.
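To make steps S21-S24 concrete, the following is a minimal sketch of BEV generation under the flat-ground expression Z = -H, using the R (with T = 0) and K given above. The function name, argument names and the use of OpenCV's remap for the bilinear sampling are illustrative choices, not prescribed by this embodiment.

```python
import cv2
import numpy as np

def generate_bev(original, K, H, min_x, min_y, w, h, r):
    """Warp a road image into a BEV image (sketch; flat ground at Z = -H)."""
    width, height = int(round(w / r)), int(round(h / r))     # BEV resolution (S22)
    # Rotation from BEV axes (x right, y forward, z up) to original-image
    # axes (x right, y down, z forward); the translation T is zero.
    R = np.array([[1, 0, 0],
                  [0, 0, -1],
                  [0, 1, 0]], dtype=np.float64)
    u, v = np.meshgrid(np.arange(width), np.arange(height))  # BEV grid (S23)
    world = np.stack([min_x + u * r,                          # X = min_x + u*r
                      min_y + v * r,                          # Y = min_y + v*r
                      np.full(u.shape, -H, dtype=np.float64)], axis=-1)
    cam = world @ R.T                 # into the original image's coordinate frame
    pix = cam @ K.T                   # pinhole projection, formula (1)
    map_x = (pix[..., 0] / pix[..., 2]).astype(np.float32)
    map_y = (pix[..., 1] / pix[..., 2]).astype(np.float32)
    # Bilinear sampling of the RGB value for every BEV grid point (S24).
    return cv2.remap(original, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```

Swapping the constant -H for per-point ground heights f(x, y) gives the second ground expression described above.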
S3, identifying the BEV image by adopting a recognition model, and projecting the top and bottom points of the short columns in the BEV image and the accurate boundary Mask onto the original image to obtain the recognition result of the short columns in the original image, wherein the recognition model comprises a Mask-R-CNN network and a DETR model. The Mask-R-CNN network is used for extracting the short columns in the BEV image and the corresponding accurate boundary Mask; fig. 5 shows the network architecture of the Mask-R-CNN network. Mask-R-CNN is an instance-segmentation algorithm and an extension of Faster-R-CNN: it outputs a high-quality instance-segmentation mask while detecting the target. Mask-R-CNN has the advantages of high speed, high accuracy, simplicity and intuitiveness, and currently performs well in target detection, target instance segmentation and target key-point detection.
Mask-R-CNN adopts the same two-stage strategy as Faster-R-CNN. Unlike Faster-R-CNN, whose multi-task loss uses classification and regression only, Mask-R-CNN adds a mask loss for semantic segmentation on that basis, so the loss function L of the Mask-R-CNN network can be expressed as:
L = L_cls + L_box + L_mask
where L_cls is the classification loss, L_box is the bounding-box loss, and L_mask is a binary cross-entropy loss (a sketch of this combined loss follows the step list below). The Mask-R-CNN network is used for extracting the short columns in the BEV image and the corresponding accurate boundary mask, and the specific process comprises the following steps:
S311, inputting the BEV image into a pre-trained neural network to obtain the corresponding feature map, wherein the pre-trained neural network includes, but is not limited to, ResNet, RegNet and HRNet;
s312, setting a preset region of interest (ROI), namely a columnar body, for each point in the feature map so as to obtain a plurality of candidate ROIs;
S313, sending the candidate ROIs into the RPN (Region Proposal Network) for binary classification (foreground or background) and bounding-box (BB) regression, and filtering out part of the candidate ROIs;
S314, performing the RoIAlign operation on the ROIs remaining after filtering with their bounding-box regression results: first the pixels of the BEV image are associated with the pixels of the feature map, and then the feature map is associated with a fixed-size feature. The RoIAlign operation cancels the quantization step: in the region-feature aggregation, image values at pixel positions with floating-point coordinates are obtained by bilinear interpolation, turning the whole feature-aggregation process into a continuous operation. Concretely, the region is first interpolated to 14 x 14 by bilinear interpolation and then pooled down, which avoids the misalignment caused by direct quantized pooling alone; the RoIAlign operation therefore aligns the pixels and meets the accuracy requirement of image semantic segmentation.
S315, performing N-class classification, BB regression and accurate boundary mask generation on the candidate ROIs subjected to the RoIAlign operation; specifically, an FCN operation is performed within each candidate ROI, i.e., a convolution through a fully convolutional network (FCN), so as to generate the accurate boundary mask.
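As a sketch of the combined loss L = L_cls + L_box + L_mask defined above, assuming the conventional Faster-R-CNN/Mask-R-CNN choices of cross-entropy, smooth-L1 and binary cross-entropy (the embodiment names the terms but not their exact forms, and all tensor names are illustrative):

```python
import torch.nn.functional as F

def mask_rcnn_loss(cls_logits, labels, box_deltas, box_targets,
                   mask_logits, mask_targets):
    """L = L_cls + L_box + L_mask (sketch; argument names illustrative)."""
    L_cls = F.cross_entropy(cls_logits, labels)            # classification loss
    L_box = F.smooth_l1_loss(box_deltas, box_targets)      # bounding-box loss
    # Binary cross-entropy over the predicted mask of the ground-truth class.
    L_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)
    return L_cls + L_box + L_mask
```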
The Mask-R-CNN network can effectively detect the ROIs in the BEV image and simultaneously generate a high-quality accurate boundary Mask for each ROI, improving the precision and speed of feature extraction for short columns. The generated accurate boundary Mask has the same size and pixel-value range as the original image, for subsequent processing and analysis.
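One way to instantiate this stage is torchvision's off-the-shelf Mask R-CNN; this is an illustrative assumption (the embodiment does not name a library), and in practice the model would be fine-tuned on BEV images with a column class:

```python
import torch
import torchvision

# Mask R-CNN with a ResNet-50 + FPN backbone (fine-tuning on column data assumed).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

bev = torch.rand(3, 800, 800)            # a BEV image as a CHW tensor in [0, 1]
with torch.no_grad():
    out = model([bev])[0]                # dict with boxes, labels, scores, masks
keep = out["scores"] > 0.5               # simple confidence filtering
column_masks = out["masks"][keep]        # per-instance boundary masks
```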
The Mask-R-CNN network adopts the FPN feature pyramid. In earlier detectors such as Faster-R-CNN, the ROI is taken from the last feature layer only, which is adequate for large-target detection but not precise enough for small targets: for a small object, by the time convolution and pooling reach the last layer there is practically no semantic information left, because the ROI is mapped onto the feature map by dividing the underlying coordinates by the stride, and a small region maps to little or nothing on that map. To solve this multi-scale detection problem, a feature pyramid network is introduced: FPN naturally exploits the hierarchy of CNN layers, fusing high-resolution shallow layers with high-level semantic features.
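For reference, torchvision exposes this fusion directly as FeaturePyramidNetwork; the channel counts below match a ResNet-50 backbone and are otherwise illustrative:

```python
from collections import OrderedDict

import torch
from torchvision.ops import FeaturePyramidNetwork

fpn = FeaturePyramidNetwork(in_channels_list=[256, 512, 1024, 2048],
                            out_channels=256)
feats = OrderedDict(
    c2=torch.randn(1, 256, 200, 200),    # shallow, high-resolution
    c3=torch.randn(1, 512, 100, 100),
    c4=torch.randn(1, 1024, 50, 50),
    c5=torch.randn(1, 2048, 25, 25))     # deep, semantically strong
pyramid = fpn(feats)  # every level now has 256 channels with fused semantics
```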
The Mask-R-CNN network adopts RoIAlign. Suppose a region proposal of size 665 x 665 exists in the original image; mapped onto the feature map (stride 32), its size is 665/32 = 20.78, i.e., 20.78 x 20.78. At this point no rounding is performed, as RoIPooling would do; the floating-point number is kept. Bilinear interpolation is adopted because, if RoIPooling is asked to output 7 x 7 while the ROI output by the RPN is 8 x 8, the input pixels and output pixels cannot be placed in one-to-one correspondence: first, the amounts of information they contain differ, some output cells covering one input pixel and others two; second, the output coordinates cannot be made to correspond to the input.
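The 665/32 = 20.78 example can be reproduced with torchvision's roi_align op, which keeps the floating-point region instead of rounding; the tensor sizes here are illustrative:

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 32, 32)       # feature map at stride 32
# One 665 x 665 proposal in image coordinates: 665 / 32 = 20.78 on the
# feature map, which RoIAlign keeps as a floating-point region.
boxes = torch.tensor([[0.0, 0.0, 0.0, 665.0, 665.0]])  # (batch_idx, x1, y1, x2, y2)
pooled = roi_align(feat, boxes, output_size=(14, 14),
                   spatial_scale=1.0 / 32, sampling_ratio=2, aligned=True)
print(pooled.shape)                      # torch.Size([1, 256, 14, 14])
```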
The Mask-R-CNN network introduces a semantic-segmentation branch, decoupling the relationship between mask and class prediction: the mask branch only performs semantic segmentation, while the task of class prediction is handed to another branch. This differs from the original FCN network, which predicts the class of the mask at the same time as predicting the mask, i.e., it would also implement the N-class classification of step S315.
As shown in fig. 6, the network architecture of the DETR model. DETR (DEtection TRansformer) is a Transformer-based end-to-end object-detection network proposed by Facebook and published at ECCV 2020. Since the Transformer was proposed in 2017, its use in the field of target detection has spread rapidly. DETR removes the NMS and anchor designs, greatly simplifying the target-detection pipeline.
In this embodiment, the DETR model represents a column by a top point and a bottom point to complete detection and recognition. The DETR model uses the Hungarian algorithm for matching, and the loss function L for training the DETR model is expressed as:
L = λ·L_cls + α·L_P2P
where L_cls is the classification loss, and λ and α both take the value 0.5.
Here P denotes polylines: each column is represented by a line segment, i.e., its top point, its bottom point and the connecting line between the two end points. S denotes the set of N columns, c_i denotes any one column, v̂_j denotes the predicted value of each column top or bottom point, v_j denotes the true value of each column top or bottom point, and D_Manhattan denotes the Manhattan distance; L_P2P accumulates the Manhattan distances between matched predicted and true end points.
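A minimal sketch of this matched point-to-point loss, using scipy's linear_sum_assignment as one standard Hungarian-algorithm implementation (the embodiment does not prescribe a library; all names are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def detr_point_loss(pred_pts, gt_pts, L_cls, lam=0.5, alpha=0.5):
    """L = lam * L_cls + alpha * L_P2P for (N, 4) predictions and (M, 4)
    ground truths, each row being (top_x, top_y, bottom_x, bottom_y)."""
    # Pairwise Manhattan distance between predicted and true end-point pairs.
    cost = np.abs(pred_pts[:, None, :] - gt_pts[None, :, :]).sum(axis=-1)
    rows, cols = linear_sum_assignment(cost)   # Hungarian matching
    L_p2p = cost[rows, cols].sum()             # distances of matched pairs
    return lam * L_cls + alpha * L_p2p
```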
The DETR model is used to identify the top and bottom points representing the short columns in the BEV image; the specific process comprises the following steps:
S321, inputting the short columns and the corresponding accurate boundary mask into the DETR model to obtain the feature matrix of the BEV image;
S322, flattening the BEV feature matrix and adding position codes;
S323, inputting the sequence obtained in step S322 into the Transformer encoder to learn the correlation information among the features;
S324, taking the output of the encoder and the object queries as the input of the decoder to obtain the decoded information;
S325, feeding the decoded information into the feedforward neural network FFN to obtain the prediction information;
S326, judging whether the prediction information output by the feedforward neural network FFN contains a columnar target object;
if so, outputting the top points and bottom points corresponding to all columnar objects;
if not, outputting the 'no object' class.
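Steps S321-S326 follow the standard DETR flow; the sketch below compresses it into a single module, with every dimension, the stand-in backbone and the two-class head being illustrative assumptions:

```python
import torch
import torch.nn as nn

class ColumnDETR(nn.Module):
    """DETR-style head predicting column top/bottom points (dims illustrative)."""
    def __init__(self, d=256, n_queries=100):
        super().__init__()
        self.backbone = nn.Conv2d(3, d, kernel_size=16, stride=16)  # stand-in CNN (S321)
        self.pos = nn.Parameter(torch.randn(1, 50 * 50, d))         # position codes (S322)
        self.transformer = nn.Transformer(d, batch_first=True)      # encoder+decoder (S323-S324)
        self.queries = nn.Parameter(torch.randn(1, n_queries, d))   # object queries
        self.cls_head = nn.Linear(d, 2)   # column vs. "no object" (S326)
        self.pts_head = nn.Linear(d, 4)   # (top_x, top_y, bottom_x, bottom_y)

    def forward(self, bev):               # bev: (B, 3, 800, 800)
        f = self.backbone(bev).flatten(2).transpose(1, 2)   # flatten (S322)
        f = f + self.pos[:, :f.shape[1]]                    # add position codes
        hs = self.transformer(f, self.queries.expand(bev.shape[0], -1, -1))
        return self.cls_head(hs), self.pts_head(hs)         # FFN outputs (S325)
```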
Fig. 7 shows the top and bottom points of the short columns in the BEV image identified by the DETR model, taking the BEV image of fig. 4 as an example. This embodiment extracts the short columns in the BEV image and the corresponding accurate boundary Mask as features with the Mask-R-CNN network, and then identifies the top and bottom points in those features with the DETR model. Since a column has a simple structure, representing it by its top and bottom points reduces redundant information, and the DETR-based algorithm outputs the top and bottom points directly, giving a convenient and concise representation for column recognition.
S5, projecting the top and bottom points of the short columns in the BEV image onto the original image to obtain the recognition result on the original image. In this step, the coordinates corresponding to the top and bottom points of the short columns identified in the BEV image are calculated according to the correspondence described in formula (1), so as to obtain the recognition result of the columns in the original image. Fig. 8 shows the recognition result of the top and bottom points of the short columns in the original image.
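Projecting a detected point back is the same mapping used to generate the BEV image, applied to a single pixel; a sketch with the same illustrative names as the generate_bev function above:

```python
import numpy as np

def bev_point_to_original(u, v, K, H, min_x, min_y, r):
    """Map a BEV pixel (u, v), e.g. a detected top or bottom point, to
    original-image pixel coordinates via formula (1); flat ground assumed."""
    R = np.array([[1, 0, 0], [0, 0, -1], [0, 1, 0]], dtype=np.float64)
    world = np.array([min_x + u * r, min_y + v * r, -H])  # BEV grid point
    cam = R @ world                  # original-image coordinate frame
    pix = K @ cam                    # pinhole projection
    return pix[0] / pix[2], pix[1] / pix[2]
```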
In the BEV image, objects of the same physical size remain the same size, without the near-large, far-small perspective effect. After the recognition result of the columns in the BEV image is obtained, the recognition result of the short columns in the original image is obtained through the correspondence between the BEV image and the original image.
This embodiment also provides a device corresponding to the method of identifying short columns provided in the above embodiment. Since the device provided in this embodiment corresponds to that method, the implementation of the method described above also applies to the device, and it will not be described in detail again in this embodiment.
Referring to fig. 9, a block diagram of the device for identifying short columns provided in this embodiment is shown; it includes an original image acquisition module 10, a BEV map generation module 20, a recognition module 30 and an output module 40, wherein:
the original image acquisition module 10 is used for acquiring an image of any road currently shot by the camera as an original image; the BEV map generation module 20 is configured to determine a target spatial range on the original image, and generate a BEV map based on the target spatial range based on the ground expression of the target spatial range; the recognition module 30 is configured to recognize the BEV graph by using a recognition model, and project the top and bottom points of the dwarf columnar body in the BEV graph onto the original image to obtain a recognition result of the dwarf columnar body in the original image, where the recognition model includes a Mask-R-CNN network and a DETR model; the Mask-R-CNN network is used for extracting the short columns in the BEV graph and the corresponding accurate boundary Mask; the DETR model is used to identify vertices and nadirs representing dwarf columns in BEV graphs; the output module 40 is used for outputting the recognition result of the short column in the image.
It should be noted that, in the apparatus provided in the foregoing embodiment, when implementing the functions thereof, only the division of the foregoing functional modules is used as an example, in practical application, the foregoing functional allocation may be implemented by different functional modules, that is, the internal structure of the device is divided into different functional modules, so as to implement all or part of the functions described above. In addition, the apparatus and the method embodiments provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the apparatus and the method embodiments are detailed in the method embodiments and are not repeated herein.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in a method of implementing an embodiment described above may be implemented by a program to instruct related hardware, and thus, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The foregoing has described the application in detail. Specific examples are used herein to explain the principles and embodiments of the application; the description of the above embodiments is only intended to help understand the method of the application and its core ideas. At the same time, those skilled in the art may make changes to the specific embodiments and the scope of application in accordance with the ideas of the application. In summary, the content of this description should not be construed as limiting the application.

Claims (7)

1. A method of identifying short columns based on BEV maps, the method comprising the steps of:
S1, acquiring an image of any section of road photographed by a camera as an original image;
S2, determining a target spatial range on the original image, further determining the ground expression of the target spatial range, and generating a BEV image of the target space, wherein the BEV image is a bird's-eye-view image;
S3, identifying the BEV image by adopting a recognition model, and projecting the top points and bottom points of the short columns in the BEV image and an accurate boundary Mask onto the original image to obtain the recognition result of the short columns in the original image, wherein the recognition model comprises a Mask-R-CNN network and a DETR model;
the Mask-R-CNN network is used for extracting the short columns in the BEV image and the corresponding accurate boundary Mask;
the DETR model is used for identifying the top points and bottom points representing the short columns in the BEV image;
S4, outputting the recognition result of the short columns in the image.
2. The method for identifying short columns according to claim 1, wherein in step S2, the specific process comprises the steps of:
S21, determining outward-expanded boundaries from the two side boundaries of the road contained in the original image, typically expanding 1 m to 2 m outward with the road boundary as the starting point;
S22, taking the distance between the two expanded boundaries as the width range w of the BEV image, taking the distance h extending forward from the current camera position as the height range of the BEV image, and denoting the Z coordinate of all pixel points in the target spatial range as the ground expression -H, wherein the resolution of the BEV image is determined by the width range and the height range combined with the distance represented by each pixel;
S23, defining the BEV grid, wherein the width and height of the BEV grid area are width and height respectively;
S24, calculating the pixel coordinates in the original image of each point in the BEV grid according to the correspondence between world-coordinate-system coordinates and pixel coordinates, sampling with bilinear interpolation to obtain the RGB color value of each point, and generating the BEV image.
3. The method of identifying short columns according to claim 1, wherein in step S22, determining the ground expression comprises two ways, respectively:
assuming the road surface is a plane, the Z coordinate of all points in the target spatial range is -H, i.e., the ground is expressed as Z = -H, where H is the camera height;
obtaining the accurate ground by using a ground-extraction algorithm, wherein the Z coordinates of all points in the target spatial range are the true values.
4. The method for identifying short columns according to claim 1, wherein in step S3, the Mask-R-CNN network is used to extract the short columns in the BEV image and the corresponding accurate boundary Mask, and the specific process comprises the following steps:
S311, inputting the BEV image into a pre-trained neural network to obtain the corresponding feature map, wherein the pre-trained neural network includes, but is not limited to, ResNet, RegNet and HRNet;
S312, setting a preset region of interest (ROI), namely a columnar body, for each point in the feature map, so as to obtain a plurality of candidate ROIs;
S313, sending the candidate ROIs into an RPN (Region Proposal Network) for binary classification and bounding-box (BB) regression, and filtering out part of the candidate ROIs;
S314, performing the RoIAlign operation on the ROIs remaining after filtering with their bounding-box regression results;
S315, performing N-class classification, BB regression and accurate boundary mask generation on the candidate ROIs subjected to the RoIAlign operation.
5. The method for identifying short columns according to claim 1 or 4, wherein the loss function L of the Mask-R-CNN network is expressed as:
L = L_cls + L_box + L_mask
where L_cls is the classification loss, L_box is the bounding-box loss, and L_mask is a binary cross-entropy loss.
6. The method of identifying short columns according to claim 1, wherein in step S3, the DETR model is used to identify the top points and bottom points representing the short columns in the BEV image, the specific process comprising the steps of:
S321, inputting the short columns and the corresponding accurate boundary mask into the DETR model to obtain the feature matrix of the BEV image;
S322, flattening the BEV feature matrix and adding position codes;
S323, inputting the sequence obtained in step S322 into the Transformer encoder to learn the correlation information among the features;
S324, taking the output of the encoder and the object queries as the input of the decoder to obtain the decoded information;
S325, feeding the decoded information into the feedforward neural network FFN to obtain the prediction information;
S326, judging whether the prediction information output by the feedforward neural network FFN contains a columnar target object;
if so, outputting the top points and bottom points corresponding to all columnar objects;
if not, outputting the 'no object' class.
7. An apparatus for carrying out the method for identifying short columns as defined in any one of claims 1 to 6, comprising:
the original image acquisition module (10) is used for acquiring an image of any section of road photographed by the camera as an original image;
a BEV map generation module (20), the BEV map generation module (20) being configured to determine a target spatial range on the original image and to generate the corresponding BEV map based on the ground expression of the target spatial range;
the recognition module (30) is used for recognizing the BEV image by adopting a recognition model, and projecting the top points and bottom points of the short columns in the BEV image onto the original image to obtain the recognition result of the short columns in the original image, wherein the recognition model comprises a Mask-R-CNN network and a DETR model;
the Mask-R-CNN network is used for extracting the short columns in the BEV image and the corresponding accurate boundary Mask;
the DETR model is used for identifying the top points and bottom points representing the short columns in the BEV image;
and the output module (40) is used for outputting the recognition result of the short columns in the image.
CN202311089079.5A 2023-08-28 2023-08-28 Method and device for identifying short columns based on BEV (bird's-eye-view) maps Pending CN116958927A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311089079.5A 2023-08-28 2023-08-28 Method and device for identifying short columns based on BEV (bird's-eye-view) maps

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311089079.5A 2023-08-28 2023-08-28 Method and device for identifying short columns based on BEV (bird's-eye-view) maps

Publications (1)

Publication Number Publication Date
CN116958927A 2023-10-27

Family

ID=88449321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311089079.5A 2023-08-28 2023-08-28 Method and device for identifying short columns based on BEV (bird's-eye-view) maps

Country Status (1)

Country Link
CN (1) CN116958927A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117455923A (en) * 2023-12-26 2024-01-26 通达电磁能股份有限公司 Insulator defect detection method and system based on YOLO detector
CN117455923B (en) * 2023-12-26 2024-03-15 通达电磁能股份有限公司 Insulator defect detection method and system based on YOLO detector

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination