CN114565900A - Target detection method based on improved YOLOv5 and binocular stereo vision - Google Patents


Info

Publication number
CN114565900A
CN114565900A · CN202210055550.8A · CN202210055550A
Authority
CN
China
Prior art keywords
module
yolov5
attention
network
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210055550.8A
Other languages
Chinese (zh)
Inventor
黎国溥
陈升东
袁峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Institute of Software Application Technology Guangzhou GZIS
Original Assignee
Guangzhou Institute of Software Application Technology Guangzhou GZIS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Institute of Software Application Technology Guangzhou GZIS filed Critical Guangzhou Institute of Software Application Technology Guangzhou GZIS
Priority to CN202210055550.8A
Publication of CN114565900A
Legal status: Pending

Classifications

    • G01S 15/86: Combinations of sonar systems with lidar systems; combinations of sonar systems with systems not using wave reflection
    • G01S 15/931: Sonar systems specially adapted for anti-collision purposes of land vehicles
    • G06F 18/23213: Pattern recognition; non-hierarchical clustering techniques using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
    • G06F 18/253: Pattern recognition; fusion techniques of extracted features
    • G06N 3/045: Neural networks; architectures; combinations of networks
    • G06N 3/048: Neural networks; activation functions
    • G06N 3/08: Neural networks; learning methods
    • Y02T 10/40: Road transport; engine management systems (climate change mitigation technologies related to transportation)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Probability & Statistics with Applications (AREA)
  • Acoustics & Sound (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method based on improved YOLOv5 and binocular stereo vision. An SE module is added to the backbone network of YOLOv5 to select channel-wise feature information and improve feature expression; the attention module CBAM is fused into the Neck of the YOLOv5 network, strengthening feature extraction and making the model focus more on the detected targets; the CIOU loss function replaces the original bounding-box regression loss, addressing the low localization accuracy and slow box regression during training; DIOU-NMS replaces the original NMS, improving detection accuracy under occlusion. Binocular rectification, stereo matching and binocular ranging are performed; ultrasonic radar data is parsed; and fusion calculation outputs perception information such as type, confidence, three-dimensional coordinates and physical length and width, improving safety.

Description

Target detection method based on improved YOLOv5 and binocular stereo vision
Technical Field
The invention relates to the technical field of automatic driving in intelligent transportation, in particular to a target detection method based on improved YOLOv5 and binocular stereo vision.
Background
Many people die in traffic accidents in China every year, bringing the pain of casualties to their families. In recent years, automatic driving aimed at significantly reducing accidents has become a social need. The core of an automatic driving system can be divided into three parts: perception, planning and control. Perception collects information from the vehicle's driving environment and extracts the knowledge needed for subsequent planning and control; it is a fundamental link in implementing automatic driving technology. Three-dimensional target detection is an important branch of the environment perception system in the automatic driving field and is of great research significance for traffic safety.
Traditional target detection methods are mainly based on hand-crafted feature learning. Region selection by traversing the image incurs high time complexity, and the extracted features are not robust to variations in object shape, illumination and background. To overcome the limitations of traditional machine learning, researchers introduced deep-learning-based convolutional neural networks (CNNs). Compared with traditional methods, a convolutional neural network can accurately extract suitable features without specially designed descriptors. CNN-based detection methods fall into two main categories: one-stage and two-stage. Two-stage methods, represented by Faster R-CNN, share convolutional features and use an RPN to generate proposal boxes at the feature level, then classify and regress the target boxes from the proposal-region features; they are accurate but slow. One-stage detectors, represented by YOLO, predict the localization and classification of target boxes in a single pass at the output layer by regression, and are widely used in detection tasks because of their high speed.
As the latest version of the YOLO series, YOLOv5 performs noticeably better than previous versions, but its detection accuracy is still insufficient against today's complex environmental backgrounds.
In automatic driving scenarios, two-dimensional target detection cannot provide all the information needed to perceive the environment: it only gives the position of a target in the two-dimensional image and the confidence of the corresponding category, whereas objects in the real three-dimensional world have three-dimensional shapes, and most applications require their spatial coordinates and physical dimensions.
In vision-based target detection, the limited field of view of the camera also leaves a blind area in the detection range.
Disclosure of Invention
In view of this, the invention provides a target detection method based on improved YOLOv5 and binocular stereo vision, aimed at low-speed automatic driving in campus scenarios, where the autonomous vehicle must quickly and accurately recognize vehicles, pedestrians, obstacles and other targets ahead so that it can avoid obstacles autonomously while cruising along its track.
The invention solves the problems through the following technical means:
a target detection method based on improved YOLOv5 and binocular stereo vision comprises the following steps:
adding an SE module to the backbone network of YOLOv5, fusing the attention module CBAM into the Neck of the YOLOv5 network, replacing the original bounding-box regression loss with the CIOU loss function, replacing the original NMS with DIOU-NMS, replacing the original SiLU activation function after the convolution operations with the Hardswish activation function, training the improved YOLOv5 network, and performing target detection with the improved YOLOv5 model;
computing the disparity between the left and right camera images based on binocular stereo vision to obtain a depth map, and then combining the target position, category and confidence from the two-dimensional image to compute the distance, spatial coordinates and physical size of each detected target, which enriches the perception information and improves reliability;
installing ultrasonic radars around the vehicle and parsing the ultrasonic radar data to detect whether objects are present at close range, compensating for the close-range blind area of binocular stereo vision;
performing fusion calculation and outputting the perception result, including: type, confidence, three-dimensional coordinates, and physical length and width.
Further, the structure of YOLOv5 is divided into four parts: the Input end, the Backbone network, the Neck network and the Head output end;
the Input end performs data preprocessing, including Mosaic data augmentation and adaptive image padding; to accommodate different data sets, YOLOv5 also integrates adaptive anchor-box calculation at the Input end, which automatically sets the initial anchor size when the data set changes;
the Backbone network extracts features of different levels from the image through deep convolution, using the bottleneck cross-stage partial structure BottleneckCSP and spatial pyramid pooling SPP; BottleneckCSP reduces computation and speeds up inference, while SPP extracts features of different scales from the same feature map, which helps improve detection accuracy;
the Neck network contains a feature pyramid network FPN and a path aggregation network PAN; the FPN passes semantic information top-down and the PAN passes localization information bottom-up, fusing information from different layers of the Backbone and further improving detection capability;
the Head output end serves as the final detection stage and predicts targets of different sizes on feature maps of different sizes.
Further, adding an SE module to the YOLOv5 backbone network specifically includes:
the SE module uses a one-dimensional vector with as many elements as there are channels as the evaluation score of each channel, and applies these scores to the corresponding channels to process the output feature map; the SE module learns the correlation between channels, selects channel-wise feature information, and improves feature expression;
the SE module is added behind the SPP module in the backbone network;
let the input be W × H × C image data, where W is the image width, H the image height and C the number of image channels;
the SE module first applies a Squeeze operation to the input, a global average pooling that turns the feature map into a 1×1×C vector;
an Excitation operation is then applied to the 1×1×C vector: a fully connected layer reduces it to 1×1×(C×SERatio), where SERatio is a scaling factor that reduces the number of channels to save computation, an activation function is applied, and a second fully connected layer followed by an activation function restores a 1×1×C vector;
finally, a Scale operation takes the excitation output 1×1×C and the W × H × C input of the whole module and multiplies them channel by channel, so the SE module outputs the input feature map reweighted by the learned per-channel weights.
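The patent does not provide code; as a rough illustration of the Squeeze, Excitation and Scale steps just described, the following PyTorch sketch is one possible form (the module name, SERatio value and the channel count in the example are assumptions):

```python
# Illustrative SE block sketch (assumed names and hyperparameters, not the patent's code)
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, se_ratio: int = 16):
        super().__init__()
        # Squeeze: global average pooling turns W x H x C into a 1x1xC vector
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        # Excitation: FC -> ReLU -> FC -> Sigmoid produces one weight per channel
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // se_ratio),
            nn.ReLU(inplace=True),
            nn.Linear(channels // se_ratio, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)       # squeezed 1x1xC vector
        w = self.excite(w).view(b, c, 1, 1)  # per-channel weights in (0, 1)
        return x * w                         # Scale: reweight each input channel

# Example: reweighting a feature map such as the one produced after the SPP module
feat = torch.randn(1, 512, 20, 20)
print(SEBlock(512)(feat).shape)  # torch.Size([1, 512, 20, 20])
```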
Further, fusing the attention module CBAM with the Neck part of the YOLOv5 network specifically includes:
in a CNN, an attention mechanism acts on the feature map to obtain the attention information available in it, including spatial attention and channel attention; the convolutional attention module CBAM attends to spatial and channel information simultaneously, reconstructing intermediate feature maps of the network through the channel attention module CAM and the spatial attention module SAM, emphasizing important features and suppressing general ones, thereby improving the target detection effect;
the feature map obtained from the network is input to the CBAM module, which consists of two parts: the input feature map is first convolved and then sent to the channel attention module, after which the spatial attention module adjusts the features to produce the output of the whole module;
let a layer of the YOLOv5 convolution pipeline output a three-dimensional feature map F with C channels, height H and width W, i.e. F ∈ R^(C×H×W); CBAM sequentially infers a one-dimensional channel attention Mc and a two-dimensional spatial attention Ms from F, each multiplied element-wise with the feature map, finally giving an output feature map of the same dimensions as F;
let Mc(F) denote the channel attention reconstruction of the feature map by the CAM module, producing the feature map F'; let Ms(F') denote the spatial attention reconstruction of the channel attention output F' by the SAM module; ⊗ denotes element-wise multiplication; the formulas are:
F' = Mc(F) ⊗ F
F'' = Ms(F') ⊗ F'
the channel attention module CAM performs max pooling and average pooling on each channel of the input feature map F, passes the resulting vectors through a multilayer perceptron MLP, adds the two MLP outputs element-wise, and applies a Sigmoid activation to obtain a channel scaling factor, which is multiplied with the input feature map to give the channel attention feature map F'; to reduce computation, the MLP has only one hidden layer;
the spatial attention module SAM performs max pooling and average pooling on the channel attention output F' along the channel direction, concatenates the two results, obtains a spatial scaling factor through a convolution and a Sigmoid activation, and multiplies it with the channel attention output to give the spatial attention feature map F'';
finally, the outputs F' and F'' of the two modules are added to the input of the CBAM module to obtain the new features output by the whole CBAM module;
the key operation of the attention mechanism is to highlight important information in the feature map and suppress general information; in the YOLOv5 network the most critical feature extraction happens in the Backbone, so the CBAM modules are fused at the Backbone outputs, before the Neck feature fusion; the design consideration is that feature extraction is completed in the Backbone and prediction is output on different feature maps after the Neck fusion, so attention reconstruction by CBAM at this point links the extracted features to the subsequent predictions;
a CBAM module is added before each of the three Neck feature-fusion branches, highlighting important information in the feature maps; together with the later further feature extraction and prediction on the different feature maps, this improves the target detection effect.
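As an illustrative sketch of the CAM and SAM operations described above, the following PyTorch code shows the standard CBAM form; the reduction ratio, kernel size and channel counts are assumptions, and the residual addition of F', F'' and the module input mentioned above can be placed around this block:

```python
# Illustrative CBAM sketch (assumed hyperparameters; not the patent's code)
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # One-hidden-layer MLP shared by the max-pooled and average-pooled vectors
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        scale = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * scale                     # F' = Mc(F) (x) F

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)    # average pooling along channels
        mx = x.amax(dim=1, keepdim=True)     # max pooling along channels
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale                     # F'' = Ms(F') (x) F'

class CBAM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))           # channel attention, then spatial attention

print(CBAM(256)(torch.randn(1, 256, 40, 40)).shape)  # torch.Size([1, 256, 40, 40])
```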
Further, replacing the original bounding-box regression loss with the CIOU loss function specifically includes:
for the loss function, the CIOU loss is adopted for box regression; the IOU loss considers the overlapping area of the detection box and the target box; the GIOU loss additionally handles the case where the boxes do not overlap; the DIOU loss adds the information of the distance between box centers; and the CIOU loss further adds the aspect-ratio scale information of the boxes;
GIOU first computes the area Ac of the smallest box that simultaneously encloses the prediction box and the ground-truth box; the IOU is obtained from the union U of the two boxes; the proportion of the enclosing area that is not covered by the union is then computed; finally, subtracting this ratio from the initial IOU gives the GIOU, with the formula:
GIOU = IOU − (Ac − U)/Ac
Loss_GIOU = 1 − GIOU
Loss_GIOU is the GIOU loss function; GIOU is symmetric with a value range of [−1, 1], reaching the maximum 1 when the two boxes coincide and the minimum −1 when they do not intersect and are infinitely far apart; GIOU attends not only to the overlapping region but also to the non-overlapping regions, and better reflects how well the two boxes coincide; however, as a bounding-box regression loss, GIOU considers neither the distance between box centers nor the aspect-ratio scale of the boxes;
CIOU simultaneously considers the overlapping area of the detection box and the target box, the distance between box centers, and the aspect ratio of the boxes, accelerating the regression of the target detection box during training and improving localization accuracy; the CIOU formula is:
CIOU = IOU − ρ²(b, b^gt)/c² − αv
where b is the predicted bounding box, b^gt is the ground-truth bounding box, ρ²(b, b^gt) is the squared Euclidean distance between the center points of the predicted and ground-truth boxes, and c is the diagonal length of the smallest enclosing region that contains both boxes; α is a weighting function and v measures the consistency of the aspect ratios; α and v are given by:
α = v / ((1 − IOU) + v)
v = (4/π²) · (arctan(w^gt/h^gt) − arctan(w/h))²
the CIOU loss is:
Loss_CIOU = 1 − CIOU = 1 − IOU + ρ²(b, b^gt)/c² + αv
where w^gt and h^gt are the width and height of the ground-truth box, and w and h are the width and height of the predicted box.
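A small NumPy sketch of the CIOU loss defined by the formulas above; the (x1, y1, x2, y2) box format and the example values are assumptions, and this is an illustrative re-implementation rather than the patent's code:

```python
# CIOU loss sketch following the formulas in the text (illustrative only)
import math
import numpy as np

def ciou_loss(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    # Intersection-over-union of the two boxes
    ix1, iy1 = max(pred[0], target[0]), max(pred[1], target[1])
    ix2, iy2 = min(pred[2], target[2]), min(pred[3], target[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    wgt, hgt = target[2] - target[0], target[3] - target[1]
    union = w * h + wgt * hgt - inter + eps
    iou = inter / union

    # rho^2: squared distance between box centers; c^2: squared diagonal of the
    # smallest enclosing box
    rho2 = ((pred[0] + pred[2]) - (target[0] + target[2])) ** 2 / 4 + \
           ((pred[1] + pred[3]) - (target[1] + target[3])) ** 2 / 4
    cx1, cy1 = min(pred[0], target[0]), min(pred[1], target[1])
    cx2, cy2 = max(pred[2], target[2]), max(pred[3], target[3])
    c2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2 + eps

    # Aspect-ratio consistency term v and its weight alpha
    v = (4 / math.pi ** 2) * (math.atan(wgt / hgt) - math.atan(w / h)) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v

print(ciou_loss(np.array([0, 0, 10, 10.0]), np.array([2, 2, 12, 12.0])))
```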
Further, replacing the original NMS with DIOU-NMS specifically includes:
DIOU-NMS replaces the IOU in the NMS with the DIOU, so the suppression criterion analyses the overlapping area and also the distance between the center points of two boxes, which is better suited to target detection under occlusion in road traffic scenes;
assume the model outputs a candidate box set B with the corresponding category confidence set s; the classification scores are updated with respect to the highest-scoring prediction box M as follows:
s_i = s_i, if IOU − R_DIOU(M, B_i) < ε
s_i = 0, if IOU − R_DIOU(M, B_i) ≥ ε
R_DIOU denotes the center-distance penalty term used by DIOU-NMS; its inputs are M and the candidates B_i, where i indexes the set iterated over; s_i is the classification score and ε is the NMS threshold; when the value IOU − R_DIOU between the highest-scoring box M and a candidate B_i is small, the score s_i of B_i is kept; otherwise, when IOU − R_DIOU exceeds the NMS threshold, s_i is set to 0 and the box is filtered out; two boxes whose centers are far apart may lie on different objects and therefore should not be deleted directly; by analysing both the IOU of the two boxes and the distance between their center points before deleting a candidate B_i, DIOU-NMS improves the accuracy of target detection under occlusion.
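An illustrative NumPy sketch of the DIOU-NMS suppression rule above; the box format, the threshold value and the function name are assumptions:

```python
# DIOU-NMS sketch: suppress a candidate only when IoU minus the center-distance
# penalty exceeds the NMS threshold (illustrative, not the patent's code)
import numpy as np

def diou_nms(boxes: np.ndarray, scores: np.ndarray, thresh: float = 0.5) -> list:
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        m = order[0]                      # highest-scoring box M
        keep.append(int(m))
        rest = order[1:]
        if rest.size == 0:
            break
        # IoU between M and the remaining candidates
        ix1 = np.maximum(boxes[m, 0], boxes[rest, 0])
        iy1 = np.maximum(boxes[m, 1], boxes[rest, 1])
        ix2 = np.minimum(boxes[m, 2], boxes[rest, 2])
        iy2 = np.minimum(boxes[m, 3], boxes[rest, 3])
        inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
        area_m = (boxes[m, 2] - boxes[m, 0]) * (boxes[m, 3] - boxes[m, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_m + area_r - inter + 1e-7)
        # Center-distance penalty R_DIOU = d^2 / c^2
        cxm, cym = (boxes[m, 0] + boxes[m, 2]) / 2, (boxes[m, 1] + boxes[m, 3]) / 2
        cxr, cyr = (boxes[rest, 0] + boxes[rest, 2]) / 2, (boxes[rest, 1] + boxes[rest, 3]) / 2
        d2 = (cxm - cxr) ** 2 + (cym - cyr) ** 2
        ex1 = np.minimum(boxes[m, 0], boxes[rest, 0]); ey1 = np.minimum(boxes[m, 1], boxes[rest, 1])
        ex2 = np.maximum(boxes[m, 2], boxes[rest, 2]); ey2 = np.maximum(boxes[m, 3], boxes[rest, 3])
        c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-7
        # Keep candidates whose IoU - R_DIOU stays below the threshold
        order = rest[(iou - d2 / c2) < thresh]
    return keep
```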
Further, the Hardswish activation function is defined as follows:
Hardswish(x) = 0, if x ≤ −3
Hardswish(x) = x, if x ≥ +3
Hardswish(x) = x·(x + 3)/6, otherwise
where x is the input value.
Further, binocular calibration provides the intrinsic parameters of the left camera and the right camera, and the rotation, translation, tangential distortion and radial distortion parameters between the two cameras;
binocular rectification removes lens distortion and converts the stereo camera pair into the standard form, so that the two images of the same object have the same size and are row-aligned along the same horizontal line; rectification mainly comprises 4 steps: the original images are input, the calibration parameters such as tangential and radial distortion are obtained and the distortion is removed, the images are rectified by the algorithm, and finally they are cropped to obtain images in the standard form.
Further, binocular stereo matching finds the corresponding points between the left and right camera images; stereo matching uses the SGBM (Semi-Global Block Matching) algorithm, a semi-global matching method; first the matching cost between pixel pairs in the left and right images is computed; the larger the matching cost, the lower the probability that the two pixels are corresponding points; cost aggregation and disparity computation follow, and finally the disparity is optimized to generate the disparity map.
Further, for binocular ranging, given the disparity map, baseline and focal length, the corresponding position in world coordinates, i.e. the distance Z, is computed by triangulation;
disparity: d = x_l − x_r
by similar triangles:
(T − (x_l − x_r)) / (Z − f) = T / Z
Z = f·T/(x_l − x_r)
Z = f·T/d
where f is the focal length, i.e. the distance between the sensor and the lens; d is the disparity, i.e. the difference between the x-coordinate of the same spatial point at the left camera pixel (x_l, y_l) and the x-coordinate of its corresponding point (x_r, y_r) in the right camera; T is the distance between the lenses of the two cameras.
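A minimal numerical sketch of the Z = f·T/d relation above; the focal length, baseline and pixel coordinates are made-up example values:

```python
# Depth from disparity: Z = f * T / d (illustrative example values)
def depth_from_disparity(f_px: float, baseline_m: float, x_left: float, x_right: float) -> float:
    d = x_left - x_right          # disparity in pixels
    if d <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    return f_px * baseline_m / d  # Z = f * T / d

# Example: f = 700 px, baseline T = 0.12 m, disparity 14 px -> Z = 6.0 m
print(depth_from_disparity(700.0, 0.12, 520.0, 506.0))
```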
Compared with the prior art, the beneficial effects of the invention include at least:
1. The improved YOLOv5 model strengthens feature extraction, improves localization accuracy, shortens the regression time of the target detection box during training, improves detection accuracy under occlusion, and improves the detector's recognition effect.
2. Based on binocular stereo vision, the distance, spatial coordinates, physical size and other information of each detected target are provided, enriching perception information and improving reliability.
3. Eight ultrasonic radars installed around the vehicle detect whether objects are present at close range, compensating for the close-range blind area of binocular stereo vision and improving safety.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of the key steps of the target detection method based on improved YOLOv5 and binocular stereo vision of the present invention;
FIG. 2 is a system flow diagram of the object detection method of the present invention based on improved YOLOv5 and binocular stereo vision;
FIG. 3 is a diagram of the YOLOv5 network architecture according to the present invention;
FIG. 4 is a block diagram of the YOLOv5 component of the present invention;
FIG. 5 is a block diagram of the improved YOLOv5 backbone network of the present invention;
FIG. 6 is a CBAM structural diagram of the convolution attention module of the present invention;
FIG. 7 is a block diagram of an improved YOLOv5 Neck network of the present invention;
FIG. 8 is a diagram of the improved YOLOv5 network architecture of the present invention;
FIG. 9 is a block diagram of the improved YOLOv5 module of the present invention;
FIG. 10 is a schematic diagram of the binocular range finding of the present invention;
FIG. 11 is a diagram of the target detection effect of the improved YOLOv5 model;
FIG. 12 is a depth effect map based on binocular stereo vision of the present invention, wherein the left side is a real-time camera image and the right side is a depth map;
FIG. 13 is a depth effect map based on binocular stereo vision of the present invention, wherein the left side is the real-time camera image and the right side is the depth map;
fig. 14 is a depth effect map based on binocular stereo vision of the present invention, in which the real-time camera image is on the left and the depth map is on the right.
Detailed Description
In order to make the above objects, features and advantages of the invention easier to understand, embodiments are described in detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the invention, not all of them; all other embodiments obtained by those skilled in the art without creative work on the basis of these embodiments fall within the protection scope of the invention.
As shown in fig. 1, the present invention provides a target detection method based on improved YOLOv5 and binocular stereo vision, comprising the following steps:
adding an SE module to the backbone network of YOLOv5, fusing the attention module CBAM into the Neck of the YOLOv5 network, replacing the original bounding-box regression loss with the CIOU loss function, replacing the original NMS with DIOU-NMS, replacing the original SiLU activation function after the convolution operations with the Hardswish activation function, training the improved YOLOv5 network, and performing target detection with the improved YOLOv5 model;
computing the disparity between the left and right camera images based on binocular stereo vision to obtain a depth map, and then combining the target position, category and confidence from the two-dimensional image to compute the distance, spatial coordinates and physical size of each detected target, which enriches the perception information and improves reliability;
installing ultrasonic radars around the vehicle and parsing the ultrasonic radar data to detect whether objects are present at close range, compensating for the close-range blind area of binocular stereo vision;
performing fusion calculation and outputting the perception result, including: type, confidence, three-dimensional coordinates, and physical length and width.
The invention uses two sensors: a binocular camera and ultrasonic radar. The binocular camera outputs the left and right camera images simultaneously, and the ultrasonic radar outputs close-range obstacle information. The core processing comprises three parts: the first performs target detection with the improved YOLOv5 model; the second performs binocular rectification, stereo matching and binocular ranging; the third parses the ultrasonic radar data. Finally the three parts of data are fused, and perception information such as type, confidence, three-dimensional coordinates and physical length and width is output. The system flow is shown in fig. 2.
The image vision model based on binocular stereo vision and target detection covers a detection range of 1.2 m to 8.5 m. The ultrasonic radar outputs obstacle distances over a detection range of 0.3 m to 2.5 m. After the image vision and ultrasonic radar are fused, the ultrasonic perception is used at close range and the image vision perception at medium range; in the region where the two overlap, a priority and confidence strategy produces reliable perception information.
The fused ultrasonic information compensates for the close-range blind area of the image vision model; when weather conditions are poor or the camera is occluded, the ultrasonic radar detection still provides close-range perception, improving system safety.
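A hedged sketch of this range-based fusion: ultrasonic readings cover 0.3 m to 2.5 m, image vision covers 1.2 m to 8.5 m, and in the 1.2 m to 2.5 m overlap a priority/confidence rule selects the source. The concrete priority rule, the confidence threshold and the disagreement tolerance are assumptions for illustration:

```python
# Range-based sensor fusion sketch (thresholds and priority rule are assumptions)
def fuse_distance(ultra_dist_m, vision_dist_m, vision_conf, conf_thresh: float = 0.5):
    """Return (distance, source) for one detected object, or (None, None)."""
    ultra_ok = ultra_dist_m is not None and 0.3 <= ultra_dist_m <= 2.5
    vision_ok = (vision_dist_m is not None and 1.2 <= vision_dist_m <= 8.5
                 and vision_conf >= conf_thresh)
    if ultra_ok and vision_ok:
        # Overlap region: prefer the ultrasonic reading at close range, fall back to
        # vision when the two readings disagree strongly
        if abs(ultra_dist_m - vision_dist_m) < 0.5:
            return ultra_dist_m, "ultrasonic"
        return vision_dist_m, "vision"
    if ultra_ok:
        return ultra_dist_m, "ultrasonic"
    if vision_ok:
        return vision_dist_m, "vision"
    return None, None

print(fuse_distance(1.8, 2.0, 0.9))   # overlap region
print(fuse_distance(None, 5.4, 0.8))  # vision only
```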
A first part: improved YOLOv5 model
From analysis of the original YOLOv5 framework, the structure of YOLOv5 is mainly divided into four parts: the Input end, the Backbone network, the Neck network and the Head output end, as shown in fig. 3.
The YOLOv5 network structure component diagram is shown in fig. 4:
The Input end mainly performs data preprocessing, including Mosaic data augmentation and adaptive image padding; to accommodate different data sets, YOLOv5 integrates adaptive anchor-box calculation at the Input end, which automatically sets the initial anchor size when the data set changes.
The Backbone network extracts features of different levels from the image through deep convolution, mainly using the bottleneck cross-stage partial structure BottleneckCSP and spatial pyramid pooling SPP; BottleneckCSP reduces computation and speeds up inference, while SPP extracts features of different scales from the same feature map, which helps improve detection accuracy.
The Neck network contains a feature pyramid network FPN and a path aggregation network PAN; the FPN passes semantic information top-down and the PAN passes localization information bottom-up, fusing information from different layers of the Backbone and further improving detection capability.
The Head output end serves as the final detection stage and is mainly used to predict targets of different sizes on feature maps of different sizes.
The improvement points are as follows:
1. An SE module is added to the backbone network of YOLOv5 to select channel-wise feature information and improve feature expression.
2. To improve the detector's recognition effect and address the low vehicle recognition rate, an attention mechanism is fused into the detection network: the attention module CBAM is fused with the Neck of the YOLOv5 network, strengthening feature extraction and making the model focus more on the detected targets.
3. The CIOU loss function replaces the original bounding-box regression loss, addressing the low localization accuracy and slow box regression during training.
4. DIOU-NMS replaces the original NMS, reducing missed detections caused by target occlusion and improving detection accuracy under occlusion.
5. The Hardswish activation function replaces the original SiLU activation function after the convolution operations; it is numerically stable and fast to compute.
An SE module is added to the backbone network of YOLOv5 to select channel-wise feature information and improve feature expression.
The SE (Squeeze-and-Excitation) module uses a one-dimensional vector with as many elements as there are channels as the evaluation score of each channel, and applies these scores to the corresponding channels to process the output feature map. Its main role is to learn the correlation between channels, select channel-wise feature information, and improve feature expression.
The SE module is added behind the SPP module in the backbone network; the improved backbone and its components are shown in fig. 5.
Let the input be W × H × C image data, where W is the image width, H the image height and C the number of image channels.
The SE module first applies a Squeeze operation to the input, a global average pooling that turns the feature map into a 1×1×C vector;
an Excitation operation is then applied to the 1×1×C vector: a fully connected layer reduces it to 1×1×(C×SERatio), where SERatio is a scaling factor that reduces the number of channels to save computation, an activation function is applied, and a second fully connected layer followed by an activation function restores a 1×1×C vector.
Finally, a Scale operation takes the excitation output 1×1×C and the W × H × C input of the whole module and multiplies them channel by channel, so the SE module outputs the input feature map reweighted by the learned per-channel weights.
The invention fuses the attention module CBAM with the Neck part of the YOLOv5 network.
In a CNN, attention acts on the feature map to obtain the attention information available in it, mainly spatial attention and channel attention. The Convolutional Block Attention Module (CBAM) attends to both spatial and channel information, reconstructing intermediate feature maps of the network through the channel attention module CAM (Channel Attention Module) and the spatial attention module SAM (Spatial Attention Module), emphasizing important features and suppressing general ones to improve the target detection effect.
The feature map obtained from the network is input to the CBAM module, which can be divided into two parts: the input feature map is first convolved and then sent to the channel attention module, after which the spatial attention module adjusts the features to produce the output of the whole module.
In the convolution pipeline of YOLOv5, let a layer output a three-dimensional feature map F with C channels, height H and width W, i.e. F ∈ R^(C×H×W). CBAM sequentially infers the one-dimensional channel attention Mc and the two-dimensional spatial attention Ms from F, multiplies each element-wise with the feature map, and finally obtains an output feature map with the same dimensions as F.
Let Mc(F) denote the channel attention reconstruction of the feature map by the CAM module, producing the feature map F'; let Ms(F') denote the spatial attention reconstruction of the channel attention output F' by the SAM module; ⊗ denotes element-wise multiplication; the formulas are:
F' = Mc(F) ⊗ F
F'' = Ms(F') ⊗ F'
The structures of the convolutional attention module CBAM, the channel attention module CAM and the spatial attention module SAM are shown in fig. 6.
The channel attention module CAM performs max pooling and average pooling on each channel of the input feature map F, passes the resulting vectors through a Multi-Layer Perceptron (MLP), adds the two MLP output vectors element-wise, and applies a Sigmoid activation to obtain a channel scaling factor, which is multiplied with the input feature map to give the channel attention feature map F'. To reduce computation, the MLP has only one hidden layer.
The spatial attention module SAM performs max pooling and average pooling on the channel attention output F' along the channel direction, concatenates the two results, obtains a spatial scaling factor through a convolution and a Sigmoid activation, and multiplies it with the channel attention output to give the spatial attention feature map F''.
Finally, the outputs F' and F'' of the two modules are added to the input of the CBAM module to obtain the new features output by the whole CBAM module.
The key operation of the attention mechanism is to highlight important information in the feature map and suppress general information. In the YOLOv5 network the most critical feature extraction happens in the Backbone, so the invention fuses the CBAM modules at the Backbone outputs, before the Neck feature fusion; the design consideration is that feature extraction is completed in the Backbone and prediction is output on different feature maps after the Neck fusion, so attention reconstruction by CBAM at this point links the extracted features to the subsequent predictions. The specifically improved structure is shown in fig. 7.
A CBAM module is added before each of the three Neck feature-fusion branches, highlighting important information in the feature maps; together with the later further feature extraction and prediction on the different feature maps, this improves the target detection effect.
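A schematic PyTorch sketch of this placement: one CBAM in front of each of the three Neck fusion branches, applied to the three backbone output scales before the FPN/PAN fusion. The channel sizes and the wrapper class are assumptions; a real implementation would wire this into the YOLOv5 Neck definition:

```python
# Placement sketch: CBAM applied to each backbone output before Neck fusion (assumed sizes)
import torch
import torch.nn as nn

class NeckWithCBAM(nn.Module):
    def __init__(self, cbam_factory, channels=(256, 512, 1024)):
        super().__init__()
        # One CBAM per backbone output scale, inserted before the Neck fusion branches
        self.cbam = nn.ModuleList(cbam_factory(c) for c in channels)

    def forward(self, p3, p4, p5):
        # Attention reconstruction of the three backbone outputs
        p3, p4, p5 = self.cbam[0](p3), self.cbam[1](p4), self.cbam[2](p5)
        # ... FPN/PAN fusion of p3, p4, p5 and the three detection heads would follow here ...
        return p3, p4, p5

# Smoke test with CBAM stubbed out by nn.Identity; real use would pass the CBAM class above
outs = NeckWithCBAM(lambda c: nn.Identity())(
    torch.randn(1, 256, 80, 80), torch.randn(1, 512, 40, 40), torch.randn(1, 1024, 20, 20))
print([tuple(o.shape) for o in outs])
```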
The CIOU loss function replaces the original bounding-box regression loss, addressing the low localization accuracy and slow regression of the target detection box during training.
For the loss function, the CIOU loss is adopted for box regression. The IOU loss considers the overlapping area of the detection box and the target box. The GIOU loss additionally handles the case where the boxes do not overlap. The DIOU loss adds the information of the distance between box centers. The CIOU loss further adds the aspect-ratio scale information of the boxes.
GIOU first computes the area Ac of the smallest box that simultaneously encloses the prediction box and the ground-truth box; the IOU is obtained from the union U of the two boxes; the proportion of the enclosing area that is not covered by the union is then computed; finally, subtracting this ratio from the initial IOU gives the GIOU, expressed as:
GIOU = IOU − (Ac − U)/Ac
Loss_GIOU = 1 − GIOU
Loss_GIOU is the GIOU loss function; GIOU is symmetric with a value range of [−1, 1]. It reaches the maximum 1 when the two boxes coincide and the minimum −1 when they do not intersect and are infinitely far apart. GIOU attends not only to the overlapping region but also to the non-overlapping regions, and better reflects how well the two boxes coincide. However, as a bounding-box regression loss, GIOU considers neither the distance between box centers nor the aspect-ratio scale of the boxes.
CIOU simultaneously considers the overlapping area of the detection box and the target box, the distance between box centers, and the aspect ratio of the boxes, accelerating the regression of the target detection box during training and improving localization accuracy. The CIOU formula is:
CIOU = IOU − ρ²(b, b^gt)/c² − αv
where b is the predicted bounding box, b^gt is the ground-truth bounding box, ρ²(b, b^gt) is the squared Euclidean distance between the center points of the predicted and ground-truth boxes, and c is the diagonal length of the smallest enclosing region that contains both boxes. α is a weighting function and v measures the consistency of the aspect ratios; α and v are given by:
α = v / ((1 − IOU) + v)
v = (4/π²) · (arctan(w^gt/h^gt) − arctan(w/h))²
The CIOU loss is:
Loss_CIOU = 1 − CIOU = 1 − IOU + ρ²(b, b^gt)/c² + αv
where w^gt and h^gt are the width and height of the ground-truth box, and w and h are the width and height of the predicted box.
DIOU-NMS replaces the original NMS, reducing missed detections caused by target occlusion and improving detection accuracy under occlusion.
In conventional NMS, the IOU is commonly used to suppress redundant detection boxes, but because the IOU analyses only the overlapping region, it easily produces wrong suppression when targets occlude each other, causing missed detections. To solve this, the invention adopts DIOU-NMS (Distance-IOU non-maximum suppression), which replaces the IOU in the NMS with the DIOU: the suppression criterion analyses the overlapping area and also the distance between the center points of two boxes, making it better suited to target detection under occlusion in road traffic scenes.
Assume the model outputs a candidate box set B with the corresponding category confidence set s; the classification scores are updated with respect to the highest-scoring prediction box M as follows:
s_i = s_i, if IOU − R_DIOU(M, B_i) < ε
s_i = 0, if IOU − R_DIOU(M, B_i) ≥ ε
R_DIOU denotes the center-distance penalty term used by DIOU-NMS; its inputs are M and the candidates B_i, where i indexes the set iterated over; s_i is the classification score and ε is the NMS threshold. When the value IOU − R_DIOU between the highest-scoring box M and a candidate B_i is small, the score s_i of B_i is kept; otherwise, when IOU − R_DIOU exceeds the NMS threshold, s_i is set to 0 and the box is filtered out. Two boxes whose centers are far apart may lie on different objects and therefore should not be deleted directly. By analysing both the IOU of the two boxes and the distance between their center points before deleting a candidate B_i, DIOU-NMS improves the accuracy of target detection under occlusion.
Hardswish is an activation function that uses a piecewise-linear approximation; it is numerically stable and fast to compute. The Hardswish activation function is defined as follows:
Hardswish(x) = 0, if x ≤ −3
Hardswish(x) = x, if x ≥ +3
Hardswish(x) = x·(x + 3)/6, otherwise
where x is the input value.
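The piecewise definition above written out directly in Python (PyTorch also provides this as nn.Hardswish):

```python
# Hardswish written from the piecewise definition above
def hardswish(x: float) -> float:
    if x <= -3.0:
        return 0.0
    if x >= 3.0:
        return x
    return x * (x + 3.0) / 6.0

print([hardswish(v) for v in (-4.0, -1.5, 0.0, 1.5, 4.0)])  # [0.0, -0.375, 0.0, 1.125, 4.0]
```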
Other details:
1. The classification loss uses a binary cross-entropy loss function.
2. The prior box scales are obtained by K-means clustering: the number of cluster centers is set to K, and the K centers are taken as the anchor boxes, so the important hyperparameter K is obtained automatically during training; these adaptive anchors are also a major feature of the algorithm (a sketch of the clustering is given after this list).
3. During optimization, L2 regularization is applied to the model parameters; L2 regularization decays the weights, and the decayed weights produce smaller outputs, so the network does not over-fit to individual values, which helps prevent over-fitting.
4. For the learning rate, the model's learning rate is adjusted in stages: a larger learning rate is used first so the model converges quickly and stabilizes faster, and a smaller learning rate is used once the model tends to be stable, slowing convergence, preventing over-fitting and keeping the deeper structure of the model stable.
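As referenced in item 2 above, a hedged sketch of K-means anchor extraction: the (width, height) pairs of the training boxes are clustered into K centers and the centers are used as anchor boxes. Plain Euclidean K-means is used here for brevity (an IoU-based distance is a common alternative), and K and the sample data are illustrative:

```python
# K-means anchor clustering sketch (Euclidean distance; illustrative data and K)
import numpy as np

def kmeans_anchors(wh: np.ndarray, k: int = 9, iters: int = 100, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    centres = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the nearest centre, then recompute the centres
        dists = np.linalg.norm(wh[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new = np.array([wh[labels == i].mean(axis=0) if np.any(labels == i) else centres[i]
                        for i in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return centres[np.argsort(centres.prod(axis=1))]   # sort anchors by area

boxes_wh = np.abs(np.random.default_rng(1).normal(loc=80, scale=40, size=(500, 2)))
print(kmeans_anchors(boxes_wh, k=9).round(1))
```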
The network structure of the improved YOLOv5 is shown in fig. 8.
The component structure of the improved YOLOv5 is shown in fig. 9.
A second part: binocular stereo vision
The disparity between the left and right camera images is computed based on binocular stereo vision to obtain a depth map; then, combining the target position, category and confidence from the two-dimensional image, the distance, spatial coordinates, physical size and other information of each detected target are computed, enriching perception information and improving reliability.
Binocular calibration provides the intrinsic parameters of the left camera and the right camera, and the rotation, translation, tangential distortion and radial distortion parameters between the two cameras.
Binocular rectification removes lens distortion and converts the stereo camera pair into the standard form, so that the two images of the same object have the same size and are row-aligned along the same horizontal line. Rectification mainly comprises 4 steps: the original images are input, the calibration parameters such as tangential and radial distortion are obtained and the distortion is removed, the images are rectified by the algorithm, and finally they are cropped to obtain images in the standard form.
Binocular stereo matching finds the corresponding points between the left and right camera images; stereo matching is based on the SGBM algorithm, short for Semi-Global Block Matching, a semi-global matching method. The flow mainly comprises 4 steps: first the matching cost between pixel pairs in the left and right images is computed; the larger the matching cost, the lower the probability that the two pixels are corresponding points; then cost aggregation and disparity computation are performed, and finally the disparity is optimized to generate the disparity map.
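A hedged OpenCV sketch of the SGBM matching flow described above: rectified left/right grayscale images in, disparity map out. All parameter values are illustrative defaults, not the patent's settings:

```python
# SGBM disparity sketch using OpenCV (illustrative parameters)
import cv2
import numpy as np

def compute_disparity(left_gray: np.ndarray, right_gray: np.ndarray) -> np.ndarray:
    matcher = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=128,          # must be divisible by 16
        blockSize=5,
        P1=8 * 5 * 5,                # smoothness penalties for small / large
        P2=32 * 5 * 5,               # disparity changes between neighbours
        uniquenessRatio=10,
        speckleWindowSize=100,
        speckleRange=2,
    )
    # OpenCV returns fixed-point disparities scaled by 16
    return matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0

# Usage (assumed file names); disparity can then be turned into depth with Z = f*T/d
# left = cv2.imread("left_rectified.png", cv2.IMREAD_GRAYSCALE)
# right = cv2.imread("right_rectified.png", cv2.IMREAD_GRAYSCALE)
# disp = compute_disparity(left, right)
```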
For binocular ranging, given the disparity map, baseline and focal length, the corresponding position in world coordinates, i.e. the distance Z, is computed by triangulation. The ranging principle is shown in fig. 10.
Disparity: d = x_l − x_r
By similar triangles:
(T − (x_l − x_r)) / (Z − f) = T / Z
Z = f·T/(x_l − x_r)
Z = f·T/d
where f is the focal length, i.e. the distance between the sensor and the lens; d is the disparity, i.e. the difference between the x-coordinate of the same spatial point at the left camera pixel (x_l, y_l) and the x-coordinate of its corresponding point (x_r, y_r) in the right camera; T is the distance between the lenses of the two cameras.
A third part: parsing the ultrasonic radar data
The ultrasonic radar measures short distances on the principle of emitting radar waves and receiving the echoes; after processing by the development board, the real-time distance of each radar probe is output as encoded data, which is decoded into real-time ranging information for each ultrasonic radar.
Considering measurement error, transmit power, sensitivity and other factors, the accepted data range is 30 to 250 cm: distances below 30 cm, or a reading of 0, are reported as 30 cm, and distances of 250 cm or more, or no detected object, are reported as 250 cm.
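A small sketch of this clamping rule; the function name is an assumption, and the actual frame decoding depends on the radar's protocol:

```python
# Clamp decoded ultrasonic readings to the 30-250 cm range described above
def clamp_ultrasonic_cm(raw_cm: float) -> float:
    if raw_cm <= 0 or raw_cm < 30:
        return 30.0      # below range, or zero reading
    if raw_cm > 250:
        return 250.0     # above range, or no object detected
    return raw_cm

print([clamp_ultrasonic_cm(v) for v in (0, 12, 87, 260)])  # [30.0, 30.0, 87, 250.0]
```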
The ultrasonic radar uses a 48-braid shielded signal cable and has strong interference resistance; it is powered by 12 V DC; the center frequency is 40 kHz; the transmitted sound pressure is greater than 105 dB @ 30 cm/10 V sine wave; the receiving sensitivity is greater than −82 dB/V/μbar; and the detection angle is 80°.
In the target detection method based on improved YOLOv5 and binocular stereo vision of the invention, an SE module is added to the backbone network of YOLOv5 to select channel-wise feature information and improve feature expression. To improve the detector's recognition effect and address the low vehicle recognition rate, an attention mechanism is fused into the detection network: the attention module CBAM is fused with the Neck of the YOLOv5 network, strengthening feature extraction and making the model focus more on the detected targets. The CIOU loss function replaces the original bounding-box regression loss, addressing the low localization accuracy and slow box regression during training. DIOU-NMS replaces the original NMS, reducing missed detections caused by occlusion and improving detection accuracy under occlusion. The disparity between the left and right camera images is computed based on binocular stereo vision to obtain a depth map; then, combining the target position, category and confidence from the two-dimensional image, the distance, spatial coordinates, physical size and other information of each detected target are computed, enriching perception information and improving reliability. Eight ultrasonic radars installed around the vehicle detect whether objects are present at close range, compensating for the close-range blind area of binocular stereo vision and improving safety. The actual effect is shown in figs. 11-14.
The above embodiments express only several embodiments of the invention, and although their description is specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several variations and improvements without departing from the inventive concept, and these all fall within the protection scope of the invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A target detection method based on improved YOLOv5 and binocular stereo vision is characterized by comprising the following steps:
adding an SE module to the backbone network of YOLOv5, fusing the attention module CBAM with the Neck part of the YOLOv5 network, replacing the original detection-frame regression loss function with the CIOU loss function, replacing the original NMS with DIOU-NMS, replacing the original SiLU activation function after the convolution operations with the Hardswish activation function, training the improved YOLOv5 network, and performing target detection based on the improved YOLOv5 model;
calculating the disparity between the left and right camera images based on binocular stereo vision to obtain a depth map; then, combining the target position, category and confidence in the two-dimensional image, calculating the distance, spatial coordinates and physical size of the detected target, thereby enriching the perception information and improving reliability;
installing ultrasonic radars around the vehicle and analyzing the ultrasonic radar data to detect whether objects are present at close range, thereby compensating for the short-range blind zone of binocular stereo vision;
performing fusion calculation and outputting a perception result, wherein the perception result comprises: type, confidence, three-dimensional coordinates, and physical dimensions (length and width).
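As a rough illustration of the fused perception result listed in claim 1, the sketch below bundles the output fields into one structure; the field names, types and units are assumptions introduced only for readability, not the patent's data format.

```python
# Hedged sketch of the fused perception output (type, confidence, 3-D
# coordinates, physical size). Field names are illustrative assumptions.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class PerceptionResult:
    category: str                      # object type from the improved YOLOv5 detector
    confidence: float                  # detection confidence
    xyz_m: Tuple[float, float, float]  # 3-D coordinates from binocular triangulation (meters)
    size_wh_m: Tuple[float, float]     # physical width and height estimated from the depth map
    near_obstacle: bool                # short-range flag from the ultrasonic radars

result = PerceptionResult("car", 0.91, (1.2, 0.1, 8.5), (1.8, 1.5), near_obstacle=False)
print(result)
```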
2. The target detection method based on improved YOLOv5 and binocular stereo vision according to claim 1, wherein YOLOv5 is divided into four parts: the Input end, the Backbone network, the Neck network and the Head output end;
the Input end performs data preprocessing, including Mosaic data enhancement and adaptive image filling; in addition, to accommodate different data sets, YOLOv5 integrates adaptive anchor-box calculation at the Input end, so that the initial anchor-box sizes are set automatically when the data set is changed;
the Backbone network extracts features of different levels from the image through deep convolution operations, using the bottleneck cross-stage partial structure BottleneckCSP and spatial pyramid pooling SPP; the purpose of BottleneckCSP is to reduce the amount of computation and increase inference speed, while the purpose of SPP is to perform feature extraction at different scales on the same feature map, which helps improve detection accuracy;
the Neck network comprises a feature pyramid network FPN and a path aggregation network PAN; the FPN propagates semantic information from top to bottom through the network, while the PAN propagates localization information from bottom to top, fusing information from different layers of the Backbone and further improving detection capability;
the Head output end is used as a final detection part for predicting targets with different sizes on feature maps with different sizes.
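As an illustrative sketch of the adaptive anchor-box calculation mentioned for the Input end, the snippet below clusters ground-truth box sizes with k-means; the use of scikit-learn, the function names and the example data are assumptions for illustration, not the patent's implementation.

```python
# Hedged sketch: estimate anchor sizes by clustering labelled (width, height)
# pairs, in the spirit of the adaptive anchor computation run at the Input end.
import numpy as np
from sklearn.cluster import KMeans

def estimate_anchors(box_wh: np.ndarray, n_anchors: int = 9) -> np.ndarray:
    """Cluster (width, height) pairs of labelled boxes into n_anchors anchor sizes."""
    km = KMeans(n_clusters=n_anchors, n_init=10, random_state=0).fit(box_wh)
    centers = km.cluster_centers_
    # Sort anchors by area so they can be assigned to the small/medium/large heads
    return centers[np.argsort(centers.prod(axis=1))]

# Example with random box sizes (pixels)
wh = np.random.uniform(10, 300, size=(500, 2))
print(estimate_anchors(wh))
```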
3. The target detection method based on improved YOLOv5 and binocular stereo vision according to claim 1, wherein adding an SE module to a YOLOv5 backbone network specifically comprises:
the SE module uses a one-dimensional vector with as many elements as there are channels as the evaluation score of each channel, and applies these scores to the corresponding channels to weight the output feature maps; the SE module learns the correlation among channels, screens out the channel-wise feature information and improves the feature expression capability;
adding an SE module behind an SPP module in a backbone network;
let the input be W × H × C image data, where W is the image width, H is the image height and C is the number of channels;
the SE module first performs a Squeeze operation on the input, which is a global average pooling; after the global average pooling the feature map becomes a 1 × 1 × C vector;
an Excitation operation is then performed on the 1 × 1 × C vector: through a fully connected layer, the 1 × 1 × C input vector becomes 1 × 1 × (C × SERatio), where SERatio is a scaling factor that reduces the number of channels and the amount of computation; after an activation function, a second fully connected layer and another activation function, a 1 × 1 × C vector is obtained again;
finally a Scale operation is performed: its inputs are the Excitation output 1 × 1 × C and the W × H × C input of the whole module; the per-channel weights are multiplied with the corresponding channels of the input feature map, so that the SE module outputs a feature map in which each channel of the input is weighted by its corresponding weight value.
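A minimal PyTorch sketch of an SE block following the Squeeze, Excitation and Scale steps of claim 3 (global average pooling, two fully connected layers with a reduction ratio corresponding to SERatio, and channel-wise rescaling); this is an illustrative re-implementation under those assumptions, not the patent's code.

```python
# Hedged SE block sketch; names and the default ratio are assumptions.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, se_ratio: float = 0.25):
        super().__init__()
        hidden = max(1, int(channels * se_ratio))   # reduced channel count
        self.pool = nn.AdaptiveAvgPool2d(1)          # Squeeze: W x H x C -> 1 x 1 x C
        self.fc = nn.Sequential(                     # Excitation: C -> C*SERatio -> C
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)                  # per-channel statistics
        w = self.fc(w).view(b, c, 1, 1)              # per-channel weights in (0, 1)
        return x * w                                 # Scale: reweight each channel

x = torch.randn(1, 256, 20, 20)
print(SEBlock(256)(x).shape)   # torch.Size([1, 256, 20, 20])
```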
4. The method for detecting targets based on improved YOLOv5 and binocular stereovision according to claim 1, wherein the fusing of the attention module CBAM with the Neck part of the YOLOv5 network is specifically:
in a CNN, an attention mechanism is applied to the feature map to acquire the attention information available in it, including spatial attention and channel attention information; the convolutional block attention module CBAM attends to spatial and channel information at the same time, and reconstructs the intermediate feature maps of the network through the channel attention module CAM and the spatial attention module SAM, thereby emphasizing important features and suppressing ordinary ones so as to improve the target detection effect;
the feature map obtained from the network is input to the CBAM module, which is divided into two parts: the input feature map is first convolved and then sent to the channel attention module inside the CBAM, and finally the spatial attention module adjusts the features of the input feature map to obtain the output of the whole module;
in the convolution operations of YOLOv5, a layer outputs a three-dimensional feature map F with C channels, height H and width W, i.e. F ∈ R^(C×H×W); CBAM sequentially infers a one-dimensional channel attention Mc and a two-dimensional spatial attention Ms from the feature map F, and multiplies them element by element with F to obtain the output feature map along the channel dimension;
let Mc(F) denote the channel attention reconstruction of the feature map by the CAM module, which outputs the feature map F'; let Ms(F') denote the spatial attention reconstruction of the channel attention output F' by the SAM module;
where ⊗ denotes element-by-element multiplication; this is formulated as follows:
F' = Mc(F) ⊗ F
F'' = Ms(F') ⊗ F'
the channel attention module CAM performs max pooling and average pooling on each channel of the input feature map F at the same time, passes the resulting intermediate vectors through a multilayer perceptron MLP, adds the MLP outputs element by element, and applies a Sigmoid activation; the scaling factor obtained from the Sigmoid is multiplied with the input feature map to give the output of the channel attention module, yielding the channel attention feature map F'; to reduce the amount of computation, the MLP is designed with only one hidden layer;
the spatial attention module SAM performs max pooling and average pooling on the channel attention output F' along the channel direction, concatenates the two pooled outputs, obtains the scaling factor of the spatial attention module through a convolution and a Sigmoid activation function, and multiplies this scaling factor with the output of the channel attention module to give the output of the spatial attention module, yielding the spatial attention feature map F'';
finally, the outputs F' and F'' of the two sub-modules and the input of the CBAM module are added together to obtain the new features output by the whole CBAM module;
the most important role of the attention mechanism is to highlight important information in the feature map and suppress ordinary information; in the YOLOv5 network the most critical part of feature extraction lies in the Backbone network, so the CBAM module is fused at the output of the Backbone, before the features of the Neck network are fused; this design lets feature extraction be completed in the Backbone, lets predictions be output on different feature maps after the Neck feature fusion, and lets the attention reconstruction performed by the CBAM module serve as a bridge between the two stages;
a CBAM module is added before each of the three Neck feature-fusion branches, so that important information in the feature maps is highlighted and, after further feature extraction, predictions are output on the different feature maps, achieving the goal of improving the target detection effect.
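A hedged PyTorch sketch of a CBAM module consistent with claim 4: channel attention from max- and average-pooled descriptors passed through a single-hidden-layer MLP, followed by spatial attention from channel-wise max/avg maps. The additional residual combination of F', F'' and the module input described above is omitted here for brevity; all names and parameter values are illustrative assumptions.

```python
# Hedged CBAM sketch (CAM followed by SAM); not the patent's exact code.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        hidden = max(1, channels // reduction)
        # CAM: shared MLP with a single hidden layer
        self.mlp = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(inplace=True),
                                 nn.Linear(hidden, channels))
        # SAM: 2-channel (max, avg) map -> 1-channel spatial weight
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f.shape
        # Channel attention Mc(F): MLP(maxpool) + MLP(avgpool), then Sigmoid
        mx = self.mlp(torch.amax(f, dim=(2, 3)))
        av = self.mlp(torch.mean(f, dim=(2, 3)))
        mc = torch.sigmoid(mx + av).view(b, c, 1, 1)
        f1 = f * mc                                   # F' = Mc(F) (x) F
        # Spatial attention Ms(F'): conv over channel-wise max/avg maps, then Sigmoid
        sp = torch.cat([torch.amax(f1, dim=1, keepdim=True),
                        torch.mean(f1, dim=1, keepdim=True)], dim=1)
        ms = torch.sigmoid(self.spatial(sp))
        return f1 * ms                                # F'' = Ms(F') (x) F'

print(CBAM(128)(torch.randn(1, 128, 40, 40)).shape)  # torch.Size([1, 128, 40, 40])
```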
5. The target detection method based on the improved YOLOv5 and the binocular stereo vision as claimed in claim 1, wherein the replacing of the original detection frame regression loss function with the CIOU loss function is specifically as follows:
in the aspect of a loss function, a CIOU loss function is adopted for regression of frame information; the IOU loss function considers the overlapping area of the detection frame and the target frame; the GIOU loss function solves the problem when the boundary frames are not overlapped on the basis of the IOU; the DIOU loss function considers the information of the center distance of the bounding box on the basis of the IOU; the CIOU loss function considers the scale information of the width-to-height ratio of the bounding box on the basis of the DIOU;
GIOU first computes the area Ac of the minimum enclosing box that simultaneously contains the prediction box and the ground-truth box; the IOU is obtained from the intersection over the union of the two bounding boxes; next, the proportion of the enclosing box that is not covered by the union area U of the two boxes, i.e. (Ac − U)/Ac, is computed; finally this proportion is subtracted from the initial IOU to obtain GIOU, expressed as follows:
GIOU = IOU − (Ac − U) / Ac
Loss_GIOU = 1 − GIOU
Loss_GIOU is the GIOU loss function; GIOU is symmetric, with a value range of [−1, 1]; it takes the maximum value 1 when the two boxes coincide exactly, and approaches the minimum value −1 when the two boxes do not intersect and are infinitely far apart; GIOU therefore attends not only to the overlapping region but also to the non-overlapping regions, and better reflects the degree of overlap between the two boxes; however, as a bounding-box regression loss, GIOU does not take into account the distance between the box centers or the aspect-ratio scale information of the bounding boxes;
the CIOU can simultaneously consider the overlapping area of the detection frame and the target frame, the center distance of the boundary frame and the width-height ratio of the boundary frame, accelerate the regression speed of the target detection frame in the training process and improve the positioning precision of the boundary frame; the CIOU formula is as follows:
CIOU = IOU − ρ²(b, b^gt) / c² − αv
where b is the predicted bounding box, b^gt is the ground-truth bounding box, and ρ²(b, b^gt) is the squared Euclidean distance between the center points of the prediction box and the ground-truth box; c is the diagonal length of the minimum enclosing region that simultaneously contains the prediction box and the ground-truth box; α is a weighting function and v measures the consistency of the aspect ratios; the formulas for α and v are as follows:
α = v / ((1 − IOU) + v)
v = (4 / π²) · (arctan(w^gt / h^gt) − arctan(w / h))²
the CIOU loss equation is as follows:
Loss_CIOU = 1 − CIOU = 1 − IOU + ρ²(b, b^gt) / c² + αv
where w^gt is the width of the ground-truth bounding box, h^gt is the height of the ground-truth bounding box, w is the width of the predicted bounding box, and h is the height of the predicted bounding box.
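A hedged sketch of the CIOU loss defined above, for axis-aligned boxes given in (x1, y1, x2, y2) format; the tensor layout and the small epsilon terms are assumptions added for numerical stability rather than part of the patent.

```python
# Hedged CIOU loss sketch: IOU term, normalized center-distance term rho^2/c^2,
# and aspect-ratio term alpha*v, as in the formulas above.
import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # Intersection and union for the IOU term
    x1 = torch.max(pred[..., 0], target[..., 0]); y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2]); y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared center distance rho^2 and squared diagonal c^2 of the enclosing box
    cpx = (pred[..., 0] + pred[..., 2]) / 2; cpy = (pred[..., 1] + pred[..., 3]) / 2
    ctx = (target[..., 0] + target[..., 2]) / 2; cty = (target[..., 1] + target[..., 3]) / 2
    rho2 = (cpx - ctx) ** 2 + (cpy - cty) ** 2
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term v and its weight alpha
    w_p = pred[..., 2] - pred[..., 0]; h_p = pred[..., 3] - pred[..., 1]
    w_t = target[..., 2] - target[..., 0]; h_t = target[..., 3] - target[..., 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v

print(ciou_loss(torch.tensor([[0., 0., 10., 10.]]), torch.tensor([[2., 2., 12., 12.]])))
```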
6. The method for object detection based on improved YOLOv5 and binocular stereo vision according to claim 1, wherein replacing the original NMS with DIOU-NMS is specifically:
DIOU-NMS modifies the IOU in NMS into DIOU: the suppression criterion analyzes not only the overlapping area but also the distance between the center points of the two rectangular boxes, which makes it better suited to target detection under occlusion in road traffic scenes;
assuming the set of candidate boxes detected by the model is B and the corresponding set of category confidences is S, the classification scores are updated with respect to the highest-scoring prediction box M according to the following formula:
s_i = s_i,  if IOU − R_DIOU(M, B_i) < ε
s_i = 0,   if IOU − R_DIOU(M, B_i) ≥ ε
R_DIOU refers to the penalty term used in the DIOU-NMS computation; its inputs are s_i and B_i, where i indexes the i-th element of the set during the iterative computation; s_i is the classification score and ε is the NMS threshold; when the IOU − R_DIOU value between the highest-scoring prediction box M and another box B_i is relatively small, the score s_i of B_i is retained; otherwise, when IOU − R_DIOU exceeds the NMS threshold, s_i is set to 0, i.e. the box is filtered out; two rectangular boxes whose center points are far apart may lie on different objects and therefore should not be deleted directly; by analyzing both the IOU of the two rectangular boxes and the distance between their center points before deleting a candidate box B_i, DIOU-NMS improves the accuracy of target detection under occlusion.
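A hedged NumPy sketch of DIOU-NMS as described in claim 6: the greedy NMS loop is unchanged, but the suppression test uses IOU minus the normalized center-distance penalty R_DIOU; variable names and the threshold value are illustrative assumptions.

```python
# Hedged DIOU-NMS sketch; boxes are (N, 4) in x1, y1, x2, y2 format.
import numpy as np

def diou_nms(boxes: np.ndarray, scores: np.ndarray, eps_thr: float = 0.45) -> list:
    """Greedy NMS where a box is suppressed only if IOU - R_DIOU >= eps_thr."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        m = order[0]; keep.append(int(m)); rest = order[1:]
        if rest.size == 0:
            break
        # IOU between the top box M and the remaining boxes Bi
        x1 = np.maximum(boxes[m, 0], boxes[rest, 0]); y1 = np.maximum(boxes[m, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[m, 2], boxes[rest, 2]); y2 = np.minimum(boxes[m, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
        iou = inter / (area(boxes[m]) + area(boxes[rest]) - inter + 1e-7)
        # R_DIOU: squared center distance over squared enclosing-box diagonal
        cm = (boxes[m, :2] + boxes[m, 2:]) / 2; cr = (boxes[rest, :2] + boxes[rest, 2:]) / 2
        rho2 = ((cm - cr) ** 2).sum(axis=1)
        cw = np.maximum(boxes[m, 2], boxes[rest, 2]) - np.minimum(boxes[m, 0], boxes[rest, 0])
        ch = np.maximum(boxes[m, 3], boxes[rest, 3]) - np.minimum(boxes[m, 1], boxes[rest, 1])
        c2 = cw ** 2 + ch ** 2 + 1e-7
        # Keep Bi whenever IOU - R_DIOU stays below the NMS threshold epsilon
        order = rest[(iou - rho2 / c2) < eps_thr]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
print(diou_nms(boxes, np.array([0.9, 0.8, 0.7])))   # -> [0, 2]
```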
7. The target detection method based on improved YOLOv5 and binocular stereo vision according to claim 1, wherein the Hardswish activation function is defined as follows:
Hardswish(x) = 0,              x ≤ −3
Hardswish(x) = x,              x ≥ +3
Hardswish(x) = x(x + 3) / 6,   otherwise
where x is the input value.
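A minimal sketch of this piecewise Hardswish definition; note that PyTorch also ships torch.nn.Hardswish, which implements the same function, so an off-the-shelf layer could be used instead.

```python
# Hardswish sketch: 0 for x <= -3, x for x >= 3, x*(x+3)/6 in between.
import torch

def hardswish(x: torch.Tensor) -> torch.Tensor:
    return x * torch.clamp(x + 3, 0, 6) / 6

x = torch.tensor([-4.0, -1.0, 0.0, 2.0, 5.0])
print(hardswish(x))             # 0, -1/3, 0, 5/3, 5
print(torch.nn.Hardswish()(x))  # same values from the built-in layer
```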
8. The target detection method based on improved YOLOv5 and binocular stereo vision according to claim 1, wherein the intrinsic parameters of the left camera, the intrinsic parameters of the right camera, the rotation and translation parameters between the two cameras, and the tangential and radial distortion coefficients are obtained through binocular calibration;
through binocular rectification, lens distortion is eliminated and the stereo camera pair is converted into the standard form, so that the two images of the same object have the same size and are aligned on the same horizontal line; binocular rectification mainly comprises four parts: first the original images are input; then the calibration parameters such as tangential and radial distortion are used to remove distortion; binocular rectification is performed by the algorithm; and finally the images are cropped to obtain images in the standard form.
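A hedged OpenCV sketch of this rectification step, assuming the calibration outputs (intrinsics K1/K2, distortion coefficients D1/D2, rotation R and translation T) are already available, e.g. from cv2.stereoCalibrate; the variable names and the choice of OpenCV routines are assumptions rather than the patent's exact procedure.

```python
# Hedged binocular rectification sketch with OpenCV.
import cv2
import numpy as np

def rectify_pair(img_l, img_r, K1, D1, K2, D2, R, T):
    h, w = img_l.shape[:2]
    # Compute rectification transforms that put both views on the same epipolar lines
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, D1, K2, D2, (w, h), R, T)
    map1l, map2l = cv2.initUndistortRectifyMap(K1, D1, R1, P1, (w, h), cv2.CV_32FC1)
    map1r, map2r = cv2.initUndistortRectifyMap(K2, D2, R2, P2, (w, h), cv2.CV_32FC1)
    # Remapping removes lens distortion and aligns the pair into the standard form
    rect_l = cv2.remap(img_l, map1l, map2l, cv2.INTER_LINEAR)
    rect_r = cv2.remap(img_r, map1r, map2r, cv2.INTER_LINEAR)
    return rect_l, rect_r, Q   # Q can later convert disparity into 3-D coordinates
```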
9. The target detection method based on improved YOLOv5 and binocular stereo vision according to claim 1, wherein binocular stereo matching finds the corresponding points between the left and right camera images and is performed with the SGBM semi-global matching algorithm; first the matching cost is calculated, i.e. the matching cost between two pixels of the left and right images; the larger the matching cost, the lower the possibility that the two pixels are corresponding points; cost aggregation and disparity computation are then carried out, and finally the disparity is optimized to generate a disparity map.
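A hedged OpenCV sketch of SGBM-based stereo matching; the matching-cost, cost-aggregation and disparity-computation steps are handled inside cv2.StereoSGBM_create, and the parameter values shown are illustrative assumptions rather than the patent's settings.

```python
# Hedged SGBM stereo matching sketch on a rectified grayscale pair.
import cv2

def compute_disparity(rect_l_gray, rect_r_gray, num_disp=128, block_size=5):
    sgbm = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=num_disp,        # must be divisible by 16
        blockSize=block_size,
        P1=8 * block_size ** 2,         # smoothness penalty for small disparity changes
        P2=32 * block_size ** 2,        # smoothness penalty for large disparity changes
        uniquenessRatio=10,
        speckleWindowSize=100,
        speckleRange=2,
    )
    # OpenCV returns fixed-point disparity scaled by 16
    return sgbm.compute(rect_l_gray, rect_r_gray).astype("float32") / 16.0
```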
10. The target detection method based on improved YOLOv5 and binocular stereo vision according to claim 1, wherein binocular distance measurement is performed: given the disparity map, the baseline and the focal length, the corresponding position in world coordinates is calculated through triangulation, i.e. the distance Z is obtained;
disparity: d = x_l − x_r
by the similar-triangles principle:
(T − (x_l − x_r)) / (Z − f) = T / Z
Z = f · T / (x_l − x_r)
Z = f · T / d
where f is the focal length, i.e. the distance from the sensor to the lens; d is the disparity, i.e. the difference between the x-coordinate of the pixel (x_l, y_l) of a spatial point in the left camera and the x-coordinate of the corresponding point (x_r, y_r) in the right camera; and T is the distance between the lenses of the two cameras.
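A minimal sketch applying the triangulation relation Z = f · T / d to a disparity map; units, names and the handling of invalid disparities are assumptions (cv2.reprojectImageTo3D with the Q matrix from rectification would compute the same quantity for the full image).

```python
# Hedged depth-from-disparity sketch using Z = f * T / d.
import numpy as np

def depth_from_disparity(disparity_px: np.ndarray, f_px: float, baseline_m: float) -> np.ndarray:
    """Convert a disparity map (pixels) into depth Z (same units as the baseline)."""
    d = np.where(disparity_px > 0, disparity_px, np.nan)   # mark invalid disparities as NaN
    return f_px * baseline_m / d

# Example: f = 700 px, baseline = 0.12 m, disparity = 10.5 px  ->  Z = 8 m
print(depth_from_disparity(np.array([10.5]), 700.0, 0.12))   # [8.]
```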
CN202210055550.8A 2022-01-18 2022-01-18 Target detection method based on improved YOLOv5 and binocular stereo vision Pending CN114565900A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210055550.8A CN114565900A (en) 2022-01-18 2022-01-18 Target detection method based on improved YOLOv5 and binocular stereo vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210055550.8A CN114565900A (en) 2022-01-18 2022-01-18 Target detection method based on improved YOLOv5 and binocular stereo vision

Publications (1)

Publication Number Publication Date
CN114565900A true CN114565900A (en) 2022-05-31

Family

ID=81711083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210055550.8A Pending CN114565900A (en) 2022-01-18 2022-01-18 Target detection method based on improved YOLOv5 and binocular stereo vision

Country Status (1)

Country Link
CN (1) CN114565900A (en)


Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114998932A (en) * 2022-06-10 2022-09-02 哈工大机器人集团股份有限公司 Pedestrian detection method and system based on YOLOv4
CN115063691A (en) * 2022-07-04 2022-09-16 西安邮电大学 Small target detection method based on feature enhancement under complex scene
CN115063691B (en) * 2022-07-04 2024-04-12 西安邮电大学 Feature enhancement-based small target detection method in complex scene
CN115590584B (en) * 2022-09-06 2023-11-14 汕头大学 Hair follicle taking control method and system based on mechanical arm
CN115590584A (en) * 2022-09-06 2023-01-13 汕头大学(Cn) Hair follicle hair taking control method and system based on mechanical arm
CN115620153A (en) * 2022-12-16 2023-01-17 成都理工大学 Method and device for grading surface acoustic wave mill of track
CN116071309A (en) * 2022-12-27 2023-05-05 中国电子产品可靠性与环境试验研究所((工业和信息化部电子第五研究所)(中国赛宝实验室)) Method, device, equipment and storage medium for detecting sound scanning defect of component
CN116071309B (en) * 2022-12-27 2024-05-17 中国电子产品可靠性与环境试验研究所((工业和信息化部电子第五研究所)(中国赛宝实验室)) Method, device, equipment and storage medium for detecting sound scanning defect of component
CN115908791A (en) * 2023-01-06 2023-04-04 北京铸正机器人有限公司 Pharynx swab sampling method and device
CN116246282A (en) * 2023-02-10 2023-06-09 青海师范大学 Scene Tibetan detection method based on improved double-attention YOLOv7
CN116245732A (en) * 2023-03-13 2023-06-09 江南大学 Yolov 5-based small-target reflective garment identification and detection method
CN116385810A (en) * 2023-06-05 2023-07-04 江西农业大学 Yolov 7-based small target detection method and system
CN116385810B (en) * 2023-06-05 2023-08-15 江西农业大学 Yolov 7-based small target detection method and system
CN116740334B (en) * 2023-06-23 2024-02-06 河北大学 Unmanned aerial vehicle intrusion detection positioning method based on binocular vision and improved YOLO
CN116740334A (en) * 2023-06-23 2023-09-12 河北大学 Unmanned aerial vehicle intrusion detection positioning method based on binocular vision and improved YOLO
CN117017276A (en) * 2023-10-08 2023-11-10 中国科学技术大学 Real-time human body tight boundary detection method based on millimeter wave radar
CN117036363A (en) * 2023-10-10 2023-11-10 国网四川省电力公司信息通信公司 Shielding insulator detection method based on multi-feature fusion
CN117036363B (en) * 2023-10-10 2024-01-30 国网四川省电力公司信息通信公司 Shielding insulator detection method based on multi-feature fusion
CN117557911A (en) * 2023-12-15 2024-02-13 哈尔滨工业大学(威海) Target perception method and system based on multi-sensor image result fusion
CN117689731B (en) * 2024-02-02 2024-04-26 陕西德创数字工业智能科技有限公司 Lightweight new energy heavy-duty battery pack identification method based on improved YOLOv model
CN117689731A (en) * 2024-02-02 2024-03-12 陕西德创数字工业智能科技有限公司 Lightweight new energy heavy-duty truck battery pack identification method based on improved YOLOv5 model
CN118072148A (en) * 2024-04-25 2024-05-24 深圳市威远精密技术有限公司 Precise ball screw pair detection system and method thereof
CN118163880A (en) * 2024-05-14 2024-06-11 中国海洋大学 Building disease detection quadruped robot and detection method
CN118163880B (en) * 2024-05-14 2024-07-30 中国海洋大学 Building disease detection quadruped robot and detection method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination