CN114565900A - Target detection method based on improved YOLOv5 and binocular stereo vision - Google Patents


Info

Publication number
CN114565900A
CN114565900A · CN202210055550.8A · CN202210055550A
Authority
CN
China
Prior art keywords
module
yolov5
attention
network
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210055550.8A
Other languages
Chinese (zh)
Inventor
黎国溥
陈升东
袁峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Institute of Software Application Technology Guangzhou GZIS
Original Assignee
Guangzhou Institute of Software Application Technology Guangzhou GZIS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Institute of Software Application Technology Guangzhou GZIS filed Critical Guangzhou Institute of Software Application Technology Guangzhou GZIS
Priority to CN202210055550.8A
Publication of CN114565900A
Legal status: Pending

Classifications

    • G01S 15/86: Combinations of sonar systems with lidar systems; combinations of sonar systems with systems not using wave reflection
    • G01S 15/931: Sonar systems specially adapted for anti-collision purposes of land vehicles
    • G06F 18/23213: Pattern recognition; non-hierarchical clustering techniques using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
    • G06F 18/253: Pattern recognition; fusion techniques of extracted features
    • G06N 3/045: Neural networks; architectures; combinations of networks
    • G06N 3/048: Neural networks; activation functions
    • G06N 3/08: Neural networks; learning methods
    • Y02T 10/40: Road transport; engine management systems (climate change mitigation technologies related to transportation)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Probability & Statistics with Applications (AREA)
  • Acoustics & Sound (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method based on improved YOLOv5 and binocular stereo vision. An SE module is added to the backbone network of YOLOv5 to select channel-wise feature information and improve feature expression; the attention module CBAM is fused into the Neck of the YOLOv5 network, strengthening feature extraction and making the model focus more on the detected targets; the CIOU loss function replaces the original bounding-box regression loss, addressing the low localization accuracy and slow box regression during training; DIOU-NMS replaces the original NMS, improving detection accuracy under occlusion. Binocular rectification, stereo matching and binocular ranging are performed; ultrasonic radar data is parsed; and fusion calculation outputs perception information such as type, confidence, three-dimensional coordinates and physical length and width, improving safety.

Description

Target detection method based on improved YOLOv5 and binocular stereo vision
Technical Field
The invention relates to the technical field of automatic driving in intelligent transportation, in particular to a target detection method based on improved YOLOv5 and binocular stereo vision.
Background
Many people die in traffic accidents in China every year, bringing the pain of casualties to their families. In recent years, automatic driving aimed at significantly reducing accidents has become a social need. The core of an automatic driving system can be divided into three parts: perception, planning and control. Perception collects information from the vehicle's driving environment and extracts the knowledge needed for subsequent planning and control; it is a fundamental link in implementing automatic driving technology. Three-dimensional target detection is an important branch of the environment perception system in the automatic driving field and is of great research significance for traffic safety.
Traditional target detection methods are mainly based on hand-crafted feature learning. Region selection by traversing the image incurs high time complexity, and the extracted features are not robust to variations in object shape, illumination and background. To overcome the limitations of traditional machine learning, researchers introduced deep-learning-based convolutional neural networks (CNNs). Compared with traditional methods, a convolutional neural network can accurately extract suitable features without specially designed descriptors. CNN-based detection methods fall into two main categories: one-stage and two-stage. Two-stage methods, represented by Faster R-CNN, share convolutional features and use an RPN to generate proposal boxes at the feature level, then classify and regress the target boxes from the proposal-region features; they are accurate but slow. One-stage detectors, represented by YOLO, predict the localization and classification of target boxes in a single pass at the output layer by regression, and are widely used in detection tasks because of their high speed.
As the latest version of the YOLO series, YOLOv5 performs noticeably better than previous versions, but its detection accuracy is still insufficient against today's complex environmental backgrounds.
In automatic driving scenarios, two-dimensional target detection cannot provide all the information needed to perceive the environment: it only gives the position of a target in the two-dimensional image and the confidence of the corresponding category, whereas objects in the real three-dimensional world have three-dimensional shapes, and most applications require their spatial coordinates and physical dimensions.
In vision-based target detection, the limited field of view of the camera also leaves a blind area in the detection range.
Disclosure of Invention
In view of this, the invention provides a target detection method based on improved YOLOv5 and binocular stereo vision, aimed at low-speed automatic driving in campus scenarios, where the autonomous vehicle must quickly and accurately recognize vehicles, pedestrians, obstacles and other targets ahead so that it can avoid obstacles autonomously while cruising along its track.
The invention solves the problems through the following technical means:
a target detection method based on improved YOLOv5 and binocular stereo vision comprises the following steps:
adding an SE module to the backbone network of YOLOv5, fusing the attention module CBAM into the Neck of the YOLOv5 network, replacing the original bounding-box regression loss with the CIOU loss function, replacing the original NMS with DIOU-NMS, replacing the original SiLU activation function after the convolution operations with the Hardswish activation function, training the improved YOLOv5 network, and performing target detection with the improved YOLOv5 model;
computing the disparity between the left and right camera images based on binocular stereo vision to obtain a depth map, and then combining the target position, category and confidence from the two-dimensional image to compute the distance, spatial coordinates and physical size of each detected target, which enriches the perception information and improves reliability;
installing ultrasonic radars around the vehicle and parsing the ultrasonic radar data to detect whether objects are present at close range, compensating for the close-range blind area of binocular stereo vision;
performing fusion calculation and outputting the perception result, including: type, confidence, three-dimensional coordinates, and physical length and width.
Further, the structure of YOLOv5 is divided into four parts: the Input end, the Backbone network, the Neck network and the Head output end;
the Input end performs data preprocessing, including Mosaic data augmentation and adaptive image padding; to accommodate different data sets, YOLOv5 also integrates adaptive anchor-box calculation at the Input end, which automatically sets the initial anchor size when the data set changes;
the Backbone network extracts features of different levels from the image through deep convolution, using the bottleneck cross-stage partial structure BottleneckCSP and spatial pyramid pooling SPP; BottleneckCSP reduces computation and speeds up inference, while SPP extracts features of different scales from the same feature map, which helps improve detection accuracy;
the Neck network contains a feature pyramid network FPN and a path aggregation network PAN; the FPN passes semantic information top-down and the PAN passes localization information bottom-up, fusing information from different layers of the Backbone and further improving detection capability;
the Head output end serves as the final detection stage and predicts targets of different sizes on feature maps of different sizes.
Further, adding an SE module to the YOLOv5 backbone network specifically includes:
the SE module uses a one-dimensional vector with as many elements as there are channels as the evaluation score of each channel, and applies these scores to the corresponding channels to process the output feature map; the SE module learns the correlation between channels, selects channel-wise feature information, and improves feature expression;
the SE module is added behind the SPP module in the backbone network;
let the input be W × H × C image data, where W is the image width, H the image height and C the number of image channels;
the SE module first applies a Squeeze operation to the input, a global average pooling that turns the feature map into a 1×1×C vector;
an Excitation operation is then applied to the 1×1×C vector: a fully connected layer reduces it to 1×1×(C×SERatio), where SERatio is a scaling factor that reduces the number of channels to save computation, an activation function is applied, and a second fully connected layer followed by an activation function restores a 1×1×C vector;
finally, a Scale operation takes the excitation output 1×1×C and the W × H × C input of the whole module and multiplies them channel by channel, so the SE module outputs the input feature map reweighted by the learned per-channel weights.
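The patent does not provide code; as a rough illustration of the Squeeze, Excitation and Scale steps just described, the following PyTorch sketch is one possible form (the module name, SERatio value and the channel count in the example are assumptions):

```python
# Illustrative SE block sketch (assumed names and hyperparameters, not the patent's code)
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, se_ratio: int = 16):
        super().__init__()
        # Squeeze: global average pooling turns W x H x C into a 1x1xC vector
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        # Excitation: FC -> ReLU -> FC -> Sigmoid produces one weight per channel
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // se_ratio),
            nn.ReLU(inplace=True),
            nn.Linear(channels // se_ratio, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)       # squeezed 1x1xC vector
        w = self.excite(w).view(b, c, 1, 1)  # per-channel weights in (0, 1)
        return x * w                         # Scale: reweight each input channel

# Example: reweighting a feature map such as the one produced after the SPP module
feat = torch.randn(1, 512, 20, 20)
print(SEBlock(512)(feat).shape)  # torch.Size([1, 512, 20, 20])
```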
Further, fusing the attention module CBAM with the Neck part of the YOLOv5 network specifically includes:
in a CNN, an attention mechanism acts on the feature map to obtain the attention information available in it, including spatial attention and channel attention; the convolutional attention module CBAM attends to spatial and channel information simultaneously, reconstructing intermediate feature maps of the network through the channel attention module CAM and the spatial attention module SAM, emphasizing important features and suppressing general ones, thereby improving the target detection effect;
the feature map obtained from the network is input to the CBAM module, which consists of two parts: the input feature map is first convolved and then sent to the channel attention module, after which the spatial attention module adjusts the features to produce the output of the whole module;
let a layer of the YOLOv5 convolution pipeline output a three-dimensional feature map F with C channels, height H and width W, i.e. F ∈ R^(C×H×W); CBAM sequentially infers a one-dimensional channel attention Mc and a two-dimensional spatial attention Ms from F, each multiplied element-wise with the feature map, finally giving an output feature map of the same dimensions as F;
let Mc(F) denote the channel attention reconstruction of the feature map by the CAM module, producing the feature map F'; let Ms(F') denote the spatial attention reconstruction of the channel attention output F' by the SAM module; ⊗ denotes element-wise multiplication; the formulas are:
F' = Mc(F) ⊗ F
F'' = Ms(F') ⊗ F'
the channel attention module CAM performs max pooling and average pooling on each channel of the input feature map F, passes the resulting vectors through a multilayer perceptron MLP, adds the two MLP outputs element-wise, and applies a Sigmoid activation to obtain a channel scaling factor, which is multiplied with the input feature map to give the channel attention feature map F'; to reduce computation, the MLP has only one hidden layer;
the spatial attention module SAM performs max pooling and average pooling on the channel attention output F' along the channel direction, concatenates the two results, obtains a spatial scaling factor through a convolution and a Sigmoid activation, and multiplies it with the channel attention output to give the spatial attention feature map F'';
finally, the outputs F' and F'' of the two modules are added to the input of the CBAM module to obtain the new features output by the whole CBAM module;
the key operation of the attention mechanism is to highlight important information in the feature map and suppress general information; in the YOLOv5 network the most critical feature extraction happens in the Backbone, so the CBAM modules are fused at the Backbone outputs, before the Neck feature fusion; the design consideration is that feature extraction is completed in the Backbone and prediction is output on different feature maps after the Neck fusion, so attention reconstruction by CBAM at this point links the extracted features to the subsequent predictions;
a CBAM module is added before each of the three Neck feature-fusion branches, highlighting important information in the feature maps; together with the later further feature extraction and prediction on the different feature maps, this improves the target detection effect.
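As an illustrative sketch of the CAM and SAM operations described above, the following PyTorch code shows the standard CBAM form; the reduction ratio, kernel size and channel counts are assumptions, and the residual addition of F', F'' and the module input mentioned above can be placed around this block:

```python
# Illustrative CBAM sketch (assumed hyperparameters; not the patent's code)
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # One-hidden-layer MLP shared by the max-pooled and average-pooled vectors
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        scale = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * scale                     # F' = Mc(F) (x) F

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)    # average pooling along channels
        mx = x.amax(dim=1, keepdim=True)     # max pooling along channels
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale                     # F'' = Ms(F') (x) F'

class CBAM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))           # channel attention, then spatial attention

print(CBAM(256)(torch.randn(1, 256, 40, 40)).shape)  # torch.Size([1, 256, 40, 40])
```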
Further, replacing the original bounding-box regression loss with the CIOU loss function specifically includes:
for the loss function, the CIOU loss is adopted for box regression; the IOU loss considers the overlapping area of the detection box and the target box; the GIOU loss additionally handles the case where the boxes do not overlap; the DIOU loss adds the information of the distance between box centers; and the CIOU loss further adds the aspect-ratio scale information of the boxes;
GIOU first computes the area Ac of the smallest box that simultaneously encloses the prediction box and the ground-truth box; the IOU is obtained from the union U of the two boxes; the proportion of the enclosing area that is not covered by the union is then computed; finally, subtracting this ratio from the initial IOU gives the GIOU, with the formula:
GIOU = IOU − (Ac − U)/Ac
Loss_GIOU = 1 − GIOU
Loss_GIOU is the GIOU loss function; GIOU is symmetric with a value range of [−1, 1], reaching the maximum 1 when the two boxes coincide and the minimum −1 when they do not intersect and are infinitely far apart; GIOU attends not only to the overlapping region but also to the non-overlapping regions, and better reflects how well the two boxes coincide; however, as a bounding-box regression loss, GIOU considers neither the distance between box centers nor the aspect-ratio scale of the boxes;
CIOU simultaneously considers the overlapping area of the detection box and the target box, the distance between box centers, and the aspect ratio of the boxes, accelerating the regression of the target detection box during training and improving localization accuracy; the CIOU formula is:
CIOU = IOU − ρ²(b, b^gt)/c² − αv
where b is the predicted bounding box, b^gt is the ground-truth bounding box, ρ²(b, b^gt) is the squared Euclidean distance between the center points of the predicted and ground-truth boxes, and c is the diagonal length of the smallest enclosing region that contains both boxes; α is a weighting function and v measures the consistency of the aspect ratios; α and v are given by:
α = v / ((1 − IOU) + v)
v = (4/π²) · (arctan(w^gt/h^gt) − arctan(w/h))²
the CIOU loss is:
Loss_CIOU = 1 − CIOU = 1 − IOU + ρ²(b, b^gt)/c² + αv
where w^gt and h^gt are the width and height of the ground-truth box, and w and h are the width and height of the predicted box.
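A small NumPy sketch of the CIOU loss defined by the formulas above; the (x1, y1, x2, y2) box format and the example values are assumptions, and this is an illustrative re-implementation rather than the patent's code:

```python
# CIOU loss sketch following the formulas in the text (illustrative only)
import math
import numpy as np

def ciou_loss(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    # Intersection-over-union of the two boxes
    ix1, iy1 = max(pred[0], target[0]), max(pred[1], target[1])
    ix2, iy2 = min(pred[2], target[2]), min(pred[3], target[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    wgt, hgt = target[2] - target[0], target[3] - target[1]
    union = w * h + wgt * hgt - inter + eps
    iou = inter / union

    # rho^2: squared distance between box centers; c^2: squared diagonal of the
    # smallest enclosing box
    rho2 = ((pred[0] + pred[2]) - (target[0] + target[2])) ** 2 / 4 + \
           ((pred[1] + pred[3]) - (target[1] + target[3])) ** 2 / 4
    cx1, cy1 = min(pred[0], target[0]), min(pred[1], target[1])
    cx2, cy2 = max(pred[2], target[2]), max(pred[3], target[3])
    c2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2 + eps

    # Aspect-ratio consistency term v and its weight alpha
    v = (4 / math.pi ** 2) * (math.atan(wgt / hgt) - math.atan(w / h)) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v

print(ciou_loss(np.array([0, 0, 10, 10.0]), np.array([2, 2, 12, 12.0])))
```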
Further, replacing the original NMS with DIOU-NMS specifically includes:
DIOU-NMS replaces the IOU in the NMS with the DIOU, so the suppression criterion analyses the overlapping area and also the distance between the center points of two boxes, which is better suited to target detection under occlusion in road traffic scenes;
assume the model outputs a candidate box set B with the corresponding category confidence set s; the classification scores are updated with respect to the highest-scoring prediction box M as follows:
s_i = s_i, if IOU − R_DIOU(M, B_i) < ε
s_i = 0, if IOU − R_DIOU(M, B_i) ≥ ε
R_DIOU denotes the center-distance penalty term used by DIOU-NMS; its inputs are M and the candidates B_i, where i indexes the set iterated over; s_i is the classification score and ε is the NMS threshold; when the value IOU − R_DIOU between the highest-scoring box M and a candidate B_i is small, the score s_i of B_i is kept; otherwise, when IOU − R_DIOU exceeds the NMS threshold, s_i is set to 0 and the box is filtered out; two boxes whose centers are far apart may lie on different objects and therefore should not be deleted directly; by analysing both the IOU of the two boxes and the distance between their center points before deleting a candidate B_i, DIOU-NMS improves the accuracy of target detection under occlusion.
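An illustrative NumPy sketch of the DIOU-NMS suppression rule above; the box format, the threshold value and the function name are assumptions:

```python
# DIOU-NMS sketch: suppress a candidate only when IoU minus the center-distance
# penalty exceeds the NMS threshold (illustrative, not the patent's code)
import numpy as np

def diou_nms(boxes: np.ndarray, scores: np.ndarray, thresh: float = 0.5) -> list:
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        m = order[0]                      # highest-scoring box M
        keep.append(int(m))
        rest = order[1:]
        if rest.size == 0:
            break
        # IoU between M and the remaining candidates
        ix1 = np.maximum(boxes[m, 0], boxes[rest, 0])
        iy1 = np.maximum(boxes[m, 1], boxes[rest, 1])
        ix2 = np.minimum(boxes[m, 2], boxes[rest, 2])
        iy2 = np.minimum(boxes[m, 3], boxes[rest, 3])
        inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
        area_m = (boxes[m, 2] - boxes[m, 0]) * (boxes[m, 3] - boxes[m, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_m + area_r - inter + 1e-7)
        # Center-distance penalty R_DIOU = d^2 / c^2
        cxm, cym = (boxes[m, 0] + boxes[m, 2]) / 2, (boxes[m, 1] + boxes[m, 3]) / 2
        cxr, cyr = (boxes[rest, 0] + boxes[rest, 2]) / 2, (boxes[rest, 1] + boxes[rest, 3]) / 2
        d2 = (cxm - cxr) ** 2 + (cym - cyr) ** 2
        ex1 = np.minimum(boxes[m, 0], boxes[rest, 0]); ey1 = np.minimum(boxes[m, 1], boxes[rest, 1])
        ex2 = np.maximum(boxes[m, 2], boxes[rest, 2]); ey2 = np.maximum(boxes[m, 3], boxes[rest, 3])
        c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-7
        # Keep candidates whose IoU - R_DIOU stays below the threshold
        order = rest[(iou - d2 / c2) < thresh]
    return keep
```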
Further, the Hardswish activation function is defined as follows:
Hardswish(x) = 0, if x ≤ −3
Hardswish(x) = x, if x ≥ +3
Hardswish(x) = x·(x + 3)/6, otherwise
where x is the input value.
Further, binocular calibration provides the intrinsic parameters of the left camera and the right camera, and the rotation, translation, tangential distortion and radial distortion parameters between the two cameras;
binocular rectification removes lens distortion and converts the stereo camera pair into the standard form, so that the two images of the same object have the same size and are row-aligned along the same horizontal line; rectification mainly comprises 4 steps: the original images are input, the calibration parameters such as tangential and radial distortion are obtained and the distortion is removed, the images are rectified by the algorithm, and finally they are cropped to obtain images in the standard form.
Further, binocular stereo matching finds the corresponding points between the left and right camera images; stereo matching uses the SGBM (Semi-Global Block Matching) algorithm, a semi-global matching method; first the matching cost between pixel pairs in the left and right images is computed; the larger the matching cost, the lower the probability that the two pixels are corresponding points; cost aggregation and disparity computation follow, and finally the disparity is optimized to generate the disparity map.
Further, for binocular ranging, given the disparity map, baseline and focal length, the corresponding position in world coordinates, i.e. the distance Z, is computed by triangulation;
disparity: d = x_l − x_r
by similar triangles:
(T − (x_l − x_r)) / (Z − f) = T / Z
Z = f·T/(x_l − x_r)
Z = f·T/d
where f is the focal length, i.e. the distance between the sensor and the lens; d is the disparity, i.e. the difference between the x-coordinate of the same spatial point at the left camera pixel (x_l, y_l) and the x-coordinate of its corresponding point (x_r, y_r) in the right camera; T is the distance between the lenses of the two cameras.
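A minimal numerical sketch of the Z = f·T/d relation above; the focal length, baseline and pixel coordinates are made-up example values:

```python
# Depth from disparity: Z = f * T / d (illustrative example values)
def depth_from_disparity(f_px: float, baseline_m: float, x_left: float, x_right: float) -> float:
    d = x_left - x_right          # disparity in pixels
    if d <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    return f_px * baseline_m / d  # Z = f * T / d

# Example: f = 700 px, baseline T = 0.12 m, disparity 14 px -> Z = 6.0 m
print(depth_from_disparity(700.0, 0.12, 520.0, 506.0))
```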
Compared with the prior art, the beneficial effects of the invention include at least:
1. The improved YOLOv5 model strengthens feature extraction, improves localization accuracy, shortens the regression time of the target detection box during training, improves detection accuracy under occlusion, and improves the detector's recognition effect.
2. Based on binocular stereo vision, the distance, spatial coordinates, physical size and other information of each detected target are provided, enriching perception information and improving reliability.
3. Eight ultrasonic radars installed around the vehicle detect whether objects are present at close range, compensating for the close-range blind area of binocular stereo vision and improving safety.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of the key steps of the target detection method based on improved YOLOv5 and binocular stereo vision of the present invention;
FIG. 2 is a system flow diagram of the object detection method of the present invention based on improved YOLOv5 and binocular stereo vision;
FIG. 3 is a diagram of the YOLOv5 network architecture according to the present invention;
FIG. 4 is a block diagram of the YOLOv5 component of the present invention;
FIG. 5 is a block diagram of the improved YOLOv5 backbone network of the present invention;
FIG. 6 is a CBAM structural diagram of the convolution attention module of the present invention;
FIG. 7 is a block diagram of an improved YOLOv5 Neck network of the present invention;
FIG. 8 is a diagram of the improved YOLOv5 network architecture of the present invention;
FIG. 9 is a block diagram of the improved YOLOv5 module of the present invention;
FIG. 10 is a schematic diagram of the binocular range finding of the present invention;
FIG. 11 is a diagram of the target detection effect of the improved YOLOv5 model;
FIG. 12 is a depth effect map based on binocular stereo vision of the present invention, wherein the left side is a real-time camera image and the right side is a depth map;
FIG. 13 is a depth effect map based on binocular stereo vision of the present invention, wherein the left side is the real-time camera image and the right side is the depth map;
fig. 14 is a depth effect map based on binocular stereo vision of the present invention, in which the real-time camera image is on the left and the depth map is on the right.
Detailed Description
In order to make the above objects, features and advantages of the invention easier to understand, embodiments are described in detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the invention, not all of them; all other embodiments obtained by those skilled in the art without creative work on the basis of these embodiments fall within the protection scope of the invention.
As shown in fig. 1, the present invention provides a target detection method based on improved YOLOv5 and binocular stereo vision, comprising the following steps:
adding an SE module to the backbone network of YOLOv5, fusing the attention module CBAM into the Neck of the YOLOv5 network, replacing the original bounding-box regression loss with the CIOU loss function, replacing the original NMS with DIOU-NMS, replacing the original SiLU activation function after the convolution operations with the Hardswish activation function, training the improved YOLOv5 network, and performing target detection with the improved YOLOv5 model;
computing the disparity between the left and right camera images based on binocular stereo vision to obtain a depth map, and then combining the target position, category and confidence from the two-dimensional image to compute the distance, spatial coordinates and physical size of each detected target, which enriches the perception information and improves reliability;
installing ultrasonic radars around the vehicle and parsing the ultrasonic radar data to detect whether objects are present at close range, compensating for the close-range blind area of binocular stereo vision;
performing fusion calculation and outputting the perception result, including: type, confidence, three-dimensional coordinates, and physical length and width.
The invention uses two sensors: a binocular camera and ultrasonic radar. The binocular camera outputs the left and right camera images simultaneously, and the ultrasonic radar outputs close-range obstacle information. The core processing comprises three parts: the first performs target detection with the improved YOLOv5 model; the second performs binocular rectification, stereo matching and binocular ranging; the third parses the ultrasonic radar data. Finally the three parts of data are fused, and perception information such as type, confidence, three-dimensional coordinates and physical length and width is output. The system flow is shown in fig. 2.
The image vision model based on binocular stereo vision and target detection covers a detection range of 1.2 m to 8.5 m. The ultrasonic radar outputs obstacle distances over a detection range of 0.3 m to 2.5 m. After the image vision and ultrasonic radar are fused, the ultrasonic perception is used at close range and the image vision perception at medium range; in the region where the two overlap, a priority and confidence strategy produces reliable perception information.
The fused ultrasonic information compensates for the close-range blind area of the image vision model; when weather conditions are poor or the camera is occluded, the ultrasonic radar detection still provides close-range perception, improving system safety.
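A hedged sketch of this range-based fusion: ultrasonic readings cover 0.3 m to 2.5 m, image vision covers 1.2 m to 8.5 m, and in the 1.2 m to 2.5 m overlap a priority/confidence rule selects the source. The concrete priority rule, the confidence threshold and the disagreement tolerance are assumptions for illustration:

```python
# Range-based sensor fusion sketch (thresholds and priority rule are assumptions)
def fuse_distance(ultra_dist_m, vision_dist_m, vision_conf, conf_thresh: float = 0.5):
    """Return (distance, source) for one detected object, or (None, None)."""
    ultra_ok = ultra_dist_m is not None and 0.3 <= ultra_dist_m <= 2.5
    vision_ok = (vision_dist_m is not None and 1.2 <= vision_dist_m <= 8.5
                 and vision_conf >= conf_thresh)
    if ultra_ok and vision_ok:
        # Overlap region: prefer the ultrasonic reading at close range, fall back to
        # vision when the two readings disagree strongly
        if abs(ultra_dist_m - vision_dist_m) < 0.5:
            return ultra_dist_m, "ultrasonic"
        return vision_dist_m, "vision"
    if ultra_ok:
        return ultra_dist_m, "ultrasonic"
    if vision_ok:
        return vision_dist_m, "vision"
    return None, None

print(fuse_distance(1.8, 2.0, 0.9))   # overlap region
print(fuse_distance(None, 5.4, 0.8))  # vision only
```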
A first part: improved YOLOv5 model
From analysis of the original YOLOv5 framework, the structure of YOLOv5 is mainly divided into four parts: the Input end, the Backbone network, the Neck network and the Head output end, as shown in fig. 3.
The YOLOv5 network structure component diagram is shown in fig. 4:
The Input end mainly performs data preprocessing, including Mosaic data augmentation and adaptive image padding; to accommodate different data sets, YOLOv5 integrates adaptive anchor-box calculation at the Input end, which automatically sets the initial anchor size when the data set changes.
The Backbone network extracts features of different levels from the image through deep convolution, mainly using the bottleneck cross-stage partial structure BottleneckCSP and spatial pyramid pooling SPP; BottleneckCSP reduces computation and speeds up inference, while SPP extracts features of different scales from the same feature map, which helps improve detection accuracy.
The Neck network contains a feature pyramid network FPN and a path aggregation network PAN; the FPN passes semantic information top-down and the PAN passes localization information bottom-up, fusing information from different layers of the Backbone and further improving detection capability.
The Head output end serves as the final detection stage and is mainly used to predict targets of different sizes on feature maps of different sizes.
The improvement points are as follows:
1. An SE module is added to the backbone network of YOLOv5 to select channel-wise feature information and improve feature expression.
2. To improve the detector's recognition effect and address the low vehicle recognition rate, an attention mechanism is fused into the detection network: the attention module CBAM is fused with the Neck of the YOLOv5 network, strengthening feature extraction and making the model focus more on the detected targets.
3. The CIOU loss function replaces the original bounding-box regression loss, addressing the low localization accuracy and slow box regression during training.
4. DIOU-NMS replaces the original NMS, reducing missed detections caused by target occlusion and improving detection accuracy under occlusion.
5. The Hardswish activation function replaces the original SiLU activation function after the convolution operations; it is numerically stable and fast to compute.
An SE module is added to the backbone network of YOLOv5 to select channel-wise feature information and improve feature expression.
The SE (Squeeze-and-Excitation) module uses a one-dimensional vector with as many elements as there are channels as the evaluation score of each channel, and applies these scores to the corresponding channels to process the output feature map. Its main role is to learn the correlation between channels, select channel-wise feature information, and improve feature expression.
The SE module is added behind the SPP module in the backbone network; the improved backbone and its components are shown in fig. 5.
Let the input be W × H × C image data, where W is the image width, H the image height and C the number of image channels.
The SE module first applies a Squeeze operation to the input, a global average pooling that turns the feature map into a 1×1×C vector;
an Excitation operation is then applied to the 1×1×C vector: a fully connected layer reduces it to 1×1×(C×SERatio), where SERatio is a scaling factor that reduces the number of channels to save computation, an activation function is applied, and a second fully connected layer followed by an activation function restores a 1×1×C vector.
Finally, a Scale operation takes the excitation output 1×1×C and the W × H × C input of the whole module and multiplies them channel by channel, so the SE module outputs the input feature map reweighted by the learned per-channel weights.
The invention fuses the attention module CBAM with the Neck part of the YOLOv5 network.
In a CNN, attention acts on the feature map to obtain the attention information available in it, mainly spatial attention and channel attention. The Convolutional Block Attention Module (CBAM) attends to both spatial and channel information, reconstructing intermediate feature maps of the network through the channel attention module CAM (Channel Attention Module) and the spatial attention module SAM (Spatial Attention Module), emphasizing important features and suppressing general ones to improve the target detection effect.
The feature map obtained from the network is input to the CBAM module, which can be divided into two parts: the input feature map is first convolved and then sent to the channel attention module, after which the spatial attention module adjusts the features to produce the output of the whole module.
In the convolution pipeline of YOLOv5, let a layer output a three-dimensional feature map F with C channels, height H and width W, i.e. F ∈ R^(C×H×W). CBAM sequentially infers the one-dimensional channel attention Mc and the two-dimensional spatial attention Ms from F, multiplies each element-wise with the feature map, and finally obtains an output feature map with the same dimensions as F.
Let Mc(F) denote the channel attention reconstruction of the feature map by the CAM module, producing the feature map F'; let Ms(F') denote the spatial attention reconstruction of the channel attention output F' by the SAM module; ⊗ denotes element-wise multiplication; the formulas are:
F' = Mc(F) ⊗ F
F'' = Ms(F') ⊗ F'
The structures of the convolutional attention module CBAM, the channel attention module CAM and the spatial attention module SAM are shown in fig. 6.
The channel attention module CAM performs max pooling and average pooling on each channel of the input feature map F, passes the resulting vectors through a Multi-Layer Perceptron (MLP), adds the two MLP output vectors element-wise, and applies a Sigmoid activation to obtain a channel scaling factor, which is multiplied with the input feature map to give the channel attention feature map F'. To reduce computation, the MLP has only one hidden layer.
The spatial attention module SAM performs max pooling and average pooling on the channel attention output F' along the channel direction, concatenates the two results, obtains a spatial scaling factor through a convolution and a Sigmoid activation, and multiplies it with the channel attention output to give the spatial attention feature map F''.
Finally, the outputs F' and F'' of the two modules are added to the input of the CBAM module to obtain the new features output by the whole CBAM module.
The key operation of the attention mechanism is to highlight important information in the feature map and suppress general information. In the YOLOv5 network the most critical feature extraction happens in the Backbone, so the invention fuses the CBAM modules at the Backbone outputs, before the Neck feature fusion; the design consideration is that feature extraction is completed in the Backbone and prediction is output on different feature maps after the Neck fusion, so attention reconstruction by CBAM at this point links the extracted features to the subsequent predictions. The specifically improved structure is shown in fig. 7.
A CBAM module is added before each of the three Neck feature-fusion branches, highlighting important information in the feature maps; together with the later further feature extraction and prediction on the different feature maps, this improves the target detection effect.
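A schematic PyTorch sketch of this placement: one CBAM in front of each of the three Neck fusion branches, applied to the three backbone output scales before the FPN/PAN fusion. The channel sizes and the wrapper class are assumptions; a real implementation would wire this into the YOLOv5 Neck definition:

```python
# Placement sketch: CBAM applied to each backbone output before Neck fusion (assumed sizes)
import torch
import torch.nn as nn

class NeckWithCBAM(nn.Module):
    def __init__(self, cbam_factory, channels=(256, 512, 1024)):
        super().__init__()
        # One CBAM per backbone output scale, inserted before the Neck fusion branches
        self.cbam = nn.ModuleList(cbam_factory(c) for c in channels)

    def forward(self, p3, p4, p5):
        # Attention reconstruction of the three backbone outputs
        p3, p4, p5 = self.cbam[0](p3), self.cbam[1](p4), self.cbam[2](p5)
        # ... FPN/PAN fusion of p3, p4, p5 and the three detection heads would follow here ...
        return p3, p4, p5

# Smoke test with CBAM stubbed out by nn.Identity; real use would pass the CBAM class above
outs = NeckWithCBAM(lambda c: nn.Identity())(
    torch.randn(1, 256, 80, 80), torch.randn(1, 512, 40, 40), torch.randn(1, 1024, 20, 20))
print([tuple(o.shape) for o in outs])
```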
The CIOU loss function replaces the original bounding-box regression loss, addressing the low localization accuracy and slow regression of the target detection box during training.
For the loss function, the CIOU loss is adopted for box regression. The IOU loss considers the overlapping area of the detection box and the target box. The GIOU loss additionally handles the case where the boxes do not overlap. The DIOU loss adds the information of the distance between box centers. The CIOU loss further adds the aspect-ratio scale information of the boxes.
GIOU first computes the area Ac of the smallest box that simultaneously encloses the prediction box and the ground-truth box; the IOU is obtained from the union U of the two boxes; the proportion of the enclosing area that is not covered by the union is then computed; finally, subtracting this ratio from the initial IOU gives the GIOU, expressed as:
GIOU = IOU − (Ac − U)/Ac
Loss_GIOU = 1 − GIOU
Loss_GIOU is the GIOU loss function; GIOU is symmetric with a value range of [−1, 1]. It reaches the maximum 1 when the two boxes coincide and the minimum −1 when they do not intersect and are infinitely far apart. GIOU attends not only to the overlapping region but also to the non-overlapping regions, and better reflects how well the two boxes coincide. However, as a bounding-box regression loss, GIOU considers neither the distance between box centers nor the aspect-ratio scale of the boxes.
CIOU simultaneously considers the overlapping area of the detection box and the target box, the distance between box centers, and the aspect ratio of the boxes, accelerating the regression of the target detection box during training and improving localization accuracy. The CIOU formula is:
CIOU = IOU − ρ²(b, b^gt)/c² − αv
where b is the predicted bounding box, b^gt is the ground-truth bounding box, ρ²(b, b^gt) is the squared Euclidean distance between the center points of the predicted and ground-truth boxes, and c is the diagonal length of the smallest enclosing region that contains both boxes. α is a weighting function and v measures the consistency of the aspect ratios; α and v are given by:
α = v / ((1 − IOU) + v)
v = (4/π²) · (arctan(w^gt/h^gt) − arctan(w/h))²
The CIOU loss is:
Loss_CIOU = 1 − CIOU = 1 − IOU + ρ²(b, b^gt)/c² + αv
where w^gt and h^gt are the width and height of the ground-truth box, and w and h are the width and height of the predicted box.
DIOU-NMS replaces the original NMS, reducing missed detections caused by target occlusion and improving detection accuracy under occlusion.
In conventional NMS, the IOU is commonly used to suppress redundant detection boxes, but because the IOU analyses only the overlapping region, it easily produces wrong suppression when targets occlude each other, causing missed detections. To solve this, the invention adopts DIOU-NMS (Distance-IOU non-maximum suppression), which replaces the IOU in the NMS with the DIOU: the suppression criterion analyses the overlapping area and also the distance between the center points of two boxes, making it better suited to target detection under occlusion in road traffic scenes.
Assume the model outputs a candidate box set B with the corresponding category confidence set s; the classification scores are updated with respect to the highest-scoring prediction box M as follows:
s_i = s_i, if IOU − R_DIOU(M, B_i) < ε
s_i = 0, if IOU − R_DIOU(M, B_i) ≥ ε
R_DIOU denotes the center-distance penalty term used by DIOU-NMS; its inputs are M and the candidates B_i, where i indexes the set iterated over; s_i is the classification score and ε is the NMS threshold. When the value IOU − R_DIOU between the highest-scoring box M and a candidate B_i is small, the score s_i of B_i is kept; otherwise, when IOU − R_DIOU exceeds the NMS threshold, s_i is set to 0 and the box is filtered out. Two boxes whose centers are far apart may lie on different objects and therefore should not be deleted directly. By analysing both the IOU of the two boxes and the distance between their center points before deleting a candidate B_i, DIOU-NMS improves the accuracy of target detection under occlusion.
Hardswish is an activation function that uses a piecewise-linear approximation; it is numerically stable and fast to compute. The Hardswish activation function is defined as follows:
Hardswish(x) = 0, if x ≤ −3
Hardswish(x) = x, if x ≥ +3
Hardswish(x) = x·(x + 3)/6, otherwise
where x is the input value.
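The piecewise definition above written out directly in Python (PyTorch also provides this as nn.Hardswish):

```python
# Hardswish written from the piecewise definition above
def hardswish(x: float) -> float:
    if x <= -3.0:
        return 0.0
    if x >= 3.0:
        return x
    return x * (x + 3.0) / 6.0

print([hardswish(v) for v in (-4.0, -1.5, 0.0, 1.5, 4.0)])  # [0.0, -0.375, 0.0, 1.125, 4.0]
```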
Other details:
1. The classification loss uses a binary cross-entropy loss function.
2. The prior box scales are obtained by K-means clustering: the number of cluster centers is set to K, and the K centers are taken as the anchor boxes, so the important hyperparameter K is obtained automatically during training; these adaptive anchors are also a major feature of the algorithm (a sketch of the clustering is given after this list).
3. During optimization, L2 regularization is applied to the model parameters; L2 regularization decays the weights, and the decayed weights produce smaller outputs, so the network does not over-fit to individual values, which helps prevent over-fitting.
4. For the learning rate, the model's learning rate is adjusted in stages: a larger learning rate is used first so the model converges quickly and stabilizes faster, and a smaller learning rate is used once the model tends to be stable, slowing convergence, preventing over-fitting and keeping the deeper structure of the model stable.
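As referenced in item 2 above, a hedged sketch of K-means anchor extraction: the (width, height) pairs of the training boxes are clustered into K centers and the centers are used as anchor boxes. Plain Euclidean K-means is used here for brevity (an IoU-based distance is a common alternative), and K and the sample data are illustrative:

```python
# K-means anchor clustering sketch (Euclidean distance; illustrative data and K)
import numpy as np

def kmeans_anchors(wh: np.ndarray, k: int = 9, iters: int = 100, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    centres = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the nearest centre, then recompute the centres
        dists = np.linalg.norm(wh[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new = np.array([wh[labels == i].mean(axis=0) if np.any(labels == i) else centres[i]
                        for i in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return centres[np.argsort(centres.prod(axis=1))]   # sort anchors by area

boxes_wh = np.abs(np.random.default_rng(1).normal(loc=80, scale=40, size=(500, 2)))
print(kmeans_anchors(boxes_wh, k=9).round(1))
```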
The network structure of the improved YOLOv5 is shown in fig. 8.
The component structure of the improved YOLOv5 is shown in fig. 9.
A second part: binocular stereo vision
The disparity between the left and right camera images is computed based on binocular stereo vision to obtain a depth map; then, combining the target position, category and confidence from the two-dimensional image, the distance, spatial coordinates, physical size and other information of each detected target are computed, enriching perception information and improving reliability.
Binocular calibration provides the intrinsic parameters of the left camera and the right camera, and the rotation, translation, tangential distortion and radial distortion parameters between the two cameras.
Binocular rectification removes lens distortion and converts the stereo camera pair into the standard form, so that the two images of the same object have the same size and are row-aligned along the same horizontal line. Rectification mainly comprises 4 steps: the original images are input, the calibration parameters such as tangential and radial distortion are obtained and the distortion is removed, the images are rectified by the algorithm, and finally they are cropped to obtain images in the standard form.
Binocular stereo matching finds the corresponding points between the left and right camera images; stereo matching is based on the SGBM algorithm, short for Semi-Global Block Matching, a semi-global matching method. The flow mainly comprises 4 steps: first the matching cost between pixel pairs in the left and right images is computed; the larger the matching cost, the lower the probability that the two pixels are corresponding points; then cost aggregation and disparity computation are performed, and finally the disparity is optimized to generate the disparity map.
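A hedged OpenCV sketch of the SGBM matching flow described above: rectified left/right grayscale images in, disparity map out. All parameter values are illustrative defaults, not the patent's settings:

```python
# SGBM disparity sketch using OpenCV (illustrative parameters)
import cv2
import numpy as np

def compute_disparity(left_gray: np.ndarray, right_gray: np.ndarray) -> np.ndarray:
    matcher = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=128,          # must be divisible by 16
        blockSize=5,
        P1=8 * 5 * 5,                # smoothness penalties for small / large
        P2=32 * 5 * 5,               # disparity changes between neighbours
        uniquenessRatio=10,
        speckleWindowSize=100,
        speckleRange=2,
    )
    # OpenCV returns fixed-point disparities scaled by 16
    return matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0

# Usage (assumed file names); disparity can then be turned into depth with Z = f*T/d
# left = cv2.imread("left_rectified.png", cv2.IMREAD_GRAYSCALE)
# right = cv2.imread("right_rectified.png", cv2.IMREAD_GRAYSCALE)
# disp = compute_disparity(left, right)
```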
For binocular ranging, given the disparity map, baseline and focal length, the corresponding position in world coordinates, i.e. the distance Z, is computed by triangulation. The ranging principle is shown in fig. 10.
Disparity: d = x_l − x_r
By similar triangles:
(T − (x_l − x_r)) / (Z − f) = T / Z
Z = f·T/(x_l − x_r)
Z = f·T/d
where f is the focal length, i.e. the distance between the sensor and the lens; d is the disparity, i.e. the difference between the x-coordinate of the same spatial point at the left camera pixel (x_l, y_l) and the x-coordinate of its corresponding point (x_r, y_r) in the right camera; T is the distance between the lenses of the two cameras.
A third part: parsing the ultrasonic radar data
The ultrasonic radar measures short distances on the principle of emitting radar waves and receiving the echoes; after processing by the development board, the real-time distance of each radar probe is output as encoded data, which is decoded into real-time ranging information for each ultrasonic radar.
Considering measurement error, transmit power, sensitivity and other factors, the accepted data range is 30 to 250 cm: distances below 30 cm, or a reading of 0, are reported as 30 cm, and distances of 250 cm or more, or no detected object, are reported as 250 cm.
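A small sketch of this clamping rule; the function name is an assumption, and the actual frame decoding depends on the radar's protocol:

```python
# Clamp decoded ultrasonic readings to the 30-250 cm range described above
def clamp_ultrasonic_cm(raw_cm: float) -> float:
    if raw_cm <= 0 or raw_cm < 30:
        return 30.0      # below range, or zero reading
    if raw_cm > 250:
        return 250.0     # above range, or no object detected
    return raw_cm

print([clamp_ultrasonic_cm(v) for v in (0, 12, 87, 260)])  # [30.0, 30.0, 87, 250.0]
```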
The ultrasonic radar uses a 48-braid shielded signal cable and has strong interference resistance; it is powered by 12 V DC; the center frequency is 40 kHz; the transmitted sound pressure is greater than 105 dB @ 30 cm/10 V sine wave; the receiving sensitivity is greater than −82 dB/V/μbar; and the detection angle is 80°.
In the target detection method based on improved YOLOv5 and binocular stereo vision of the invention, an SE module is added to the backbone network of YOLOv5 to select channel-wise feature information and improve feature expression. To improve the detector's recognition effect and address the low vehicle recognition rate, an attention mechanism is fused into the detection network: the attention module CBAM is fused with the Neck of the YOLOv5 network, strengthening feature extraction and making the model focus more on the detected targets. The CIOU loss function replaces the original bounding-box regression loss, addressing the low localization accuracy and slow box regression during training. DIOU-NMS replaces the original NMS, reducing missed detections caused by occlusion and improving detection accuracy under occlusion. The disparity between the left and right camera images is computed based on binocular stereo vision to obtain a depth map; then, combining the target position, category and confidence from the two-dimensional image, the distance, spatial coordinates, physical size and other information of each detected target are computed, enriching perception information and improving reliability. Eight ultrasonic radars installed around the vehicle detect whether objects are present at close range, compensating for the close-range blind area of binocular stereo vision and improving safety. The actual effect is shown in figs. 11-14.
The above embodiments express only several embodiments of the invention, and although their description is specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several variations and improvements without departing from the inventive concept, and these all fall within the protection scope of the invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A target detection method based on improved YOLOv5 and binocular stereo vision is characterized by comprising the following steps:
adding an SE module to the backbone network of YOLOv5, fusing the attention module CBAM with the Neck part of the YOLOv5 network, replacing the original detection-frame regression loss function with the CIOU loss function, replacing the original NMS with DIOU-NMS, replacing the original SiLU activation function after the convolution operations with the Hardswish activation function, training the improved YOLOv5 network, and performing target detection based on the improved YOLOv5 model;
calculating the disparity between the left and right camera images based on binocular stereo vision to obtain a depth map; then, combining the target position, category and confidence in the two-dimensional image, calculating the distance, spatial coordinates and physical size of the detected target, thereby enriching the perception information and improving reliability;
installing ultrasonic radars around the vehicle and analyzing the ultrasonic radar data to detect whether objects are present at close range, thereby compensating for the short-range blind zone of binocular stereo vision;
performing fusion calculation and outputting a perception result, wherein the perception result comprises: type, confidence, three-dimensional coordinates, and physical dimensions (length and width).
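As a rough illustration of the fused perception result listed in claim 1, the sketch below bundles the output fields into one structure; the field names, types and units are assumptions introduced only for readability, not the patent's data format.

```python
# Hedged sketch of the fused perception output (type, confidence, 3-D
# coordinates, physical size). Field names are illustrative assumptions.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class PerceptionResult:
    category: str                      # object type from the improved YOLOv5 detector
    confidence: float                  # detection confidence
    xyz_m: Tuple[float, float, float]  # 3-D coordinates from binocular triangulation (meters)
    size_wh_m: Tuple[float, float]     # physical width and height estimated from the depth map
    near_obstacle: bool                # short-range flag from the ultrasonic radars

result = PerceptionResult("car", 0.91, (1.2, 0.1, 8.5), (1.8, 1.5), near_obstacle=False)
print(result)
```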
2. The target detection method based on improved YOLOv5 and binocular stereo vision according to claim 1, wherein YOLOv5 is divided into four parts: the Input end, the Backbone network, the Neck network and the Head output end;
the Input end performs data preprocessing, including Mosaic data enhancement and adaptive image filling; in addition, to accommodate different data sets, YOLOv5 integrates adaptive anchor-box calculation at the Input end, so that the initial anchor-box sizes are set automatically when the data set is changed;
the Backbone network extracts features of different levels from the image through deep convolution operations, using the bottleneck cross-stage partial structure BottleneckCSP and spatial pyramid pooling SPP; the purpose of BottleneckCSP is to reduce the amount of computation and increase inference speed, while the purpose of SPP is to perform feature extraction at different scales on the same feature map, which helps improve detection accuracy;
the Neck network comprises a feature pyramid network FPN and a path aggregation network PAN; the FPN propagates semantic information from top to bottom through the network, while the PAN propagates localization information from bottom to top, fusing information from different layers of the Backbone and further improving detection capability;
the Head output end is used as a final detection part for predicting targets with different sizes on feature maps with different sizes.
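As an illustrative sketch of the adaptive anchor-box calculation mentioned for the Input end, the snippet below clusters ground-truth box sizes with k-means; the use of scikit-learn, the function names and the example data are assumptions for illustration, not the patent's implementation.

```python
# Hedged sketch: estimate anchor sizes by clustering labelled (width, height)
# pairs, in the spirit of the adaptive anchor computation run at the Input end.
import numpy as np
from sklearn.cluster import KMeans

def estimate_anchors(box_wh: np.ndarray, n_anchors: int = 9) -> np.ndarray:
    """Cluster (width, height) pairs of labelled boxes into n_anchors anchor sizes."""
    km = KMeans(n_clusters=n_anchors, n_init=10, random_state=0).fit(box_wh)
    centers = km.cluster_centers_
    # Sort anchors by area so they can be assigned to the small/medium/large heads
    return centers[np.argsort(centers.prod(axis=1))]

# Example with random box sizes (pixels)
wh = np.random.uniform(10, 300, size=(500, 2))
print(estimate_anchors(wh))
```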
3. The target detection method based on improved YOLOv5 and binocular stereo vision according to claim 1, wherein adding an SE module to a YOLOv5 backbone network specifically comprises:
the SE module uses a one-dimensional vector with as many elements as there are channels as the evaluation score of each channel, and applies these scores to the corresponding channels to weight the output feature maps; the SE module learns the correlation among channels, screens out the channel-wise feature information and improves the feature expression capability;
adding an SE module behind an SPP module in a backbone network;
let the input be W × H × C image data, where W is the image width, H is the image height and C is the number of channels;
the SE module first performs a Squeeze operation on the input, which is a global average pooling; after the global average pooling the feature map becomes a 1 × 1 × C vector;
an Excitation operation is then performed on the 1 × 1 × C vector: through a fully connected layer, the 1 × 1 × C input vector becomes 1 × 1 × (C × SERatio), where SERatio is a scaling factor that reduces the number of channels and the amount of computation; after an activation function, a second fully connected layer and another activation function, a 1 × 1 × C vector is obtained again;
finally a Scale operation is performed: its inputs are the Excitation output 1 × 1 × C and the W × H × C input of the whole module; the per-channel weights are multiplied with the corresponding channels of the input feature map, so that the SE module outputs a feature map in which each channel of the input is weighted by its corresponding weight value.
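A minimal PyTorch sketch of an SE block following the Squeeze, Excitation and Scale steps of claim 3 (global average pooling, two fully connected layers with a reduction ratio corresponding to SERatio, and channel-wise rescaling); this is an illustrative re-implementation under those assumptions, not the patent's code.

```python
# Hedged SE block sketch; names and the default ratio are assumptions.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, se_ratio: float = 0.25):
        super().__init__()
        hidden = max(1, int(channels * se_ratio))   # reduced channel count
        self.pool = nn.AdaptiveAvgPool2d(1)          # Squeeze: W x H x C -> 1 x 1 x C
        self.fc = nn.Sequential(                     # Excitation: C -> C*SERatio -> C
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)                  # per-channel statistics
        w = self.fc(w).view(b, c, 1, 1)              # per-channel weights in (0, 1)
        return x * w                                 # Scale: reweight each channel

x = torch.randn(1, 256, 20, 20)
print(SEBlock(256)(x).shape)   # torch.Size([1, 256, 20, 20])
```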
4. The method for detecting targets based on improved YOLOv5 and binocular stereovision according to claim 1, wherein the fusing of the attention module CBAM with the Neck part of the YOLOv5 network is specifically:
in a CNN, an attention mechanism is applied to the feature map to acquire the attention information available in it, including spatial attention and channel attention information; the convolutional block attention module CBAM attends to spatial and channel information at the same time, and reconstructs the intermediate feature maps of the network through the channel attention module CAM and the spatial attention module SAM, thereby emphasizing important features and suppressing ordinary ones so as to improve the target detection effect;
the feature map obtained from the network is input to the CBAM module, which is divided into two parts: the input feature map is first convolved and then sent to the channel attention module inside the CBAM, and finally the spatial attention module adjusts the features of the input feature map to obtain the output of the whole module;
in the convolution operations of YOLOv5, a layer outputs a three-dimensional feature map F with C channels, height H and width W, i.e. F ∈ R^(C×H×W); CBAM sequentially infers a one-dimensional channel attention Mc and a two-dimensional spatial attention Ms from the feature map F, and multiplies them element by element with F to obtain the output feature map along the channel dimension;
let Mc(F) denote the channel attention reconstruction of the feature map by the CAM module, which outputs the feature map F'; let Ms(F') denote the spatial attention reconstruction of the channel attention output F' by the SAM module;
where ⊗ denotes element-by-element multiplication; this is formulated as follows:
F' = Mc(F) ⊗ F
F'' = Ms(F') ⊗ F'
the channel attention module CAM performs max pooling and average pooling on each channel of the input feature map F at the same time, passes the resulting intermediate vectors through a multilayer perceptron MLP, adds the MLP outputs element by element, and applies a Sigmoid activation; the scaling factor obtained from the Sigmoid is multiplied with the input feature map to give the output of the channel attention module, yielding the channel attention feature map F'; to reduce the amount of computation, the MLP is designed with only one hidden layer;
the spatial attention module SAM performs max pooling and average pooling on the channel attention output F' along the channel direction, concatenates the two pooled outputs, obtains the scaling factor of the spatial attention module through a convolution and a Sigmoid activation function, and multiplies this scaling factor with the output of the channel attention module to give the output of the spatial attention module, yielding the spatial attention feature map F'';
finally, the outputs F' and F'' of the two sub-modules and the input of the CBAM module are added together to obtain the new features output by the whole CBAM module;
the most important role of the attention mechanism is to highlight important information in the feature map and suppress ordinary information; in the YOLOv5 network the most critical part of feature extraction lies in the Backbone network, so the CBAM module is fused at the output of the Backbone, before the features of the Neck network are fused; this design lets feature extraction be completed in the Backbone, lets predictions be output on different feature maps after the Neck feature fusion, and lets the attention reconstruction performed by the CBAM module serve as a bridge between the two stages;
a CBAM module is added before each of the three Neck feature-fusion branches, so that important information in the feature maps is highlighted and, after further feature extraction, predictions are output on the different feature maps, achieving the goal of improving the target detection effect.
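A hedged PyTorch sketch of a CBAM module consistent with claim 4: channel attention from max- and average-pooled descriptors passed through a single-hidden-layer MLP, followed by spatial attention from channel-wise max/avg maps. The additional residual combination of F', F'' and the module input described above is omitted here for brevity; all names and parameter values are illustrative assumptions.

```python
# Hedged CBAM sketch (CAM followed by SAM); not the patent's exact code.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        hidden = max(1, channels // reduction)
        # CAM: shared MLP with a single hidden layer
        self.mlp = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(inplace=True),
                                 nn.Linear(hidden, channels))
        # SAM: 2-channel (max, avg) map -> 1-channel spatial weight
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f.shape
        # Channel attention Mc(F): MLP(maxpool) + MLP(avgpool), then Sigmoid
        mx = self.mlp(torch.amax(f, dim=(2, 3)))
        av = self.mlp(torch.mean(f, dim=(2, 3)))
        mc = torch.sigmoid(mx + av).view(b, c, 1, 1)
        f1 = f * mc                                   # F' = Mc(F) (x) F
        # Spatial attention Ms(F'): conv over channel-wise max/avg maps, then Sigmoid
        sp = torch.cat([torch.amax(f1, dim=1, keepdim=True),
                        torch.mean(f1, dim=1, keepdim=True)], dim=1)
        ms = torch.sigmoid(self.spatial(sp))
        return f1 * ms                                # F'' = Ms(F') (x) F'

print(CBAM(128)(torch.randn(1, 128, 40, 40)).shape)  # torch.Size([1, 128, 40, 40])
```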
5. The target detection method based on the improved YOLOv5 and the binocular stereo vision as claimed in claim 1, wherein the replacing of the original detection frame regression loss function with the CIOU loss function is specifically as follows:
in the aspect of a loss function, a CIOU loss function is adopted for regression of frame information; the IOU loss function considers the overlapping area of the detection frame and the target frame; the GIOU loss function solves the problem when the boundary frames are not overlapped on the basis of the IOU; the DIOU loss function considers the information of the center distance of the bounding box on the basis of the IOU; the CIOU loss function considers the scale information of the width-to-height ratio of the bounding box on the basis of the DIOU;
GIOU first computes the area Ac of the minimum enclosing box that simultaneously contains the prediction box and the ground-truth box; the IOU is obtained from the intersection over the union of the two bounding boxes; next, the proportion of the enclosing box that is not covered by the union area U of the two boxes, i.e. (Ac − U)/Ac, is computed; finally this proportion is subtracted from the initial IOU to obtain GIOU, expressed as follows:
GIOU = IOU − (Ac − U) / Ac
Loss_GIOU = 1 − GIOU
Loss_GIOU is the GIOU loss function; GIOU is symmetric, with a value range of [−1, 1]; it takes the maximum value 1 when the two boxes coincide exactly, and approaches the minimum value −1 when the two boxes do not intersect and are infinitely far apart; GIOU therefore attends not only to the overlapping region but also to the non-overlapping regions, and better reflects the degree of overlap between the two boxes; however, as a bounding-box regression loss, GIOU does not take into account the distance between the box centers or the aspect-ratio scale information of the bounding boxes;
the CIOU can simultaneously consider the overlapping area of the detection frame and the target frame, the center distance of the boundary frame and the width-height ratio of the boundary frame, accelerate the regression speed of the target detection frame in the training process and improve the positioning precision of the boundary frame; the CIOU formula is as follows:
CIOU = IOU − ρ²(b, b^gt) / c² − αv
where b is the predicted bounding box, b^gt is the ground-truth bounding box, and ρ²(b, b^gt) is the squared Euclidean distance between the center points of the prediction box and the ground-truth box; c is the diagonal length of the minimum enclosing region that simultaneously contains the prediction box and the ground-truth box; α is a weighting function and v measures the consistency of the aspect ratios; the formulas for α and v are as follows:
α = v / ((1 − IOU) + v)
v = (4 / π²) · (arctan(w^gt / h^gt) − arctan(w / h))²
the CIOU loss equation is as follows:
Loss_CIOU = 1 − CIOU = 1 − IOU + ρ²(b, b^gt) / c² + αv
where w^gt is the width of the ground-truth bounding box, h^gt is the height of the ground-truth bounding box, w is the width of the predicted bounding box, and h is the height of the predicted bounding box.
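A hedged sketch of the CIOU loss defined above, for axis-aligned boxes given in (x1, y1, x2, y2) format; the tensor layout and the small epsilon terms are assumptions added for numerical stability rather than part of the patent.

```python
# Hedged CIOU loss sketch: IOU term, normalized center-distance term rho^2/c^2,
# and aspect-ratio term alpha*v, as in the formulas above.
import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # Intersection and union for the IOU term
    x1 = torch.max(pred[..., 0], target[..., 0]); y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2]); y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared center distance rho^2 and squared diagonal c^2 of the enclosing box
    cpx = (pred[..., 0] + pred[..., 2]) / 2; cpy = (pred[..., 1] + pred[..., 3]) / 2
    ctx = (target[..., 0] + target[..., 2]) / 2; cty = (target[..., 1] + target[..., 3]) / 2
    rho2 = (cpx - ctx) ** 2 + (cpy - cty) ** 2
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term v and its weight alpha
    w_p = pred[..., 2] - pred[..., 0]; h_p = pred[..., 3] - pred[..., 1]
    w_t = target[..., 2] - target[..., 0]; h_t = target[..., 3] - target[..., 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v

print(ciou_loss(torch.tensor([[0., 0., 10., 10.]]), torch.tensor([[2., 2., 12., 12.]])))
```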
6. The method for object detection based on improved YOLOv5 and binocular stereo vision according to claim 1, wherein replacing the original NMS with DIOU-NMS is specifically:
DIOU-NMS modifies the IOU in NMS into DIOU: the suppression criterion analyzes not only the overlapping area but also the distance between the center points of the two rectangular boxes, which makes it better suited to target detection under occlusion in road traffic scenes;
assuming the set of candidate boxes detected by the model is B and the corresponding set of category confidences is S, the classification scores are updated with respect to the highest-scoring prediction box M according to the following formula:
s_i = s_i,  if IOU − R_DIOU(M, B_i) < ε
s_i = 0,   if IOU − R_DIOU(M, B_i) ≥ ε
R_DIOU refers to the penalty term used in the DIOU-NMS computation; its inputs are s_i and B_i, where i indexes the i-th element of the set during the iterative computation; s_i is the classification score and ε is the NMS threshold; when the IOU − R_DIOU value between the highest-scoring prediction box M and another box B_i is relatively small, the score s_i of B_i is retained; otherwise, when IOU − R_DIOU exceeds the NMS threshold, s_i is set to 0, i.e. the box is filtered out; two rectangular boxes whose center points are far apart may lie on different objects and therefore should not be deleted directly; by analyzing both the IOU of the two rectangular boxes and the distance between their center points before deleting a candidate box B_i, DIOU-NMS improves the accuracy of target detection under occlusion.
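A hedged NumPy sketch of DIOU-NMS as described in claim 6: the greedy NMS loop is unchanged, but the suppression test uses IOU minus the normalized center-distance penalty R_DIOU; variable names and the threshold value are illustrative assumptions.

```python
# Hedged DIOU-NMS sketch; boxes are (N, 4) in x1, y1, x2, y2 format.
import numpy as np

def diou_nms(boxes: np.ndarray, scores: np.ndarray, eps_thr: float = 0.45) -> list:
    """Greedy NMS where a box is suppressed only if IOU - R_DIOU >= eps_thr."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        m = order[0]; keep.append(int(m)); rest = order[1:]
        if rest.size == 0:
            break
        # IOU between the top box M and the remaining boxes Bi
        x1 = np.maximum(boxes[m, 0], boxes[rest, 0]); y1 = np.maximum(boxes[m, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[m, 2], boxes[rest, 2]); y2 = np.minimum(boxes[m, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
        iou = inter / (area(boxes[m]) + area(boxes[rest]) - inter + 1e-7)
        # R_DIOU: squared center distance over squared enclosing-box diagonal
        cm = (boxes[m, :2] + boxes[m, 2:]) / 2; cr = (boxes[rest, :2] + boxes[rest, 2:]) / 2
        rho2 = ((cm - cr) ** 2).sum(axis=1)
        cw = np.maximum(boxes[m, 2], boxes[rest, 2]) - np.minimum(boxes[m, 0], boxes[rest, 0])
        ch = np.maximum(boxes[m, 3], boxes[rest, 3]) - np.minimum(boxes[m, 1], boxes[rest, 1])
        c2 = cw ** 2 + ch ** 2 + 1e-7
        # Keep Bi whenever IOU - R_DIOU stays below the NMS threshold epsilon
        order = rest[(iou - rho2 / c2) < eps_thr]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
print(diou_nms(boxes, np.array([0.9, 0.8, 0.7])))   # -> [0, 2]
```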
7. The target detection method based on improved YOLOv5 and binocular stereo vision according to claim 1, wherein the Hardswish activation function is defined as follows:
Hardswish(x) = 0,              x ≤ −3
Hardswish(x) = x,              x ≥ +3
Hardswish(x) = x(x + 3) / 6,   otherwise
where x is the input value.
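A minimal sketch of this piecewise Hardswish definition; note that PyTorch also ships torch.nn.Hardswish, which implements the same function, so an off-the-shelf layer could be used instead.

```python
# Hardswish sketch: 0 for x <= -3, x for x >= 3, x*(x+3)/6 in between.
import torch

def hardswish(x: torch.Tensor) -> torch.Tensor:
    return x * torch.clamp(x + 3, 0, 6) / 6

x = torch.tensor([-4.0, -1.0, 0.0, 2.0, 5.0])
print(hardswish(x))             # 0, -1/3, 0, 5/3, 5
print(torch.nn.Hardswish()(x))  # same values from the built-in layer
```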
8. The target detection method based on improved YOLOv5 and binocular stereo vision according to claim 1, wherein the intrinsic parameters of the left camera, the intrinsic parameters of the right camera, the rotation and translation parameters between the two cameras, and the tangential and radial distortion coefficients are obtained through binocular calibration;
through binocular rectification, lens distortion is eliminated and the stereo camera pair is converted into the standard form, so that the two images of the same object have the same size and are aligned on the same horizontal line; binocular rectification mainly comprises four parts: first the original images are input; then the calibration parameters such as tangential and radial distortion are used to remove distortion; binocular rectification is performed by the algorithm; and finally the images are cropped to obtain images in the standard form.
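A hedged OpenCV sketch of this rectification step, assuming the calibration outputs (intrinsics K1/K2, distortion coefficients D1/D2, rotation R and translation T) are already available, e.g. from cv2.stereoCalibrate; the variable names and the choice of OpenCV routines are assumptions rather than the patent's exact procedure.

```python
# Hedged binocular rectification sketch with OpenCV.
import cv2
import numpy as np

def rectify_pair(img_l, img_r, K1, D1, K2, D2, R, T):
    h, w = img_l.shape[:2]
    # Compute rectification transforms that put both views on the same epipolar lines
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, D1, K2, D2, (w, h), R, T)
    map1l, map2l = cv2.initUndistortRectifyMap(K1, D1, R1, P1, (w, h), cv2.CV_32FC1)
    map1r, map2r = cv2.initUndistortRectifyMap(K2, D2, R2, P2, (w, h), cv2.CV_32FC1)
    # Remapping removes lens distortion and aligns the pair into the standard form
    rect_l = cv2.remap(img_l, map1l, map2l, cv2.INTER_LINEAR)
    rect_r = cv2.remap(img_r, map1r, map2r, cv2.INTER_LINEAR)
    return rect_l, rect_r, Q   # Q can later convert disparity into 3-D coordinates
```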
9. The target detection method based on improved YOLOv5 and binocular stereo vision according to claim 1, wherein binocular stereo matching finds the corresponding points between the left and right camera images and is performed with the SGBM semi-global matching algorithm; first the matching cost is calculated, i.e. the matching cost between two pixels of the left and right images; the larger the matching cost, the lower the possibility that the two pixels are corresponding points; cost aggregation and disparity computation are then carried out, and finally the disparity is optimized to generate a disparity map.
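A hedged OpenCV sketch of SGBM-based stereo matching; the matching-cost, cost-aggregation and disparity-computation steps are handled inside cv2.StereoSGBM_create, and the parameter values shown are illustrative assumptions rather than the patent's settings.

```python
# Hedged SGBM stereo matching sketch on a rectified grayscale pair.
import cv2

def compute_disparity(rect_l_gray, rect_r_gray, num_disp=128, block_size=5):
    sgbm = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=num_disp,        # must be divisible by 16
        blockSize=block_size,
        P1=8 * block_size ** 2,         # smoothness penalty for small disparity changes
        P2=32 * block_size ** 2,        # smoothness penalty for large disparity changes
        uniquenessRatio=10,
        speckleWindowSize=100,
        speckleRange=2,
    )
    # OpenCV returns fixed-point disparity scaled by 16
    return sgbm.compute(rect_l_gray, rect_r_gray).astype("float32") / 16.0
```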
10. The target detection method based on improved YOLOv5 and binocular stereo vision according to claim 1, wherein binocular distance measurement is performed: given the disparity map, the baseline and the focal length, the corresponding position in world coordinates is calculated through triangulation, i.e. the distance Z is obtained;
disparity: d = x_l − x_r
by the similar-triangles principle:
(T − (x_l − x_r)) / (Z − f) = T / Z
Z = f · T / (x_l − x_r)
Z = f · T / d
where f is the focal length, i.e. the distance from the sensor to the lens; d is the disparity, i.e. the difference between the x-coordinate of the pixel (x_l, y_l) of a spatial point in the left camera and the x-coordinate of the corresponding point (x_r, y_r) in the right camera; and T is the distance between the lenses of the two cameras.
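A minimal sketch applying the triangulation relation Z = f · T / d to a disparity map; units, names and the handling of invalid disparities are assumptions (cv2.reprojectImageTo3D with the Q matrix from rectification would compute the same quantity for the full image).

```python
# Hedged depth-from-disparity sketch using Z = f * T / d.
import numpy as np

def depth_from_disparity(disparity_px: np.ndarray, f_px: float, baseline_m: float) -> np.ndarray:
    """Convert a disparity map (pixels) into depth Z (same units as the baseline)."""
    d = np.where(disparity_px > 0, disparity_px, np.nan)   # mark invalid disparities as NaN
    return f_px * baseline_m / d

# Example: f = 700 px, baseline = 0.12 m, disparity = 10.5 px  ->  Z = 8 m
print(depth_from_disparity(np.array([10.5]), 700.0, 0.12))   # [8.]
```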
CN202210055550.8A 2022-01-18 2022-01-18 Target detection method based on improved YOLOv5 and binocular stereo vision Pending CN114565900A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210055550.8A CN114565900A (en) 2022-01-18 2022-01-18 Target detection method based on improved YOLOv5 and binocular stereo vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210055550.8A CN114565900A (en) 2022-01-18 2022-01-18 Target detection method based on improved YOLOv5 and binocular stereo vision

Publications (1)

Publication Number Publication Date
CN114565900A true CN114565900A (en) 2022-05-31

Family

ID=81711083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210055550.8A Pending CN114565900A (en) 2022-01-18 2022-01-18 Target detection method based on improved YOLOv5 and binocular stereo vision

Country Status (1)

Country Link
CN (1) CN114565900A (en)


Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114998932A (en) * 2022-06-10 2022-09-02 哈工大机器人集团股份有限公司 Pedestrian detection method and system based on YOLOv4
CN115063691A (en) * 2022-07-04 2022-09-16 西安邮电大学 Small target detection method based on feature enhancement under complex scene
CN115063691B (en) * 2022-07-04 2024-04-12 西安邮电大学 Feature enhancement-based small target detection method in complex scene
CN115590584B (en) * 2022-09-06 2023-11-14 汕头大学 Hair follicle taking control method and system based on mechanical arm
CN115590584A (en) * 2022-09-06 2023-01-13 汕头大学(Cn) Hair follicle hair taking control method and system based on mechanical arm
CN115620153A (en) * 2022-12-16 2023-01-17 成都理工大学 Method and device for grading surface acoustic wave mill of track
CN116071309A (en) * 2022-12-27 2023-05-05 中国电子产品可靠性与环境试验研究所((工业和信息化部电子第五研究所)(中国赛宝实验室)) Method, device, equipment and storage medium for detecting sound scanning defect of component
CN116071309B (en) * 2022-12-27 2024-05-17 中国电子产品可靠性与环境试验研究所((工业和信息化部电子第五研究所)(中国赛宝实验室)) Method, device, equipment and storage medium for detecting sound scanning defect of component
CN115908791A (en) * 2023-01-06 2023-04-04 北京铸正机器人有限公司 Pharynx swab sampling method and device
CN116246282A (en) * 2023-02-10 2023-06-09 青海师范大学 Scene Tibetan detection method based on improved double-attention YOLOv7
CN116245732A (en) * 2023-03-13 2023-06-09 江南大学 Yolov 5-based small-target reflective garment identification and detection method
CN116385810A (en) * 2023-06-05 2023-07-04 江西农业大学 Yolov 7-based small target detection method and system
CN116385810B (en) * 2023-06-05 2023-08-15 江西农业大学 Yolov 7-based small target detection method and system
CN116740334B (en) * 2023-06-23 2024-02-06 河北大学 Unmanned aerial vehicle intrusion detection positioning method based on binocular vision and improved YOLO
CN116740334A (en) * 2023-06-23 2023-09-12 河北大学 Unmanned aerial vehicle intrusion detection positioning method based on binocular vision and improved YOLO
CN117017276A (en) * 2023-10-08 2023-11-10 中国科学技术大学 Real-time human body tight boundary detection method based on millimeter wave radar
CN117036363A (en) * 2023-10-10 2023-11-10 国网四川省电力公司信息通信公司 Shielding insulator detection method based on multi-feature fusion
CN117036363B (en) * 2023-10-10 2024-01-30 国网四川省电力公司信息通信公司 Shielding insulator detection method based on multi-feature fusion
CN117557911A (en) * 2023-12-15 2024-02-13 哈尔滨工业大学(威海) Target perception method and system based on multi-sensor image result fusion
CN117689731B (en) * 2024-02-02 2024-04-26 陕西德创数字工业智能科技有限公司 Lightweight new energy heavy-duty battery pack identification method based on improved YOLOv model
CN117689731A (en) * 2024-02-02 2024-03-12 陕西德创数字工业智能科技有限公司 Lightweight new energy heavy-duty truck battery pack identification method based on improved YOLOv5 model
CN118072148A (en) * 2024-04-25 2024-05-24 深圳市威远精密技术有限公司 Precise ball screw pair detection system and method thereof
CN118163880A (en) * 2024-05-14 2024-06-11 中国海洋大学 Building disease detection quadruped robot and detection method
CN118163880B (en) * 2024-05-14 2024-07-30 中国海洋大学 Building disease detection quadruped robot and detection method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination