CN116958910A - Attention mechanism-based multi-task traffic scene detection algorithm - Google Patents


Info

Publication number
CN116958910A
CN116958910A (application CN202310696843.9A)
Authority
CN
China
Prior art keywords
module
feature
convolution
network
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310696843.9A
Other languages
Chinese (zh)
Inventor
曲建创
李成莪
王雪
Current Assignee
Cccc Huakong Tianjin Technology Development Co ltd
Original Assignee
Cccc Huakong Tianjin Technology Development Co ltd
Priority date
Filing date
Publication date
Application filed by Cccc Huakong Tianjin Technology Development Co ltd
Priority to CN202310696843.9A
Publication of CN116958910A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/54Surveillance or monitoring of activities, e.g. for recognising suspicious objects of traffic, e.g. cars on the road, trains or boats
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0499Feedforward networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/52Scale-space analysis, e.g. wavelet analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/588Recognition of the road, e.g. of lane markings; Recognition of the vehicle driving pattern in relation to the road
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a multi-task traffic scene detection algorithm based on an attention mechanism, comprising a shared encoder and three decoders. The shared encoder consists of a backbone network and a neck network; the three decoders perform traffic target detection, drivable-area detection and lane line detection, respectively. The invention fuses, for the first time, a large-kernel-convolution attention mechanism with the ELAN structure proposed in YOLOv7 to form a completely new backbone network, and combines the large-kernel attention mechanism with a multi-scale information fusion mechanism to provide a segmentation enhancement module for the lane line segmentation task; the module is inserted after the backbone network and before the lane line segmentation head, further improving the accuracy of the lane line detection task.

Description

Attention mechanism-based multi-task traffic scene detection algorithm
Technical Field
The invention relates to the technical field of multi-task traffic scene detection, in particular to a multi-task traffic scene detection algorithm based on an attention mechanism.
Background
In the past decade, tremendous advances have been made in computer vision and deep learning, but vision-based tasks (e.g., vehicle object detection, drivable-area detection, lane line detection) remain challenging in low-cost, high-precision traffic scene applications. In recent years, methods based on multi-task learning have shown excellent performance on traffic scene perception problems, providing a high-precision and high-efficiency solution. Object detection plays an important role in providing the position and size of traffic obstacles, helping autonomous vehicles and road-monitoring personnel make accurate and timely decisions; drivable-area detection and lane line segmentation provide rich information for route planning and driving safety. Research on traffic target detection, drivable-area detection and lane line detection is therefore critical.
Each of these three tasks has its own representative networks, including but not limited to the SSD, R-CNN and YOLO series for object detection; networks such as ENet and PSPNet for drivable-area detection; and networks such as SAD-ENet and SCNN for lane line segmentation. Although these networks individually handle traffic target detection, drivable-area segmentation and lane line segmentation well, passing images sequentially through three cascaded networks introduces considerable latency and prolongs task processing time.
Chinese patent application CN202211141578.X provides a multi-task panoramic driving perception method and system based on improved YOLOv5. The images in the dataset are first preprocessed to obtain input images; features are then extracted from the input image with an improved YOLOv5 backbone network, obtained by replacing the C3 module in the YOLOv5 backbone with an inverted residual bottleneck module; the resulting feature map is passed to a neck network and fused with the feature map from the backbone; the fused feature map is fed to a detection head for traffic target detection; and the neck feature map is fed to a branch network for lane line detection and drivable-area segmentation.
However, that traffic scene detection algorithm is based on a plain convolutional neural network and lacks an attention mechanism, so the network cannot focus on the important parts of its input; moreover, conventional networks rely on small convolution kernels, whose small receptive fields cannot capture the global information of an object, limiting the algorithm's performance.
Disclosure of Invention
The invention aims to further improve the accuracy of multi-task traffic scene perception models, and provides a multi-task traffic scene detection algorithm based on an attention mechanism.
The invention adopts the following technical scheme to achieve this aim:
a multitasking traffic scene detection algorithm based on attention mechanism, comprising a shared encoder and three decoders; the shared encoder consists of a backbone network and a neck network; the three decoders respectively finish the traffic target, the drivable area and the lane line detection tasks;
the backbone network extracts features from the input image and comprises a convolution module, a feature extraction module and a downsampling module. The convolution module consists of a Conv convolution layer, a BatchNorm batch-normalization layer and a SiLU activation function; the feature extraction module fuses a large-kernel attention mechanism with the ELAN structure and is used to build the backbone; the downsampling module adds a downsampling layer alongside the convolution module to form two branches whose outputs are fused by concatenation along the channel dimension, so the output feature map has twice the channels of the input and 1/2 its height and width;
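As a concrete illustration, the convolution (CBS) and downsampling (MP) modules described above can be sketched in PyTorch. The kernel sizes and the exact branch layout of MP are assumptions (the text only fixes the channel doubling and the 1/2 spatial reduction); this is a minimal sketch, not the patented implementation:

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    # Conv + BatchNorm + SiLU, as described for the convolution module
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class MP(nn.Module):
    # Downsampling module: a max-pool branch and a strided-conv branch,
    # concatenated along the channel dimension, so the output has 2x the
    # input channels and 1/2 the spatial size.
    def __init__(self, c_in):
        super().__init__()
        self.branch1 = nn.Sequential(nn.MaxPool2d(2, 2), CBS(c_in, c_in, k=1))
        self.branch2 = nn.Sequential(CBS(c_in, c_in, k=1), CBS(c_in, c_in, k=3, s=2))

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch2(x)], dim=1)
```

Concatenating a max-pool branch with a strided-convolution branch is one common way to realize "two branches fused along the channel dimension"; the patent drawings may differ in detail.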
the neck network comprises a cross-stage spatial pyramid pooling module, a feature pyramid network and a path aggregation network. The cross-stage spatial pyramid pooling module expands the receptive field, fuses information from feature maps of different scales and completes feature fusion. During feature propagation, deep feature maps carry strong semantic features but weak positional information, while shallow feature maps carry strong positional information but weak semantic features; the feature pyramid network propagates deep semantic features to the shallow layers, strengthening semantic expression at multiple scales, and the path aggregation network propagates shallow positional information to the deep layers, strengthening localization at multiple scales;
the detection pipeline is as follows. The network input is a 640×640×3 RGB image. It first passes through the convolution modules for feature transfer: the 2nd and 4th convolution modules each halve the height and width, so the resulting feature map is 1/4 the input size. The feature map then enters the feature extraction and downsampling modules; after three downsampling stages its height and width shrink from 1/4 to 1/32 of the original image. The extracted feature maps are sent to the neck network for multi-scale feature fusion. The traffic target detection branch passes feature maps to three detection heads of different sizes, finally outputting feature maps of sizes (W/8, H/8, 256), (W/16, H/16, 512) and (W/32, H/32, 1024). The drivable-area and lane line detection branches both take input of size (W/8, H/8, 128). The drivable-area branch comprises a BottleneckCSP module and three upsampling stages for feature extraction, after which the output feature map has the same spatial size (W, H) as the network input and yields the drivable-area segmentation. The lane line branch is similar, except that a segmentation enhancement module is inserted before its segmentation head to strengthen semantic information and improve lane line segmentation.
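The scale arithmetic above can be checked with a few lines of plain Python; `feature_sizes` is a hypothetical helper for illustration, not part of the patent:

```python
# Five stride-2 stages take a 640x640 input to 1/32 scale: two in the
# early convolution modules (reaching 1/4) and three in the downsampling
# modules (reaching 1/32).
def feature_sizes(w, h, strides=(2, 2, 2, 2, 2)):
    sizes = []
    for s in strides:
        w, h = w // s, h // s
        sizes.append((w, h))
    return sizes

sizes = feature_sizes(640, 640)
p3, p4, p5 = sizes[2], sizes[3], sizes[4]  # 1/8, 1/16, 1/32 scales
```

For W = H = 640 this gives 80×80, 40×40 and 20×20 maps for the three detection heads, matching (W/8, H/8), (W/16, H/16) and (W/32, H/32).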
A YOLOv7 backbone network is selected as the basic network structure, and its original ELAN structure is replaced by the feature extraction module to construct an improved YOLOv7 backbone network.
The ELAN structure is an efficient layer aggregation network that improves the network's learning capacity without destroying the original gradient path, and learns more diversified features by guiding the computation blocks of different feature groups. Because the LKA mechanism combines the advantage of self-attention in capturing long-range dependencies with the advantage of convolution in exploiting local context, it is fused with the ELAN structure of YOLOv7 to form the feature extraction module;
the feature extraction Module comprises four convolution modules and two LKA-Module layers, wherein the input sequentially passes through the two convolution modules, the two convolution modules and the LKA-Module layer cascade structure, feature graphs with the number of channels o=i/2 are respectively output, wherein o is the number of output channels, i is the number of input channels, and finally dimension addition is carried out on the output feature graphs.
The feature extraction module takes two forms: in one, the number of output channels of the first two convolution modules is 1/2 of the number of input channels, and the last two convolution modules keep the channel count unchanged; in the other, the first two convolution modules output 1/4 of the input channels, and the last two again keep the channel count unchanged.
The LKA-Module layer comprises a BatchNorm batch-normalization layer together with the attention module and the feed-forward network module of a Transformer structure, cascaded to perform feature extraction. To prevent gradient explosion and accelerate model convergence, the input feature map is first batch-normalized and then passes through the attention module and the feed-forward network module;
the attention module consists of a 1×1 convolution, a GELU activation function and an LKA module; the LKA module is a large-kernel attention layer that helps the network selectively learn input features;
the feed-forward network module consists of a 1×1 ordinary convolution, a 3×3 depth-wise dilated convolution and a GELU activation function, where the dilation rate of the depth-wise dilated convolution equals 3.
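A minimal PyTorch sketch of this LKA-Module follows. The 5×5/7×7 kernel split inside LKA and the 4× channel expansion in the FFN are assumptions (the text fixes only the 1×1 convolutions, the GELU activations, the 3×3 depth-wise dilated convolution with d = 3, and the batch normalization); the residual connections follow the usual Transformer-block layout:

```python
import torch
import torch.nn as nn

class LKA(nn.Module):
    # Large-kernel attention: depth-wise conv, depth-wise dilated conv
    # (dilation 3), and a 1x1 point-wise conv; the result gates the input
    # multiplicatively. Kernel sizes 5 and 7 are assumed here.
    def __init__(self, c):
        super().__init__()
        self.dw = nn.Conv2d(c, c, 5, padding=2, groups=c)
        self.dwd = nn.Conv2d(c, c, 7, padding=9, dilation=3, groups=c)
        self.pw = nn.Conv2d(c, c, 1)

    def forward(self, x):
        return x * self.pw(self.dwd(self.dw(x)))

class LKAModule(nn.Module):
    # BatchNorm -> Attention (1x1 conv, GELU, LKA, 1x1 conv), then
    # BatchNorm -> FFN (1x1 conv, GELU, 3x3 depth-wise dilated conv with
    # d=3, 1x1 conv), each sub-block wrapped in a residual connection.
    def __init__(self, c):
        super().__init__()
        self.norm1 = nn.BatchNorm2d(c)
        self.attn = nn.Sequential(
            nn.Conv2d(c, c, 1), nn.GELU(), LKA(c), nn.Conv2d(c, c, 1))
        self.norm2 = nn.BatchNorm2d(c)
        self.ffn = nn.Sequential(
            nn.Conv2d(c, 4 * c, 1), nn.GELU(),
            nn.Conv2d(4 * c, 4 * c, 3, padding=3, dilation=3, groups=4 * c),
            nn.Conv2d(4 * c, c, 1))

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        return x + self.ffn(self.norm2(x))
```

The 3×3 convolution with dilation 3 has an effective extent of 7×7, which is how the FFN enlarges its receptive field cheaply.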
Convolution kernels of three different sizes, 7×7, 11×11 and 21×21, are added to the segmentation enhancement module for multi-scale information interaction; at the same time, each K×K kernel is decomposed into a 1×K horizontal kernel and a K×1 vertical kernel, further reducing computational complexity.
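The saving from this strip decomposition is easy to quantify: a K×K depth-wise kernel costs K² weights per channel, while the 1×K plus K×1 pair costs 2K. A small helper (hypothetical, for illustration):

```python
# Per-channel weight counts for a dense KxK depth-wise kernel versus
# its 1xK + Kx1 strip decomposition.
def separable_savings(k):
    full = k * k
    separable = 2 * k
    return full, separable

costs = {k: separable_savings(k) for k in (7, 11, 21)}
```

For the 21×21 kernel this drops the per-channel cost from 441 weights to 42, roughly a 10× reduction at the largest scale.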
A gating mechanism is added to the segmentation enhancement module so that, by recalibrating the weights of the different channels, the model selectively learns important information features. In terms of data flow, the input feature map first passes through a 1×1 ordinary convolution, then through 7×7, 11×11 and 21×21 depth-wise convolutions in parallel to learn multi-scale features; the resulting maps are added element-wise to the original input to form a new feature map, which is then multiplied by per-channel weights obtained from global average pooling, adding an attention mechanism that selectively emphasizes important channels; a BatchNorm batch-normalization layer and a ReLU activation function are added to prevent overfitting during training.
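Putting the pieces together, the data flow of the segmentation enhancement module can be sketched in PyTorch. The strip-decomposed depth-wise branches and the global-average-pooling gate follow the description above; the sigmoid on the channel weights and the 1×1 convolution inside the gate are assumptions, so treat this as a sketch rather than the patented module:

```python
import torch
import torch.nn as nn

class StripDW(nn.Module):
    # KxK depth-wise conv decomposed into 1xK then Kx1 strip convolutions
    def __init__(self, c, k):
        super().__init__()
        self.h = nn.Conv2d(c, c, (1, k), padding=(0, k // 2), groups=c)
        self.v = nn.Conv2d(c, c, (k, 1), padding=(k // 2, 0), groups=c)

    def forward(self, x):
        return self.v(self.h(x))

class SegModSketch(nn.Module):
    # 1x1 conv -> parallel 7/11/21 strip-decomposed depth-wise convs ->
    # add back the original input -> channel gating from global average
    # pooling -> BatchNorm + ReLU.
    def __init__(self, c):
        super().__init__()
        self.proj = nn.Conv2d(c, c, 1)
        self.branches = nn.ModuleList([StripDW(c, k) for k in (7, 11, 21)])
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c, 1), nn.Sigmoid())
        self.out = nn.Sequential(nn.BatchNorm2d(c), nn.ReLU())

    def forward(self, x):
        y = self.proj(x)
        y = sum(b(y) for b in self.branches) + x
        return self.out(y * self.gate(y))
```

The gate recalibrates per-channel weights from pooled statistics, which is the "selectively learn important channels" behaviour the paragraph describes.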
The beneficial effects of the invention are as follows: the invention fuses, for the first time, a large-kernel-convolution attention mechanism with the ELAN structure proposed in YOLOv7 as a completely new backbone network, and combines the large-kernel attention mechanism with a multi-scale information fusion mechanism to provide a segmentation enhancement module for the lane line segmentation task; the module is inserted after the backbone network and before the lane line segmentation head, further improving the accuracy of the lane line detection task.
Drawings
FIG. 1 is a schematic diagram of the overall structure of the algorithm of the present invention;
FIG. 2 is a flow chart of the detection in the present invention;
FIG. 3 is a schematic diagram of one form of an LKA-ELAN module according to the present invention;
FIG. 4 is a schematic diagram of another form of an LKA-ELAN module according to the present invention;
FIG. 5 is a schematic diagram of an LKA-Module structure according to the present invention;
FIG. 6 is a schematic diagram of a SegMod module structure according to the present invention;
the embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Detailed Description
The principles and features of the present invention are described below with reference to the drawings; the examples are given only to illustrate the invention and are not to be construed as limiting its scope. The invention is more particularly described by way of example in the following paragraphs with reference to the drawings. The advantages and features of the present invention will become more apparent from the following description. It should be noted that the drawings are in a very simplified form and not to precise scale, serving merely for convenience and clarity in aiding the description of embodiments of the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
Currently, many researchers have designed multi-task learning networks (MultiNet, DLT-Net, YOLOP) that use an encoder-decoder architecture in which the decoders of the three detection tasks share the same encoder. Marvin Teichmann et al. first introduced multi-task learning into the field of traffic scene perception with the MultiNet network, which uses a VGG16 backbone structure as the encoder, performs feature fusion on the feature maps the encoder generates, and finally sends them to a ClassificationDecoder, a DetectionDecoder and a SegmentationDecoder to complete the three tasks of target classification, target detection and lane line detection. Qian et al., in the DLT-Net network, first fixed the detection tasks as traffic target detection, drivable-area detection and lane line detection, and provided a context tensor to share the information of the DrivableAreaDecoder with the TrafficObjectDecoder and the LaneLineDecoder, which significantly improved overall performance without increasing computational overhead. Wu et al. first brought a YOLO-series network into a multi-task algorithm, taking YOLOv5 as the main structure to complete the target detection task and using an ENet decoder structure to obtain the feature maps for lane line detection and drivable-area detection, making the model lighter and more portable and bringing multi-task learning for traffic scene perception into the YOLO era. Although many excellent networks have been proposed, the detection accuracy and other metrics of these algorithms still leave room for improvement.
In order to further improve the accuracy of multi-task traffic scene perception models, after in-depth study the invention provides a multi-task traffic scene detection algorithm based on an attention mechanism, comprising a shared encoder and three decoders; the shared encoder consists of a Backbone network and a Neck network; the three decoders perform traffic target detection, drivable-area detection and lane line detection, respectively;
the backbone network extracts features from the input image and comprises a convolution module (CBS), a feature extraction module (LKA-ELAN) and a downsampling module (MP). The CBS consists of a Conv convolution layer, a BatchNorm batch-normalization layer and a SiLU activation function; the LKA-ELAN fuses a large-kernel attention mechanism (Large Kernel Attention, LKA) with the ELAN structure and is used to build the backbone; the MP adds a downsampling layer (MaxPooling) alongside the CBS to form two branches whose outputs are fused by concatenation along the channel dimension, so the output feature map has twice the channels of the input and 1/2 its height and width;
the neck network includes a cross-stage spatial pyramid pooling module (Spatial Pyramid Pooling Cross Stage Partial, SPPCSP), a feature pyramid network (Feature Pyramid Networks, FPN) and a path aggregation network (Path Aggregation Networks, PAN);
the SPPCSP expands the receptive field, fuses information from feature maps of different scales and completes feature fusion. During feature propagation, deep feature maps carry stronger semantic features and weaker positional information, while shallow feature maps carry stronger positional information and weaker semantic features; the FPN propagates deep semantic features to the shallow layers, strengthening semantic expression at multiple scales, and the PAN propagates shallow positional information to the deep layers, strengthening localization at multiple scales;
the detection pipeline is shown in fig. 1 and fig. 2. The network input is a 640×640×3 RGB image, which first enters the CBS modules for feature transfer: the 2nd and 4th CBS modules each halve the height and width, so the resulting feature map is 1/4 the input size. The feature map then passes through the LKA-ELAN and MP modules for feature extraction; after three downsampling stages its height and width shrink from 1/4 to 1/32 of the original image. The extracted feature maps are sent to the Neck for multi-scale feature fusion. The traffic target detection branch passes feature maps to three detection heads of different sizes, finally outputting feature maps of sizes (W/8, H/8, 256), (W/16, H/16, 512) and (W/32, H/32, 1024). The drivable-area and lane line detection branches both take input of size (W/8, H/8, 128); the drivable-area branch comprises a BottleneckCSP module and three upsampling stages for feature extraction, after which the output feature map has the same spatial size as the network input; the lane line branch is similar, except that the segmentation enhancement module is inserted before its segmentation head to strengthen semantic information.
A YOLOv7 backbone network is selected as the basic network structure, and its original ELAN structure is replaced with LKA-ELAN to construct an improved YOLOv7 backbone network.
The ELAN structure is an efficient layer aggregation network that improves the network's learning capacity without destroying the original gradient path, and learns more diversified features by guiding the computation blocks of different feature groups. Because the LKA mechanism combines the advantage of self-attention in capturing long-range dependencies with the advantage of convolution in exploiting local context, it is fused with the ELAN structure of YOLOv7 to form the LKA-ELAN;
the LKA-ELAN comprises four CBS layers and two LKA-Module layers. The input passes sequentially through two CBS layers and then through a cascade of two further CBS layers each followed by an LKA-Module layer; each branch outputs a feature map with o = i/2 channels, where o is the number of output channels (Output Channel) and i the number of input channels (Input Channel); the branch outputs are finally concatenated along the channel dimension;
the LKA-ELAN only aggregates all the previous layers in the last layer of the structure, and the structure not only inherits the advantages of representing multiple characteristics by the multiple receptive fields of DenseNet, but also solves the problem of low dense connection efficiency, and meanwhile, compared with VoVNet, the structure adds a large-core attention mechanism, so that the network performance can be further improved.
The feature extraction module (LKA-ELAN) takes two forms: in one, the number of output channels of the first two convolution modules (CBS) is 1/2 of the number of input channels, and the last two convolution modules (CBS) keep the channel count unchanged, as shown in fig. 3; in the other, the first two convolution modules (CBS) output 1/4 of the input channels, and the last two convolution modules (CBS) again keep the channel count unchanged, as shown in fig. 4.
As shown in fig. 5, similar to the DETR and VAN algorithms, the LKA-Module layer contains a batch-normalization layer together with the Attention module and the feed-forward network module (FeedForward Network, FFN) of a Transformer structure; the Attention and FFN are cascaded for feature extraction. To prevent gradient explosion and accelerate model convergence, the input feature map is batch-normalized before entering the Attention and FFN;
the Attention consists of a 1×1 convolution, a GELU activation function and an LKA module, where the LKA module is a large-kernel attention layer that helps the network selectively learn input features;
the FFN consists of a 1×1 ordinary convolution, a 3×3 depth-wise dilated convolution and a GELU activation function, where the dilation rate (d) of the depth-wise dilated convolution equals 3.
The attention mechanism can be seen as an adaptive selection process that selects discriminative features based on the input and automatically suppresses noisy responses; its key step is generating an attention map that represents the importance of the different parts.
Currently, there are two methods to learn the relationship between different features.
The first is to employ a self-attention mechanism to capture long-range dependencies. Although self-attention is very effective in natural language processing, it still has three drawbacks for computer vision tasks: 1) it treats images as one-dimensional sequences, ignoring their two-dimensional structure; 2) its computational complexity grows quadratically with input resolution, making high-resolution images expensive to process; 3) it achieves only spatial adaptivity and ignores adaptivity in the channel dimension.
The second, the method employed by the present invention, constructs the attention feature map from large convolution kernels. As shown in fig. 4, since directly adding large convolution kernels (17×17, 21×21, etc.) to the network would increase the computation cost and hinder increasing the model depth, the K×K convolution kernel in the LKA module is replaced by a (2d−1)×(2d−1) depth-wise convolution, a (K/d)×(K/d) depth-wise dilated convolution, and a 1×1 normal convolution, where both the depth-wise convolution and the depth-wise dilated convolution are grouped convolutions whose number of groups equals the number of input channels. This operation enlarges the receptive field while reducing the parameter count, so more global features can be obtained; the input is then multiplied by the output of the large-kernel processing to add an attention mechanism, so that the input features can be selectively learned more effectively.
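The parameter saving from this decomposition can be checked directly. The sketch below (a minimal PyTorch comparison; `dim=64` is chosen arbitrarily) builds a naive 21×21 depth-wise convolution and its LKA decomposition with K=21, d=3, and compares parameter counts and output shapes:

```python
import torch
import torch.nn as nn

dim, K, d = 64, 21, 3  # example channel count; K=21, d=3 as described above

# naive large-kernel depth-wise convolution: one K x K kernel per channel
naive = nn.Conv2d(dim, dim, K, padding=K // 2, groups=dim)

# LKA decomposition: (2d-1)x(2d-1) depth-wise conv, (K/d)x(K/d) depth-wise
# dilated conv (dilation d), then a 1x1 normal conv; groups = input channels
decomp = nn.Sequential(
    nn.Conv2d(dim, dim, 2 * d - 1, padding=d - 1, groups=dim),
    nn.Conv2d(dim, dim, K // d, padding=((K // d - 1) * d) // 2,
              dilation=d, groups=dim),
    nn.Conv2d(dim, dim, 1))

count = lambda m: sum(p.numel() for p in m.parameters())
x = torch.randn(1, dim, 32, 32)
print(count(naive), count(decomp))        # 28288 vs 9024 parameters
print(naive(x).shape == decomp(x).shape)  # True: both preserve the map size
```

The 5×5 depth-wise stage followed by the dilated 7×7 stage (which spans 19 pixels) yields a combined receptive field of 23 ≥ 21, so the decomposition covers at least the original kernel's field at roughly a third of the depth-wise parameters, while the 1×1 convolution restores channel mixing.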
In the multi-task traffic scene detection algorithm, two detection tasks involve segmentation, namely the drivable area detection task and the lane line detection task. Although both downstream segmentation tasks are effectively improved after replacing the backbone with the large-kernel attention network, the improvement in lane line detection accuracy is relatively small, so a segmentation enhancement module comprising large convolution kernels and a multi-scale information interaction mechanism is proposed to improve the lane line segmentation effect.
A comparison of several classical semantic segmentation models (DeepLabV3+, SETR, SegNeXt) shows that a successful semantic segmentation model should, first, be equipped with a strong backbone network; since the backbone of the multi-task traffic scene detection algorithm is shared by several detection tasks, it is kept unchanged while the lane line segmentation performance is improved elsewhere. Second, the model should support multi-scale information interaction: unlike the image classification task, which mainly identifies a single object, semantic segmentation is a dense prediction task that must handle detection objects of different sizes within a single image, so convolution kernels of three different sizes (7×7, 11×11 and 21×21) are added simultaneously in the segmentation enhancement module to perform multi-scale information interaction, and each K×K convolution kernel is decomposed into a 1×K horizontal convolution kernel and a K×1 vertical convolution kernel to further reduce the computational complexity. Third, an attention mechanism should be provided to better select the input features.
Similar to SENet, a gating mechanism is added to the segmentation enhancement module, allowing the model to selectively learn important information features by recalibrating the weights of the different channels. As shown in fig. 6, from the data-flow perspective, the input feature map first undergoes a 1×1 normal convolution, then passes through 7×7, 11×11 and 21×21 depth-wise convolutions respectively to learn multi-scale information features, and the output feature map is added element-wise to the original input feature map to obtain a new output feature map. To add the attention mechanism, this feature map is multiplied by the per-channel weights obtained through Global Average Pooling (GAP), achieving the effect of selectively learning important channels. During training, a BatchNorm batch normalization layer and a ReLU activation function are added to prevent overfitting.
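The data flow just described can be sketched as follows. The class name, the placement of BatchNorm/ReLU inside the 1×1 stem, and the sigmoid-gated 1×1 convolution after GAP are assumptions (the text only states that GAP channel weights multiply the feature map); the three decomposed strip branches and the residual add follow the description:

```python
import torch
import torch.nn as nn

class SegEnhance(nn.Module):
    """Sketch of the segmentation enhancement module (hypothetical details):
    1x1 conv stem, three parallel depth-wise branches (7x7, 11x11, 21x21,
    each decomposed into 1xK then Kx1 strips), residual add with the input,
    and a SENet-style channel gate built from global average pooling."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Sequential(nn.Conv2d(dim, dim, 1),
                                  nn.BatchNorm2d(dim), nn.ReLU(inplace=True))
        def strip(k):  # K x K depth-wise conv split into 1xK and Kx1 strips
            return nn.Sequential(
                nn.Conv2d(dim, dim, (1, k), padding=(0, k // 2), groups=dim),
                nn.Conv2d(dim, dim, (k, 1), padding=(k // 2, 0), groups=dim))
        self.branches = nn.ModuleList([strip(7), strip(11), strip(21)])
        # channel gate: GAP -> 1x1 conv -> sigmoid (gate form is assumed)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(dim, dim, 1), nn.Sigmoid())

    def forward(self, x):
        y = self.proj(x)
        y = sum(b(y) for b in self.branches) + x  # multi-scale + residual add
        return y * self.gate(y)                   # channel-wise attention
```

Decomposing each K×K depth-wise kernel into 1×K and K×1 strips reduces the per-channel weights from K² to 2K, which is why the 21×21 branch stays affordable.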
The invention fuses a large convolution kernel attention mechanism with the ELAN structure proposed by YOLOv7 for the first time to serve as a brand new backbone network.
Meanwhile, the invention combines a large-kernel attention mechanism and a multi-scale information fusion mechanism to propose a segmentation enhancement module for the lane line segmentation task; the module is added after the backbone network and before the lane line detection segmentation head to further improve the accuracy of the lane line detection task.
While the invention has been described above with reference to the accompanying drawings, it will be apparent that the invention is not limited to the above embodiments; various modifications made using the method concepts and technical solutions of the invention, or direct applications thereof to other occasions without modification, all fall within the protection scope of the invention.

Claims (7)

1. A multi-task traffic scene detection algorithm based on an attention mechanism is characterized in that,
comprises a shared encoder and three decoders; the shared encoder consists of a backbone network and a neck network; the three decoders respectively finish the traffic target, the drivable area and the lane line detection tasks;
the backbone network is used for extracting the characteristics of an input image and comprises a convolution module, a characteristic extraction module and a downsampling module, wherein the convolution module consists of a Conv convolution layer, a BatchNorm batch normalization layer and a SiLU activation function; the feature extraction module fuses a large-core attention mechanism and an ELAN structure, and builds a backbone network to extract features; the downsampling module adds a downsampling layer on the basis of the convolution module to form two branches, and finally performs feature fusion through dimension addition, wherein the number of channels of the output feature map is 2 times that of the input feature map, and the length and the width of the output feature map are 1/2 of that of the input feature map;
the neck network comprises a cross-stage space pyramid pooling module, a characteristic pyramid network and a path aggregation network; the cross-stage space pyramid pooling module is used for expanding receptive fields, fusing information of feature graphs with different scales and finishing feature fusion; in the feature map transmission process, a deep feature map carries strong semantic features and weak position information, a shallow feature map carries strong position information and weak semantic features, a feature pyramid network transmits the deep semantic features to the shallow layer, so that semantic expression on multiple scales is enhanced, a path aggregation network transmits the shallow position information to the deep layer, and positioning capability on multiple scales is enhanced;
the specific process of the detection algorithm is as follows; the input of the network is 640 x 3 RGB pictures, firstly, the RGB pictures enter a convolution module to conduct feature transfer, the length and the width of the feature pictures of the 2 nd convolution module and the 4 th convolution module are respectively reduced by 1/2, the length and the width of the output feature pictures are 1/4 of the input, the feature pictures enter a feature extraction module and a downsampling module to conduct feature extraction, the length and the width of the output feature pictures are reduced to 1/32 of an original image from 1/4 after three downsampling, then the extracted feature pictures are sent to a neck network to conduct multi-scale feature fusion, the traffic target detection module transmits the feature pictures to three traffic target detection heads with different sizes, finally, three feature pictures with the sizes of (W/8,H/8,256), (W/16, H/16,512), (W/32, H/32,1024) are respectively output, the input sizes of a drivable region detection module and a lane line detection module are (W/8,H/8,128), the drivable region detection module comprises a Boleneck CSP module and three downsampling modules for feature extraction, after the input information transfer, the output size of the obtained feature pictures is (W/8,H) is the same as that the input of the post-detection module can be subjected to the segmentation of the road network, and the road detection module can be subjected to the following the road detection by the road detection module.
2. The attention mechanism-based multi-task traffic scene detection algorithm of claim 1, wherein,
and selecting a backbone network of the YOLOv7 as a basic network structure, and replacing the original ELAN structure by a feature extraction module on the basis of the basic network structure to construct an improved backbone network of the YOLOv 7.
3. The attention mechanism-based multi-task traffic scene detection algorithm of claim 2, wherein,
the ELAN structure is an Efficient Layer Aggregation Network; it can improve the network's learning capacity without destroying the original gradient path and, by guiding computation blocks of different feature groups, can learn more diversified features; because the LKA mechanism combines the advantage of the self-attention mechanism in capturing long-range dependencies with the advantage of convolution in exploiting local context information, the LKA mechanism is fused with the ELAN structure in YOLOv7 to form the feature extraction module;
the feature extraction Module comprises four convolution modules and two LKA-Module layers, wherein the input sequentially passes through the two convolution modules, the two convolution modules and the LKA-Module layer cascade structure, feature graphs with the number of channels o=i/2 are respectively output, wherein o is the number of output channels, i is the number of input channels, and finally dimension addition is carried out on the output feature graphs.
4. The attention mechanism-based multi-task traffic scene detection algorithm of claim 3, wherein,
the feature extraction module comprises two forms, wherein one form is that the number of output channels of the first two convolution modules is 1/2 of the number of input channels, and the number of the input channels and the number of the output channels of the second two convolution modules are the same; the other form is that the number of output channels of the first two convolution modules is 1/4 of the number of input channels, and the number of input channels and the number of output channels of the second two convolution modules are the same.
5. The attention mechanism-based multi-task traffic scene detection algorithm of claim 4, wherein,
the LKA-Module layer comprises a BatchNorm batch normalization layer together with the attention module and feed-forward neural network module of the Transformer structure, wherein the attention module and the feed-forward neural network module are cascaded to perform feature extraction; in order to prevent gradient explosion and accelerate model convergence, the input feature map is first batch-normalized and then enters the attention module and the feed-forward neural network module;
the attention module consists of a 1×1 convolution, a GELU activation function and an LKA module, wherein the LKA module is a large-kernel attention layer that helps the network selectively learn the input features;
the feed-forward neural network module consists of a 1×1 normal convolution, a 3×3 depth-wise dilated convolution and a GELU activation function, wherein the dilation rate of the depth-wise dilated convolution is equal to 3.
6. The attention mechanism-based multi-task traffic scene detection algorithm of claim 5, wherein,
convolution kernels of three different sizes, 7×7, 11×11 and 21×21, are added to the segmentation enhancement module to perform multi-scale information interaction, and meanwhile each K×K convolution kernel is decomposed into a 1×K horizontal convolution kernel and a K×1 vertical convolution kernel, further reducing the computational complexity.
7. The attention mechanism-based multi-task traffic scene detection algorithm of claim 6, wherein,
a gating mechanism is added to the segmentation enhancement module, and the model is made to selectively learn important information features by recalibrating the weights of different channels; from the data-flow perspective, the input feature map first undergoes a 1×1 normal convolution, then passes through 7×7, 11×11 and 21×21 depth-wise convolutions respectively to learn multi-scale information features, and the output feature map is added element-wise to the original input feature map to obtain a new output feature map; to add the attention mechanism, this feature map is multiplied by the per-channel weights obtained through global average pooling, achieving the effect of selectively learning important channels; a BatchNorm batch normalization layer and a ReLU activation function are added during training to prevent overfitting.
CN202310696843.9A 2023-06-13 2023-06-13 Attention mechanism-based multi-task traffic scene detection algorithm Pending CN116958910A (en)

Publications (1)

Publication Number Publication Date
CN116958910A true CN116958910A (en) 2023-10-27

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117934473A (en) * 2024-03-22 2024-04-26 成都信息工程大学 Highway tunnel apparent crack detection method based on deep learning
CN117934473B (en) * 2024-03-22 2024-05-28 成都信息工程大学 Highway tunnel apparent crack detection method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination