CN113887588A - Vehicle detection method and device based on attention mechanism and feature weighting fusion

Vehicle detection method and device based on attention mechanism and feature weighting fusion

Info

Publication number
CN113887588A
CN113887588A
Authority
CN
China
Prior art keywords
feature
feature map
fusion
attention
vehicle detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111094013.6A
Other languages
Chinese (zh)
Inventor
刘丽
梁鹏
雷雪梅
邵立珍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Shunde Graduate School of USTB
Original Assignee
University of Science and Technology Beijing USTB
Shunde Graduate School of USTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB, Shunde Graduate School of USTB filed Critical University of Science and Technology Beijing USTB
Priority to CN202111094013.6A priority Critical patent/CN113887588A/en
Publication of CN113887588A publication Critical patent/CN113887588A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a vehicle detection method and device based on an attention mechanism and weighted feature fusion. The method comprises the following steps: preprocessing an image to be detected; inputting the preprocessed image into a pre-trained vehicle detection model; generating a channel attention feature map and a spatial attention feature map from the preprocessed image by using the vehicle detection model with a channel-and-spatial two-dimensional attention mechanism; performing differentiated feature fusion based on the channel attention feature map and the spatial attention feature map through a weighted bidirectional feature fusion network to obtain fusion features; and obtaining a detection result containing vehicle position and size information based on the fusion features. Compared with existing vehicle detection techniques, the method maintains a high detection speed while reducing the parameter count and improving detection precision, with a particularly marked improvement for small-scale vehicles.

Description

Vehicle detection method and device based on attention mechanism and feature weighting fusion
Technical Field
The invention relates to the technical field of vehicle target detection, and in particular to a vehicle detection method and device based on an attention mechanism and weighted feature fusion.
Background
Vehicle detection refers to segmenting image instances and scenes in a given image or video sequence based on the geometric and statistical characteristics of vehicle targets, determining the size and position of each vehicle in the image, and outputting this information in the form of detection boxes.
Vehicle detection methods fall mainly into traditional algorithms based on hand-crafted features and detection methods based on deep neural networks. Traditional algorithms extract features that are designed manually, so the quality of the hand-crafted features directly determines the precision of target detection; in road monitoring scenes with complex, changeable environments and low video resolution, missed and false detections occur easily. Such methods also suffer from limitations including poor portability and high computational redundancy.
Deep learning-based methods require no manual feature extraction: raw input data are transformed through nonlinear models into deeper, more abstract features that capture higher-level visual information. Their strong representation-learning capability has made them the mainstream approach to vehicle target detection. However, vehicle detection in road monitoring video faces complex and varied scenes, severe interference, small target sizes, and large scale variation. Existing deep-neural-network vehicle detectors perform poorly on distant, small-target vehicles and generalize weakly. Common vehicle detection methods are strongly affected by scene changes, adapt poorly to road scenes, and struggle to balance detection speed, model size, and detection precision. For example, the Faster Region-based Convolutional Neural Network (Faster R-CNN) first generates candidate regions and then performs classification and regression on them, achieving accurate detection and classification but at a low detection speed. The YOLO series of algorithms regresses target category and position directly and detects quickly, but sacrifices some detection precision: image features are insufficiently extracted, and as the convolutions deepen, large-target features are readily retained while small-target features are increasingly ignored, so small targets are detected poorly. In addition, excessive network parameters lengthen training, enlarge the model, and harm portability.
Disclosure of Invention
The invention provides a vehicle detection method and device based on an attention mechanism and weighted feature fusion, aiming to solve the technical problems of low precision and large model parameter counts in existing vehicle detection techniques.
In order to solve the technical problems, the invention provides the following technical scheme:
in one aspect, the invention provides a vehicle detection method based on attention mechanism and feature weighted fusion, which includes:
preprocessing an image to be detected to obtain a preprocessed image to be detected;
inputting the preprocessed image to be detected into a vehicle detection model which is trained in advance;
generating a channel attention feature map and a spatial attention feature map by using the vehicle detection model to apply a channel-and-spatial two-dimensional attention mechanism to the preprocessed image to be detected;
performing differentiated feature fusion based on the channel attention feature map and the spatial attention feature map through a weighted bidirectional feature fusion network to obtain fusion features;
and obtaining a detection result containing the position and size information of the vehicle based on the fusion characteristics.
Further, the preprocessing is to perform normalization processing on the image to be detected so as to unify the specification and size of the image.
Further, the vehicle detection model includes: a feature extraction layer, a feature fusion layer, a convolutional block attention module, a spatial pyramid pooling layer, a convolution layer, a target detection layer and an activation function; wherein the feature extraction layer comprises a slicing operation; the target detection layer comprises three different size scales; and the convolutional block attention module comprises a channel dimension attention module and a spatial dimension attention module.
Further, generating a channel attention feature map and a spatial attention feature map by using the vehicle detection model to apply a channel-and-spatial two-dimensional attention mechanism to the preprocessed image to be detected comprises the following steps:
slicing the input image to be detected through the feature extraction layer to obtain four feature-map branches at half the original scale, and splicing the feature-map branches to obtain a first feature map;
performing a convolution operation on the first feature map through the convolution layer to obtain a second feature map;
passing the second feature map through a global max pooling layer and a global average pooling layer in the channel dimension attention module to obtain a third feature map and a fourth feature map, inputting the third and fourth feature maps into a multilayer perceptron, performing element-wise addition on the outputs of the multilayer perceptron, and activating the result with an activation function to generate the channel attention feature map;
and performing channel-wise global max pooling and global average pooling on the channel attention feature map in the spatial dimension attention module to obtain a fifth feature map and a sixth feature map, concatenating (concat) the fifth and sixth feature maps along the channel dimension, reducing the channel dimension with a 7 × 7 convolution, and finally activating the result with an activation function to generate the spatial attention feature map.
Further, performing differentiated feature fusion based on the channel attention feature map and the spatial attention feature map through a weighted bidirectional feature fusion network to obtain fusion features comprises:
performing element-wise multiplication of the channel attention feature map with the second feature map to obtain a seventh feature map;
performing element-wise multiplication of the spatial attention feature map with the channel attention feature map to obtain an eighth feature map;
and learning the weights of different input features by adopting a weighted bidirectional feature fusion network in the feature fusion layer, and performing differentiated feature fusion on the seventh feature map and the eighth feature map to obtain fusion features.
Further, the feature fusion layer adopts a BIFPN structure.
Further, obtaining a detection result containing vehicle position and size information based on the fusion features includes:
detecting targets of different sizes through the three detection layers of different scales in the target detection layer based on the fusion features, to obtain a detection result containing vehicle position and size information.
Further, the target detection layer comprises detection layers with three dimensions of 160 × 160, 80 × 80 and 40 × 40.
Further, the convolutional layer first obtains a first group of feature maps through a limited ordinary convolution, then generates a second group of feature maps from the first group through linear operations, and finally splices the first and second groups of feature maps along a specified dimension as the final convolution result; wherein:
the ordinary convolution operation is defined as:
Y = X · ω + b
where Y represents the result of the ordinary convolution operation; X ∈ R^(c×h×w) represents the input of the ordinary convolution operation, with c, h and w respectively denoting the number of input channels and the height and width of the feature map; ω ∈ R^(c×k×k×n) denotes that the convolution is performed with c × n convolution kernels of size k × k; and b is a bias term;
the linear operation is defined as:
Y′ = X · ω′
Y_ij = Φ_i,j(Y′_i), i ∈ [1, m], j ∈ [1, s]
where Y′ ∈ R^(h′×w′×m) represents the m intrinsic feature maps obtained by convolving the input X, with m ≤ n; h′ and w′ respectively denote the height and width of the intrinsic feature maps; the linear operations above generate the remaining n - m feature maps; Φ_i,j is the linear operation by which each intrinsic feature map generates its j-th Ghost feature map Y_ij; and s denotes the number of times each feature map in Y′ is mapped.
In another aspect, the present invention further provides an attention mechanism and feature weighted fusion based vehicle detection apparatus, including:
the preprocessing module is used for preprocessing the image to be detected to obtain a preprocessed image to be detected;
the data input module is used for inputting the image to be detected, preprocessed by the preprocessing module, into a vehicle detection model trained in advance;
the attention feature map generation module is used for generating a channel attention feature map and a spatial attention feature map by using the vehicle detection model to apply a channel-and-spatial two-dimensional attention mechanism to the preprocessed image to be detected input by the data input module;
the feature weighting fusion module is used for performing differentiated feature fusion based on the channel attention feature map and the spatial attention feature map through a weighted bidirectional feature fusion network to obtain fusion features;
and the target detection module is used for obtaining a detection result containing vehicle position and size information based on the fusion features output by the feature weighting fusion module.
In yet another aspect, the present invention also provides an electronic device comprising a processor and a memory; wherein the memory has stored therein at least one instruction that is loaded and executed by the processor to implement the above-described method.
In yet another aspect, the present invention also provides a computer-readable storage medium having at least one instruction stored therein, the instruction being loaded and executed by a processor to implement the above method.
The technical scheme provided by the invention has the beneficial effects that at least:
according to the invention, a channel and space two-dimensional attention mechanism is adopted in the backbone network, and compared with other algorithms only adopting a single-dimensional attention mechanism, the two-dimensional structure can acquire more key vehicle characteristic information. In addition, the invention adopts a weighted bidirectional feature fusion method, balances different channels of the feature diagram, introduces a staged convolution calculation module, and performs differentiated fusion on different input features, thereby eliminating redundant features and reducing the calculation amount in the feature channel fusion process. In addition, the invention adopts a lightweight technology, only adopts limited common convolution, generates more characteristic graphs through simple linear operation, and then splices two groups of characteristic graphs, thereby reducing the size of the model. Compared with other related algorithms, the algorithm provided by the invention has the advantages that the detection speed is kept higher, the parameter quantity is reduced, the detection precision is improved, and the detection effect on small-scale vehicles is particularly obvious.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an implementation environment of a vehicle detection method based on an attention mechanism and feature weighted fusion according to a first embodiment of the present invention;
FIG. 2 is a schematic flowchart of an implementation of a vehicle detection method based on attention mechanism and feature weighted fusion according to a first embodiment of the present invention;
FIG. 3 is a schematic diagram of a network structure of a vehicle detection model according to a first embodiment of the present invention;
FIG. 4 is a schematic diagram of a network structure of an attention mechanism module according to a first embodiment of the present invention;
FIG. 5 is a schematic diagram of a network structure of a weighted bidirectional feature pyramid according to a first embodiment of the present invention;
fig. 6 is a schematic structural diagram of a GhostBottleneck module according to the first embodiment of the present invention; wherein (a) is a schematic diagram of the convolution operation, and (b) is a schematic diagram of the linear operation;
FIG. 7 is a schematic diagram of a training process of a vehicle detection model according to a first embodiment of the present invention;
FIG. 8 is a block diagram of a vehicle detecting apparatus based on attention mechanism and feature weighted fusion according to a second embodiment of the present invention;
fig. 9 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
First embodiment
This embodiment provides a vehicle detection method based on an attention mechanism and weighted feature fusion. The implementation environment of the method, shown in fig. 1, comprises at least one terminal 101 and at least one server 102 that provides services for the terminal 101. The terminal 101 is connected to the server 102 via a wireless or wired network and may be a computer or an intelligent terminal capable of accessing the server 102. In one arrangement, the image to be recognized is acquired through the terminal 101 and sent to the server 102, which stores a pre-trained vehicle detection model; the server 102 then inputs the image into the vehicle detection model so that the model recognizes it. Alternatively, the recognition may be completed directly on the terminal 101, with the server 102 obtaining a training sample data set and training the terminal's vehicle detection model with it; the invention is not limited in this respect.
Specifically, the execution flow of the method is shown in fig. 2, and includes the following steps:
and S1, preprocessing the image to be detected to obtain a preprocessed image to be detected.
In this embodiment, in order to give the images input to the model a uniform specification and size, normalization processing is performed on the image to be detected, and the preprocessed image is obtained.
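For illustration only, a minimal PyTorch-style sketch of such a preprocessing step is given below; the 640 × 640 target size is an assumption inferred from the 320 × 320 feature maps produced by the slicing step in S31, not a value fixed by this embodiment.

```python
import torchvision.transforms as T

# Sketch only: unify image specification and size before feeding the model.
# The 640x640 size is assumed from the 320x320x12 map produced by slicing.
preprocess = T.Compose([
    T.Resize((640, 640)),  # unify spatial size
    T.ToTensor(),          # HWC uint8 -> CHW float32 scaled to [0, 1]
])
```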
And S2, inputting the preprocessed image to be detected into a vehicle detection model trained in advance.
It should be noted that in this embodiment the network structure of the vehicle detection model, shown in fig. 3, includes: a feature extraction layer, a feature fusion layer, a convolutional block attention module (CBAM), a spatial pyramid pooling (SPP) layer, a convolution layer, a target detection layer and an activation function; wherein the feature extraction layer comprises a slicing operation; the target detection layer comprises three different size scales; and the convolutional block attention module comprises a channel dimension attention module and a spatial dimension attention module.
And S3, generating a channel attention feature map and a space attention feature map by using the vehicle detection model and a channel and space two-dimensional attention mechanism based on the preprocessed image to be detected.
It should be noted that, in this embodiment, a network structure of the two-dimensional attention mechanism is shown in fig. 4, based on which the implementation process of S3 is specifically as follows:
S31, slicing the input image to be detected through the feature extraction layer to obtain four feature-map branches at half the original scale, and splicing them to obtain a first feature map of size 320 × 320 × 12.
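A minimal sketch of this slicing operation (the Focus operation familiar from YOLOv5), assuming a PyTorch (B, C, H, W) tensor layout; the function name is illustrative:

```python
import torch

def focus_slice(x: torch.Tensor) -> torch.Tensor:
    """Slice an image into four half-scale branches and splice them along
    the channel axis, e.g. (B, 3, 640, 640) -> (B, 12, 320, 320)."""
    return torch.cat([x[..., ::2, ::2],     # even rows, even columns
                      x[..., 1::2, ::2],    # odd rows, even columns
                      x[..., ::2, 1::2],    # even rows, odd columns
                      x[..., 1::2, 1::2]],  # odd rows, odd columns
                     dim=1)
```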
S32, performing a convolution operation on the first feature map through the convolution layer to obtain a second feature map of size 320 × 320 × 48; the number of convolution kernels is 48 and the kernel size is 3 × 3.
S33, in the channel dimension attention module, passing the second feature map through a global max pooling layer and a global average pooling layer to obtain a third feature map and a fourth feature map of size 1 × 1, inputting each into a simple multilayer perceptron, performing element-wise addition on the perceptron outputs, and activating the result with a sigmoid activation function to generate the channel attention feature map.
S34, in the spatial dimension attention module, taking the output of the channel dimension attention module as input, performing channel-wise global max pooling and global average pooling to obtain a fifth feature map and a sixth feature map, concatenating (concat) them along the channel dimension, reducing the channel dimension with a 7 × 7 convolution, and finally activating the result with a sigmoid activation function to generate the spatial attention feature map.
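Steps S33 and S34 follow the convolutional block attention module (CBAM) design. A minimal PyTorch sketch is given below, assuming the standard CBAM layout; the reduction ratio of 16 in the multilayer perceptron is an assumption rather than a value from this embodiment. The forward pass also shows the element-wise multiplications applied in step S41 below.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # S33: global max/avg pooling -> shared MLP -> element-wise add -> sigmoid.
    def __init__(self, channels: int, reduction: int = 16):  # reduction assumed
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    # S34: channel-wise max/avg pooling -> concat -> 7x7 conv -> sigmoid.
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):
        mx, _ = torch.max(x, dim=1, keepdim=True)
        avg = torch.mean(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([mx, avg], dim=1)))

class CBAM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)   # element-wise multiplication with the input (S41)
        return x * self.sa(x)
```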
And S4, performing differentiated feature fusion based on the channel attention feature map and the space attention feature map through a weighted bidirectional feature fusion network to obtain fusion features.
It should be noted that, in this embodiment, the implementation process of S4 is as follows:
S41, performing element-wise multiplication of the feature maps output by the channel dimension attention module and the spatial dimension attention module with their respective inputs, and taking the results as the input of the next-layer module; the processing of the two-dimensional attention mechanism feature maps is shown in fig. 4.
And S42, learning the weights of different input features by adopting a weighted bidirectional feature fusion network in the feature fusion layer, and performing differentiated feature fusion to obtain fusion features.
In the feature fusion layer, the feature pyramid + path aggregation network fusion of the YOLOv5 network is replaced by the BIFPN structure shown in fig. 5, and the weights of the different input features are learned by the network. The output feature of each layer, P_i^out, is computed as:

P_i^out = Conv((w1 · P_i^in + w2 · P_i^td + w3 · Resize(P_{i-1}^out)) / (w1 + w2 + w3 + ε))

where P_i^td represents an intermediate feature, defined as:

P_i^td = Conv((w′1 · P_i^in + w′2 · Resize(P_{i+1}^in)) / (w′1 + w′2 + ε))

In these formulas, P_i^in is the input to the i-th layer, P_{i+1}^in is the input of the (i + 1)-th layer, P_{i-1}^out is the output of the (i - 1)-th layer, Conv(·) denotes a convolution operation, Resize is a sampling operation, w1, w2, w3, w′1 and w′2 are the weighting factors of the different feature layers, i.e. the parameters the network needs to learn, each lying between 0 and 1, and ε = 0.0001 is added to avoid numerical instability.
And S5, obtaining a detection result containing the vehicle position and size information based on the fusion characteristics.
It should be noted that, in this embodiment, the implementation process of S5 is as follows:
s51, generating a detection frame containing vehicle position and size information through three detection layers with different scales in the target detection layer;
and S52, detecting the targets with different sizes to obtain a detection result containing the information of the position and the size of the vehicle.
The target detection layer deletes the 20 × 20-scale detection layer of the YOLOv5 network and adds a 160 × 160-scale detection layer; increasing the size of the feature layer shrinks the neuron receptive field and reduces the loss of detail semantics. Finally, targets of sizes 4 × 4, 8 × 8 and 16 × 16 are detected through the detection layers of scales 160 × 160, 80 × 80 and 40 × 40 respectively, determining target positions and category information.
Furthermore, this embodiment adopts the lightweight GhostBottleneck module to reduce computation and shrink the model. Specifically, as shown in fig. 6, the convolutional layer first obtains part of the feature maps through a limited ordinary convolution, then generates further feature maps from these through linear operations, and finally splices the two groups of feature maps along a specified dimension as the final convolution result; wherein:
the ordinary convolution operation is defined as:
Y = X · ω + b
where Y represents the result of the ordinary convolution operation; X ∈ R^(c×h×w) represents the input of the ordinary convolution operation, with c, h and w respectively denoting the number of input channels and the height and width of the feature map; ω ∈ R^(c×k×k×n) denotes that the convolution is performed with c × n convolution kernels of size k × k; and b is a bias term;
the linear operation is defined as:
Y′ = X · ω′
Y_ij = Φ_i,j(Y′_i), i ∈ [1, m], j ∈ [1, s]
where Y′ ∈ R^(h′×w′×m) represents the m intrinsic feature maps obtained by convolving the input X, with m ≤ n; h′ and w′ respectively denote the height and width of the intrinsic feature maps; the linear operations above generate the remaining n - m feature maps; Φ_i,j is the linear operation by which each intrinsic feature map generates its j-th Ghost feature map Y_ij; and s denotes the number of times each feature map in Y′ is mapped.
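A minimal PyTorch sketch of such a Ghost convolution, following the GhostNet design; the ratio of intrinsic to total feature maps (here m = n/2) and the 3 × 3 depthwise kernel used as the cheap linear operation are assumptions, not values from this embodiment.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Limited ordinary convolution -> cheap linear (depthwise) ops -> splice."""
    def __init__(self, c_in: int, c_out: int, kernel: int = 1, dw: int = 3):
        super().__init__()
        m = c_out // 2                      # m intrinsic maps, m <= n (= c_out)
        self.primary = nn.Sequential(       # limited ordinary convolution
            nn.Conv2d(c_in, m, kernel, padding=kernel // 2, bias=False),
            nn.BatchNorm2d(m), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(         # linear operations Phi_{i,j}
            nn.Conv2d(m, c_out - m, dw, padding=dw // 2, groups=m, bias=False),
            nn.BatchNorm2d(c_out - m), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)                          # intrinsic feature maps
        return torch.cat([y, self.cheap(y)], dim=1)  # splice the two groups
```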
Further, a training flow of the vehicle detection model of the embodiment is shown in fig. 7, which specifically includes the following steps:
step 1, a training sample data set and an initial vehicle detection model as shown in fig. 3 are obtained.
It should be noted that one complete pass of the data set forward and backward through the neural network is called an epoch; that is, all training samples of the data set undergo one forward propagation and one back propagation in the network.
Step 2, performing data enhancement: brightness and contrast transformation, image compression, and random cutting and splicing are adopted to obtain an enhanced sample data set.
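An illustrative sketch of the first two enhancement operations is given below (the random cut-and-splice step is omitted for brevity); all parameter ranges are assumptions, not values from this embodiment.

```python
import io
import random
from PIL import Image, ImageEnhance

def augment(img: Image.Image) -> Image.Image:
    """Brightness/contrast jitter plus JPEG compression (assumed ranges)."""
    img = ImageEnhance.Brightness(img).enhance(random.uniform(0.6, 1.4))
    img = ImageEnhance.Contrast(img).enhance(random.uniform(0.6, 1.4))
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=random.randint(40, 95))
    return Image.open(io.BytesIO(buf.getvalue()))
```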
Step 3, performing preliminary training of the initial vehicle detection model with the GIoU Loss function to obtain a pre-trained vehicle detection model.
The GIoU loss function is defined as:
L_GIoU = 1 - GIoU
GIoU = IoU - |A_C - U| / |A_C|
where IoU is the ratio of the intersection area to the union area of the two detection boxes, U is the union area, and A_C is the area of the smallest closed box, i.e. the minimum circumscribed rectangle, that can enclose the two detection boxes.
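A minimal PyTorch sketch of this loss, assuming boxes in (x1, y1, x2, y2) format:

```python
import torch

def giou_loss(box1: torch.Tensor, box2: torch.Tensor) -> torch.Tensor:
    """L_GIoU = 1 - GIoU, with GIoU = IoU - (A_C - U) / A_C."""
    # intersection
    x1 = torch.max(box1[..., 0], box2[..., 0])
    y1 = torch.max(box1[..., 1], box2[..., 1])
    x2 = torch.min(box1[..., 2], box2[..., 2])
    y2 = torch.min(box1[..., 3], box2[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    # union U
    area1 = (box1[..., 2] - box1[..., 0]) * (box1[..., 3] - box1[..., 1])
    area2 = (box2[..., 2] - box2[..., 0]) * (box2[..., 3] - box2[..., 1])
    union = area1 + area2 - inter
    iou = inter / union.clamp(min=1e-9)
    # A_C: smallest closed box (minimum circumscribed rectangle)
    cw = torch.max(box1[..., 2], box2[..., 2]) - torch.min(box1[..., 0], box2[..., 0])
    ch = torch.max(box1[..., 3], box2[..., 3]) - torch.min(box1[..., 1], box2[..., 1])
    a_c = (cw * ch).clamp(min=1e-9)
    return 1.0 - (iou - (a_c - union) / a_c)
```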
Further, in this embodiment, the vehicle detection model is evaluated with the mean average detection precision (mAP), defined as:

mAP = (1/C) · Σ_c AP_c, with AP = Σ_{k=1}^{N} P(k) · ΔR(k)

where R is the recall rate and P is the precision rate:

R = TP / (TP + FN)
P = TP / (TP + FP)

Here TP is the number of positive samples the algorithm predicts as positive, FN is the number of positive samples predicted as negative, FP is the number of negative samples predicted as positive, N is the number of samples, P(k) is the precision when k samples have been recognized, ΔR(k) is the change in recall as the number of recognized samples goes from k - 1 to k, and C is the number of classes.
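A small sketch of this evaluation, assuming detections already ranked by confidence with cumulative precision and recall computed per class:

```python
import numpy as np

def average_precision(precision: np.ndarray, recall: np.ndarray) -> float:
    """AP = sum over k of P(k) * (R(k) - R(k-1)), with R(0) = 0."""
    delta_r = np.diff(np.concatenate(([0.0], recall)))
    return float(np.sum(precision * delta_r))

def mean_average_precision(per_class_ap: list) -> float:
    """mAP: average the per-class AP values over the C classes."""
    return sum(per_class_ap) / len(per_class_ap)
```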
Further, in this embodiment, the learning-rate parameters are dynamically adjusted with cosine annealing decay during training of the vehicle detection model.
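A minimal sketch using PyTorch's built-in cosine annealing schedule; the optimizer, initial learning rate, and period T_max below are assumptions:

```python
import torch

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for model.parameters()
optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    # ... one training epoch: forward, loss, backward, optimizer.step() ...
    scheduler.step()  # decay the learning rate along a cosine curve
```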
Step 4, inputting the training sample data set into the pre-trained vehicle detection model and training it according to the loss function until the convergence of the loss value and the change in precision meet the requirements; training is then complete, yielding the trained vehicle detection model.
In summary, this embodiment adopts a channel-and-spatial two-dimensional attention mechanism in the backbone network; compared with algorithms that adopt only a one-dimensional attention mechanism, the two-dimensional structure can capture more of the critical vehicle feature information. The embodiment also adopts a weighted bidirectional feature fusion method that balances the different channels of the feature map, and introduces a staged convolution calculation module that fuses different input features in a differentiated manner, eliminating redundant features and reducing the computation in the feature-channel fusion process. In addition, the embodiment adopts a lightweight technique: only a limited ordinary convolution is used, additional feature maps are generated through simple linear operations, and the two groups of feature maps are spliced, reducing the model size. Compared with other related algorithms, the vehicle detection method based on an attention mechanism and weighted feature fusion provided by this embodiment maintains a faster detection speed while reducing parameters and improving detection precision, with a particularly remarkable effect on small-scale vehicles.
Second embodiment
The embodiment provides a vehicle detection device based on attention mechanism and feature weighted fusion, and the system structure of the vehicle detection device 800 is shown in fig. 8, and comprises the following modules:
the preprocessing module 801 is used for preprocessing an image to be detected to obtain a preprocessed image to be detected;
the data input module 802 is configured to input the image to be detected, which is preprocessed by the preprocessing module 801, into a vehicle detection model trained in advance;
an attention feature map generation module 803, configured to generate a channel attention feature map and a space attention feature map by using the vehicle detection model and using a channel and space two-dimensional attention mechanism based on the preprocessed image to be detected input by the data input module 802;
a feature weighting fusion module 804, configured to perform differentiated feature fusion based on the channel attention feature map and the spatial attention feature map through a weighted bidirectional feature fusion network to obtain a fusion feature;
and the target detection module 805 is configured to obtain a detection result including information of the position and size of the vehicle based on the fusion features output by the feature weighting fusion module 804.
The vehicle detection apparatus 800 based on attention mechanism and feature weighted fusion of this embodiment corresponds to the vehicle detection method based on attention mechanism and feature weighted fusion of the first embodiment described above; the functions implemented by the functional modules of the apparatus 800 correspond one-to-one with the flow steps of the method of the first embodiment, and are therefore not described again here.
Third embodiment
This embodiment provides an electronic device. As shown in fig. 9, the electronic device 900, whose configuration and performance may vary considerably, comprises one or more processors (CPUs) 901 and one or more memories 902, wherein the memory 902 stores at least one instruction that is loaded and executed by the processor 901 to implement the method of the first embodiment.
Fourth embodiment
The present embodiment provides a computer-readable storage medium, in which at least one instruction is stored, and the instruction is loaded and executed by a processor to implement the method of the first embodiment. The computer readable storage medium may be, among others, ROM, Random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like. The instructions stored therein may be loaded and executed by a processor in the terminal.
Furthermore, it should be noted that the present invention may be provided as a method, apparatus or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied in the medium.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It is further noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another without necessarily requiring or implying any actual such relationship or order between the entities or actions. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or terminal that comprises the element.
Finally, it should be noted that the above describes preferred embodiments of the invention. Those skilled in the art may make numerous changes and modifications without departing from the principles of the invention once the basic inventive concepts are known, and such changes and modifications shall be deemed within the scope of the invention. Therefore, the appended claims are intended to be interpreted as covering the preferred embodiments and all alterations and modifications that fall within the scope of the embodiments of the invention.

Claims (10)

1. A vehicle detection method based on attention mechanism and feature weighted fusion is characterized by comprising the following steps:
preprocessing an image to be detected to obtain a preprocessed image to be detected;
inputting the preprocessed image to be detected into a vehicle detection model which is trained in advance;
generating a channel attention feature map and a spatial attention feature map by using the vehicle detection model to apply a channel-and-spatial two-dimensional attention mechanism to the preprocessed image to be detected;
performing differentiated feature fusion based on the channel attention feature map and the spatial attention feature map through a weighted bidirectional feature fusion network to obtain fusion features;
and obtaining a detection result containing the position and size information of the vehicle based on the fusion characteristics.
2. The method according to claim 1, wherein the preprocessing is a normalization processing of the image to be detected so as to unify the image specification and size.
3. The method of vehicle detection based on attention mechanism and feature-weighted fusion of claim 1, wherein the vehicle detection model comprises: a feature extraction layer, a feature fusion layer, a convolutional block attention module, a spatial pyramid pooling layer, a convolution layer, a target detection layer and an activation function; wherein the feature extraction layer comprises a slicing operation; the target detection layer comprises three different size scales; and the convolutional block attention module comprises a channel dimension attention module and a spatial dimension attention module.
4. The method for vehicle detection based on attention mechanism and feature weighted fusion as claimed in claim 3, wherein generating a channel attention feature map and a spatial attention feature map by using the vehicle detection model to apply a channel-and-spatial two-dimensional attention mechanism to the preprocessed image to be detected comprises:
slicing the input image to be detected through the feature extraction layer to obtain four feature-map branches at half the original scale, and splicing the feature-map branches to obtain a first feature map;
performing a convolution operation on the first feature map through the convolution layer to obtain a second feature map;
passing the second feature map through a global max pooling layer and a global average pooling layer in the channel dimension attention module to obtain a third feature map and a fourth feature map, inputting the third and fourth feature maps into a multilayer perceptron, performing element-wise addition on the outputs of the multilayer perceptron, and activating the result with an activation function to generate the channel attention feature map;
and performing channel-wise global max pooling and global average pooling on the channel attention feature map in the spatial dimension attention module to obtain a fifth feature map and a sixth feature map, concatenating (concat) the fifth and sixth feature maps along the channel dimension, reducing the channel dimension with a 7 × 7 convolution, and finally activating the result with an activation function to generate the spatial attention feature map.
5. The attention mechanism and feature weighted fusion based vehicle detection method of claim 4, wherein performing differentiated feature fusion based on the channel attention feature map and the spatial attention feature map through a weighted bidirectional feature fusion network to obtain fusion features comprises:
performing element-wise multiplication of the channel attention feature map with the second feature map to obtain a seventh feature map;
performing element-wise multiplication of the spatial attention feature map with the channel attention feature map to obtain an eighth feature map;
and learning the weights of different input features by adopting a weighted bidirectional feature fusion network in the feature fusion layer, and performing differentiated feature fusion on the seventh feature map and the eighth feature map to obtain fusion features.
6. The method for vehicle detection based on attention mechanism and feature weighted fusion of claim 5, wherein the feature fusion layer adopts a BIFPN structure.
7. The method of vehicle detection based on attention mechanism and feature-weighted fusion of claim 3, wherein obtaining a detection result including vehicle position and size information based on the fused features comprises:
detecting targets of different sizes through the three detection layers of different scales in the target detection layer based on the fused features, to obtain a detection result including vehicle position and size information.
8. The method of vehicle detection based on attention mechanism and feature-weighted fusion of claim 7, wherein the target detection layer comprises detection layers of three dimensions 160 x 160, 80 x 80, and 40 x 40.
9. The attention mechanism and feature weight fusion based vehicle detection method of claim 3, wherein the convolutional layer first obtains a first group of feature maps through a limited ordinary convolution, then generates a second group of feature maps from the first group through linear operations, and finally splices the first and second groups of feature maps along a specified dimension as the final convolution result; wherein:
the ordinary convolution operation is defined as:
Y = X · ω + b
where Y represents the result of the ordinary convolution operation; X ∈ R^(c×h×w) represents the input of the ordinary convolution operation, with c, h and w respectively denoting the number of input channels and the height and width of the feature map; ω ∈ R^(c×k×k×n) denotes that the convolution is performed with c × n convolution kernels of size k × k; and b is a bias term;
the linear operation is defined as:
Y′ = X · ω′
Y_ij = Φ_i,j(Y′_i), i ∈ [1, m], j ∈ [1, s]
where Y′ ∈ R^(h′×w′×m) represents the m intrinsic feature maps obtained by convolving the input X, with m ≤ n; h′ and w′ respectively denote the height and width of the intrinsic feature maps; the linear operations above generate the remaining n - m feature maps; Φ_i,j is the linear operation by which each intrinsic feature map generates its j-th Ghost feature map Y_ij; and s denotes the number of times each feature map in Y′ is mapped.
10. An attention mechanism and feature weighted fusion based vehicle detection apparatus, characterized in that the attention mechanism and feature weighted fusion based vehicle detection apparatus comprises:
the preprocessing module is used for preprocessing the image to be detected to obtain a preprocessed image to be detected;
the data input module is used for inputting the image to be detected after the pretreatment of the pretreatment module into a vehicle detection model which is trained in advance;
the attention feature map generation module is used for generating a channel attention feature map and a spatial attention feature map by using the vehicle detection model to apply a channel-and-spatial two-dimensional attention mechanism to the preprocessed image to be detected input by the data input module;
the feature weighting fusion module is used for performing differentiated feature fusion based on the channel attention feature map and the spatial attention feature map through a weighted bidirectional feature fusion network to obtain fusion features;
and the target detection module is used for obtaining a detection result containing vehicle position and size information based on the fusion features output by the feature weighting fusion module.
CN202111094013.6A 2021-09-17 2021-09-17 Vehicle detection method and device based on attention mechanism and feature weighting fusion Pending CN113887588A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111094013.6A CN113887588A (en) 2021-09-17 2021-09-17 Vehicle detection method and device based on attention mechanism and feature weighting fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111094013.6A CN113887588A (en) 2021-09-17 2021-09-17 Vehicle detection method and device based on attention mechanism and feature weighting fusion

Publications (1)

Publication Number Publication Date
CN113887588A (en) 2022-01-04

Family

ID=79009530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111094013.6A Pending CN113887588A (en) 2021-09-17 2021-09-17 Vehicle detection method and device based on attention mechanism and feature weighting fusion

Country Status (1)

Country Link
CN (1) CN113887588A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241792A (en) * 2022-02-28 2022-03-25 科大天工智能装备技术(天津)有限公司 Traffic flow detection method and system
CN114677597A (en) * 2022-05-26 2022-06-28 武汉理工大学 Gear defect visual inspection method and system based on improved YOLOv5 network
CN114677597B (en) * 2022-05-26 2022-10-11 武汉理工大学 Gear defect visual inspection method and system based on improved YOLOv5 network
CN116524203A (en) * 2023-05-05 2023-08-01 吉林化工学院 Vehicle target detection method based on attention and bidirectional weighting feature fusion
CN116524203B (en) * 2023-05-05 2024-06-14 吉林化工学院 Vehicle target detection method based on attention and bidirectional weighting feature fusion
CN118097624A (en) * 2024-04-23 2024-05-28 广汽埃安新能源汽车股份有限公司 Vehicle environment sensing method and device

Similar Documents

Publication Publication Date Title
CN110135267B (en) Large-scene SAR image fine target detection method
CN113887588A (en) Vehicle detection method and device based on attention mechanism and feature weighting fusion
CN111259940B (en) Target detection method based on space attention map
CN112801169B (en) Camouflage target detection method, system, device and storage medium based on improved YOLO algorithm
CN110390363A (en) A kind of Image Description Methods
CN111444968A (en) Image description generation method based on attention fusion
CN109635667A (en) A kind of vehicle detecting system based on Guided Faster-RCNN
Ding et al. A comparison: different DCNN models for intelligent object detection in remote sensing images
CN110930378A (en) Emphysema image processing method and system based on low data demand
Fan et al. A novel sonar target detection and classification algorithm
Hu et al. Saliency-based YOLO for single target detection
US20220188636A1 (en) Meta pseudo-labels
Hegde et al. Underwater marine life and plastic waste detection using deep learning and raspberry pi
CN117456480B (en) Light vehicle re-identification method based on multi-source information fusion
CN114492755A (en) Target detection model compression method based on knowledge distillation
CN117392488A (en) Data processing method, neural network and related equipment
CN117132910A (en) Vehicle detection method and device for unmanned aerial vehicle and storage medium
Xu et al. Multiscale information fusion-based deep learning framework for campus vehicle detection
Zhao et al. Recognition and Classification of Concrete Cracks under Strong Interference Based on Convolutional Neural Network.
Zhang et al. Semantic segmentation of traffic scene based on DeepLabv3+ and attention mechanism
Zhang et al. Visual image and radio signal fusion identification based on convolutional neural networks
CN116109868A (en) Image classification model construction and small sample image classification method based on lightweight neural network
CN117011219A (en) Method, apparatus, device, storage medium and program product for detecting quality of article
Wang et al. Scene recognition of road traffic accident based on an improved faster R-CNN algorithm
Liu et al. Research on Small Target Pedestrian Detection Algorithm Based on Improved YOLOv3

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination