CN116704317B - Target detection method, storage medium and computer device - Google Patents

Target detection method, storage medium and computer device

Info

Publication number
CN116704317B
CN116704317B (application CN202310998984.6A)
Authority
CN
China
Prior art keywords
module
layer
fednet
cbs
network model
Prior art date
Legal status
Active
Application number
CN202310998984.6A
Other languages
Chinese (zh)
Other versions
CN116704317A (en)
Inventor
黄军文
刘魏魏
陈兴委
李文强
Current Assignee
Shenzhen Huafu Technology Co ltd
Original Assignee
Shenzhen Huafu Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Huafu Technology Co ltd filed Critical Shenzhen Huafu Technology Co ltd
Priority to CN202310998984.6A priority Critical patent/CN116704317B/en
Publication of CN116704317A publication Critical patent/CN116704317A/en
Application granted granted Critical
Publication of CN116704317B publication Critical patent/CN116704317B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a target detection method, a storage medium and a computer device. The target detection method comprises the following steps: constructing a FEDNet network model, wherein the C3 module in the backbone network of the YOLOv5-s model is replaced by a C3LP module and the C3 module in the feature aggregation network of the YOLOv5-s model is replaced by a C3TB module to obtain the FEDNet network model; training the FEDNet network model with a training set; and detecting the picture to be detected with the trained FEDNet network model so as to identify the detected target in the picture to be detected. By this method, the target detection accuracy can be effectively improved, and the false-detection or missed-detection rate for small targets is reduced.

Description

Target detection method, storage medium and computer device
Technical Field
The present application relates to the field of target detection technologies, and in particular, to a target detection method, a storage medium, and a computer device.
Background
Currently, artificial intelligence technology is widely applied in various industries and plays a particularly important role in the field of intelligent monitoring. Fire accidents are among the safety accidents that occur most frequently in recent years and can easily cause serious economic losses. Many large-area fires arise because no fire-extinguishing equipment is readily available at the place where the fire breaks out, so it is important to arrange fire-extinguishing facilities wherever a fire may occur (particularly where live-fire construction work is carried out). However, because of low safety awareness or simple carelessness, constructors may fail to place a fire extinguisher, which can easily lead to a fire. It is therefore necessary to monitor fire extinguisher configuration in real time by deploying a detection algorithm in the monitoring device.
With the development of deep learning, object detection methods based on deep learning have achieved remarkable results, represented by the SSD (Single Shot MultiBox Detector) series and the YOLO (You Only Look Once) series, and further including FCOS, CenterNet and the like. The YOLO series has developed most rapidly and is the most widely applied: since the first YOLO version, many methods such as the successive YOLO versions, YOLOX, YOLOR, PP-YOLOv2 and the like have been developed, exploring the backbone network, the feature aggregation network, the loss function, the sample distribution strategy and other aspects, and greatly improving the detection efficiency of the models.
However, fire extinguisher detection still faces several challenges: the fire extinguisher often occupies a small area in the picture or is severely occluded, and it is usually a red cylinder that resembles other red cylindrical objects, so false detections or missed detections easily occur.
Disclosure of Invention
The application mainly provides a target detection method, a storage medium and a computer device, which are used to solve the problem that existing target detection algorithms have a high false-detection or missed-detection rate for small targets.
In order to solve the technical problems, the application adopts a technical scheme that: a target detection method is provided. The target detection method comprises the following steps: constructing a FEDNet network model, wherein the C3 module in the backbone network of the YOLOv5-s model is replaced by a C3LP module and the C3 module in the feature aggregation network of the YOLOv5-s model is replaced by a C3TB module, so as to obtain the FEDNet network model; training the FEDNet network model by adopting a training set; and detecting the picture to be detected by adopting the trained FEDNet network model so as to identify the detected target in the picture to be detected.
In some embodiments, the constructing of the FEDNet network model, in which the C3 modules in the backbone network of the YOLOv5-s model are replaced with C3LP modules and the C3 modules in the feature aggregation network of the YOLOv5-s model are replaced with C3TB modules to obtain the FEDNet network model, comprises:
Adopting a lightweight multi-layer perceptron and a parallel attention mechanism to design and generate the C3LP module, and adopting the C3LP module to replace a C3 module in the backbone network;
And constructing a TransBlock module by adopting a Transformer model, generating the C3TB module with the TransBlock module design, and adopting the C3TB module to replace the C3 module in the feature aggregation network.
In some embodiments, the C3LP module comprises a first CBS layer, a residual component, a second CBS layer, the lightweight multi-layer perceptron, a first Concat module, the parallel attention mechanism, and a third CBS layer;
The output end of the first CBS layer is connected with the input end of the residual component, the output end of the residual component is connected with one input end of the first Concat module, the output end of the second CBS layer is connected with the input end of the lightweight multi-layer perceptron, the output end of the lightweight multi-layer perceptron is connected with the other input end of the first Concat module, the output end of the first Concat module is connected with the input end of the parallel attention mechanism, and the output end of the parallel attention mechanism is connected with the input end of the third CBS layer.
In some embodiments, the lightweight multi-layer perceptron is calculated as follows:
wherein M(X) represents the lightweight multi-layer perceptron, and X represents an input feature; DW represents a depthwise convolution; GN represents Group Normalization; Drop represents a drop operation; Conv1×1 represents a 1×1 convolution; GELU denotes the GELU activation function;
The calculation formula of the parallel attention mechanism is as follows:
wherein PAM (X) represents a parallel attention mechanism; CAM represents channel attention; SAM denotes spatial attention.
In some embodiments, the C3TB module includes a fourth CBS layer, the TransBlock module, a fifth CBS layer, a second Concat module, and a sixth CBS layer, an output of the fourth CBS layer is connected to an input of the TransBlock module, an output of the TransBlock module is connected to an input of the second Concat module, an output of the fifth CBS layer is connected to another input of the second Concat module, and an output of the second Concat module is connected to an input of the sixth CBS layer.
In some embodiments, the TransBlock module calculates the following formula:
Wherein X represents an input feature; TB(X) represents the TransBlock module; Linear represents a fully connected layer; MHA represents a multi-head attention mechanism; LN represents Layer Normalization; ReLU represents the ReLU activation function;
The calculation formula of the multi-head attention mechanism is as follows:
wherein Sigmoid represents a Sigmoid activation function; ⊗ denotes a matrix product; scale is the scaling factor and drop represents a drop operation.
In some embodiments, the training the FEDNet network model using a training set includes:
acquiring pictures containing the detected target, and labeling and dividing the pictures to obtain the training set and a test set;
training the FEDNet network model by adopting the training set, wherein a SimOTA algorithm is utilized to perform dynamic positive and negative sample distribution, and a loss function is calculated by utilizing the distributed positive and negative samples so as to adjust the parameters of the FEDNet network model;
And evaluating the performance of the FEDNet network model after training by adopting the test set.
In some embodiments, the detecting the picture to be detected using the FEDNet network model includes:
And screening the target frames by using non-maximum suppression (NMS) to obtain the final detection frames.
In order to solve the technical problems, the application adopts another technical scheme that: a storage medium is provided. The storage medium has stored thereon program data which, when executed by a processor, implements the steps of the object detection method as described above.
In order to solve the technical problems, the application adopts another technical scheme that: a computer device is provided. The computer device comprises a processor and a memory connected to each other, said memory storing a computer program, said processor implementing the steps of the object detection method as described above when executing said computer program.
The beneficial effects of the application are as follows. Different from the prior art, the application discloses a target detection method, a storage medium and a computer device. Based on the YOLOv5-s model, the C3 module in the backbone network of the YOLOv5-s model is replaced by the C3LP module and the C3 module in the feature aggregation network of the YOLOv5-s model is replaced by the C3TB module to obtain the FEDNet network model, and the trained FEDNet network model is then used to perform target detection on the picture to be detected. The improved C3LP module increases the network's attention to small targets and enriches the feature information of small targets in the features, while the improved C3TB module reduces the loss of target position information during downsampling, so that the robustness of the FEDNet network model is improved, the target detection accuracy is effectively improved, and the false-detection or missed-detection rate for small targets is reduced.
Drawings
For a clearer description of the embodiments of the application or of the solutions in the prior art, the drawings needed for the description of the embodiments or of the prior art are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the application, and other drawings can be obtained from them by a person skilled in the art without inventive effort. In the drawings:
FIG. 1 is a flow chart of an embodiment of a target detection method according to the present application;
FIG. 2 is a schematic diagram of the FEDNet network model described in the embodiment of FIG. 1;
FIG. 3 is a schematic flow chart of step 10 in the embodiment of FIG. 1;
FIG. 4 is a schematic diagram of the structure of the C3LP module of the embodiment of FIG. 3;
FIG. 5 is a schematic diagram of the structure of the C3TB module in the embodiment of FIG. 3;
FIG. 6 is a schematic diagram of the multi-head attention mechanism of the C3TB module of FIG. 5;
FIG. 7 is a flow chart of step 20 in the embodiment of FIG. 1;
FIG. 8 is a schematic diagram illustrating the structure of an embodiment of a storage medium according to the present application;
Fig. 9 is a schematic structural diagram of an embodiment of a computer device provided by the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "first," "second," "third," and the like in embodiments of the present application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first", "a second", and "a third" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The application provides a target detection method, referring to fig. 1, fig. 1 is a flow chart of an embodiment of the target detection method provided by the application, the target detection method comprises:
Step 10: a FEDNet network model was constructed in which the C3 modules in the backbone network of the YOLOv5-s model were replaced with C3LP modules and the C3 modules in the feature aggregation network of the YOLOv5-s model were replaced with C3TB modules to obtain the FEDNet network model.
YOLO (You Only Look Once) is an algorithm for target detection using a convolutional neural network, and YOLOv5-s is one version of this algorithm; the target detection method protected by the application is an improvement based on the YOLOv5-s model.
The YOLOv5-s model includes an input, a backbone network (Backbone), a feature aggregation network, and a prediction structure (Prediction). The FEDNet network model is obtained by modifying the YOLOv5-s model: the C3 modules in the backbone network are replaced with C3LP modules, and the C3 modules in the feature aggregation network are replaced with C3TB modules.
Referring to fig. 2, fig. 2 is a schematic diagram illustrating the structure of the FEDNet network model described in the embodiment of fig. 1.
FIG. 2 shows the improved FEDNet network model. The improvement of the FEDNet network model is that, based on the YOLOv5-s model, the C3 modules in the backbone network are replaced with C3LP modules and the C3 modules in the feature aggregation network are replaced with C3TB modules, while the algorithm logic of the remaining modules is kept the same.
The backbone network is the network that extracts features, i.e., it extracts the information in the picture for the later networks; the feature aggregation network further fuses the features extracted by the backbone network, improving the robustness of the model; and the prediction structure (Prediction) makes predictions using the previously extracted features.
Wherein the CBS layer is a basic operation unit composed of a convolution layer (Conv), a normalization layer (BN) and an activation function layer (SiLU); Resunit is a residual component; the C3LP module is specifically a C3LP_X module, in which X serial residual components avoid the problem of performance degradation after the network is deepened, and the lightweight multi-layer perceptron (Light MLP) enhances the network's ability to capture position information; SPPF is a spatial pyramid pooling layer that enlarges the receptive field of the features through pooling kernels of different scales; Upsample is an upsampling layer, used mainly to enlarge the original feature map; Concat is a feature fusion layer that splices two features, for example, if the two input feature shapes are (b, c1, h, w) and (b, c2, h, w), the output feature shape is (b, c1+c2, h, w); the C3TB module is specifically a C3TB_X module, which mainly performs fusion through 2X serial TransBlock modules and a CBS layer.
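As a concrete illustration of the CBS unit and the Concat shape rule described above, the following PyTorch sketch shows the Conv + BN + SiLU composition and a channel-wise concatenation; the class and variable names are illustrative and not taken from the application.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, the basic operation unit described above (illustrative sketch)."""
    def __init__(self, in_ch, out_ch, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# Concat splices two feature maps along the channel axis:
# (b, c1, h, w) and (b, c2, h, w) -> (b, c1 + c2, h, w)
a = torch.randn(1, 64, 40, 40)
b = torch.randn(1, 128, 40, 40)
fused = torch.cat([a, b], dim=1)  # shape: (1, 192, 40, 40)
```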
Referring to fig. 3, fig. 3 is a schematic flow chart of step 10 in the embodiment of fig. 1.
Specifically, the step of replacing the C3 module of the backbone network with the C3LP module and the step of replacing the C3 module of the feature aggregation network with the C3TB module in step 10 may be performed as follows.
Step 11: the method comprises the steps of designing and generating a C3LP module by adopting a lightweight multi-layer perceptron and a parallel attention mechanism, and replacing the C3 module in a backbone network by adopting the C3LP module.
Referring to FIG. 4, FIG. 4 is a schematic diagram of the structure of the C3LP module in the embodiment of FIG. 3. The C3LP module includes a first CBS layer, a residual component (Resunit), a second CBS layer, a lightweight multi-layer perceptron (Light MLP), a first Concat module, a parallel attention mechanism (Parallel Attention Module, PAM), and a third CBS layer. The output end of the first CBS layer is connected with the input end of the residual component (Resunit), the output end of the residual component (Resunit) is connected with one input end of the first Concat module, the output end of the second CBS layer is connected with the input end of the lightweight multi-layer perceptron (Light MLP), the output end of the lightweight multi-layer perceptron (Light MLP) is connected with the other input end of the first Concat module, the output end of the first Concat module is connected with the input end of the parallel attention mechanism (PAM), and the output end of the parallel attention mechanism (PAM) is connected with the input end of the third CBS layer. The input ends of the first CBS layer and the second CBS layer receive the same feature map and together serve as the input end of the C3LP module, and the output end of the third CBS layer serves as the output end of the C3LP module.
The C3LP module is specifically a C3LP_X module, where X in C3LP_X represents the number of residual components, and X can be 1, 2, 3, etc. For example, a C3LP_1 module in the backbone network contains 1 set of residual components; a C3LP_2 module contains 2 sets of residual components; and a C3LP_3 module contains 3 sets of residual components.
The algorithm logic of the CBS layer, the residual component (Resunit) and the Concat module is the same as that of the original YOLOv5-s model, and will not be described again.
In this embodiment, the calculation formula of the lightweight multi-layer perceptron (Light MLP) is as follows:
wherein M(X) represents the lightweight multi-layer perceptron, and X represents an input feature; DW represents a depthwise separable convolution (depthwise convolution); GN represents Group Normalization; Drop represents a drop operation; Conv1×1 represents a 1×1 convolution; GELU denotes the GELU activation function.
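Because the formula itself is not reproduced above, the sketch below only arranges the listed operations (depthwise convolution, Group Normalization, 1×1 convolutions, GELU and a drop operation) in a common convolutional-MLP order; the exact composition, the channel expansion ratio, the number of normalization groups and the residual connection are assumptions.

```python
import torch.nn as nn

class LightMLP(nn.Module):
    """Hedged sketch of the lightweight multi-layer perceptron M(X); only the operation
    types come from the text above, the ordering and hyperparameters are assumed."""
    def __init__(self, channels, expansion=2, p_drop=0.1):
        super().__init__()
        hidden = channels * expansion
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)  # depthwise convolution
        self.gn = nn.GroupNorm(8, channels)        # Group Normalization (channels assumed divisible by 8)
        self.fc1 = nn.Conv2d(channels, hidden, 1)  # 1x1 convolution (expand)
        self.act = nn.GELU()                       # GELU activation
        self.fc2 = nn.Conv2d(hidden, channels, 1)  # 1x1 convolution (project)
        self.drop = nn.Dropout(p_drop)             # drop operation

    def forward(self, x):
        y = self.gn(self.dw(x))
        y = self.drop(self.act(self.fc1(y)))
        y = self.drop(self.fc2(y))
        return x + y  # residual connection is an assumption
```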
The Parallel Attention Mechanism (PAM) is calculated as follows:
Wherein PAM (X) represents a parallel attention mechanism, X represents an input feature; CAM represents channel attention; SAM denotes spatial attention.
The C3LP_X module combines the features extracted by the residual component (Resunit) and the lightweight multi-layer perceptron (Light MLP) and adopts a parallel attention mechanism, thereby reducing the loss of position information while the network extracts features, increasing the network's attention to small targets and the feature information of small targets in the features, avoiding the neglect of small-size target features, reducing information loss, and improving target detection accuracy.
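A wiring sketch of the C3LP_X block follows, reusing the CBS and LightMLP sketches above. Because the PAM formula is not reproduced in the text, the channel/spatial attention branches below and the way they are combined (element-wise, then added) are assumptions, and Resunit is written as a standard YOLOv5-style bottleneck.

```python
import torch
import torch.nn as nn

class PAM(nn.Module):
    """Hedged sketch of the parallel attention mechanism: channel attention (CAM) and
    spatial attention (SAM) applied to the input in parallel; summing the two
    attended branches is an assumption."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.cam = nn.Sequential(                     # channel attention branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid())
        self.sam = nn.Sequential(                     # spatial attention branch
            nn.Conv2d(channels, 1, 7, padding=3),
            nn.Sigmoid())

    def forward(self, x):
        return x * self.cam(x) + x * self.sam(x)

class Resunit(nn.Module):
    """Standard YOLOv5-style residual bottleneck (sketch)."""
    def __init__(self, ch):
        super().__init__()
        self.cbs1 = CBS(ch, ch, 1)
        self.cbs2 = CBS(ch, ch, 3)

    def forward(self, x):
        return x + self.cbs2(self.cbs1(x))

class C3LP(nn.Module):
    """Wiring of the C3LP_X block as described: one branch CBS -> X Resunits,
    the other branch CBS -> Light MLP, then Concat, PAM and a final CBS."""
    def __init__(self, in_ch, out_ch, x=1):
        super().__init__()
        hidden = out_ch // 2
        self.cbs1 = CBS(in_ch, hidden)
        self.res = nn.Sequential(*[Resunit(hidden) for _ in range(x)])  # X serial residual components
        self.cbs2 = CBS(in_ch, hidden)
        self.mlp = LightMLP(hidden)
        self.pam = PAM(out_ch)
        self.cbs3 = CBS(out_ch, out_ch)

    def forward(self, x):
        a = self.res(self.cbs1(x))     # first CBS -> residual components
        b = self.mlp(self.cbs2(x))     # second CBS -> lightweight multi-layer perceptron
        y = torch.cat([a, b], dim=1)   # first Concat
        return self.cbs3(self.pam(y))  # parallel attention -> third CBS
```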
Step 12: and constructing TransBlock modules by adopting a transducer model, adopting TransBlock module design to generate C3TB modules, and adopting the C3TB modules to replace the C3 modules in the feature aggregation network.
Referring to fig. 5, fig. 5 is a schematic structural diagram of the C3TB module in the embodiment of fig. 3. The C3TB module comprises a fourth CBS layer, transBlock modules, a fifth CBS layer, a second Concat module and a sixth CBS layer, wherein the output end of the fourth CBS layer is connected with the input end of the TransBlock module, the output end of the TransBlock module is connected with one input end of the second Concat module, the output end of the fifth CBS layer is connected with the other input end of the second Concat module, the output end of the second Concat module is connected with the input end of the sixth CBS layer, the input ends of the fourth CBS layer and the fifth CBS layer are used for inputting the same characteristic diagram and serving as the input end of the C3TB module, and the output end of the sixth CBS layer is used as the output end of the C3TB module.
The C3TB module is specifically a C3TB_X module, where X in C3TB_X means that the number of TransBlock modules is 2X, and X can be 1, 2, 3, etc.; for example, a C3TB_1 module in the feature aggregation network contains 2 TransBlock modules.
In this embodiment, the calculation formula of the TransBlock module is as follows:
Wherein X represents an input feature; TB(X) represents the TransBlock module; Linear represents a fully connected layer; MHA represents a multi-head attention mechanism; LN represents Layer Normalization; ReLU represents the ReLU activation function.
Referring to FIG. 6, FIG. 6 is a schematic diagram of the multi-head attention mechanism in the C3TB module of FIG. 5.
The calculation formula of the multi-head attention Mechanism (MHA) is as follows:
wherein Sigmoid represents a Sigmoid activation function; ⊗ denotes a matrix product; scale is the scaling factor and drop represents a drop operation.
By adopting the Transformer model to construct the TransBlock module, the attention mechanism in the Transformer model is used to weight the features so as to extract the relationships between regions and to strengthen the relation between the position features in the shallow feature maps and the category features in the deep feature maps. The resulting C3TB module can therefore further improve the robustness of the model, reduce information loss, avoid neglecting small-size target features, and improve target detection accuracy.
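The sketch below wires a TransBlock and the C3TB_X block as described. Since the TransBlock and multi-head attention formulas are not reproduced above, the standard nn.MultiheadAttention is used here as a stand-in for the patent's MHA variant (which lists a Sigmoid activation, a scaling factor and a drop operation), and the feed-forward sizes are assumptions; CBS is the unit sketched earlier.

```python
import torch
import torch.nn as nn

class TransBlock(nn.Module):
    """Hedged TransBlock sketch: Layer Normalization, multi-head attention and a small
    Linear/ReLU feed-forward with residual connections, operating on the flattened
    feature map as a token sequence (channels assumed divisible by the head count)."""
    def __init__(self, channels, heads=4, mlp_ratio=2):
        super().__init__()
        self.ln1 = nn.LayerNorm(channels)
        self.mha = nn.MultiheadAttention(channels, heads, batch_first=True)  # stand-in MHA
        self.ln2 = nn.LayerNorm(channels)
        self.ffn = nn.Sequential(
            nn.Linear(channels, channels * mlp_ratio),
            nn.ReLU(inplace=True),
            nn.Linear(channels * mlp_ratio, channels))

    def forward(self, x):                      # x: (b, c, h, w)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)       # (b, h*w, c) token sequence
        y = self.ln1(t)
        t = t + self.mha(y, y, y, need_weights=False)[0]
        t = t + self.ffn(self.ln2(t))
        return t.transpose(1, 2).reshape(b, c, h, w)

class C3TB(nn.Module):
    """Wiring of the C3TB_X block as described: one branch CBS -> 2X TransBlocks,
    the other branch a single CBS, then Concat and a final CBS."""
    def __init__(self, in_ch, out_ch, x=1):
        super().__init__()
        hidden = out_ch // 2
        self.cbs4 = CBS(in_ch, hidden)
        self.tb = nn.Sequential(*[TransBlock(hidden) for _ in range(2 * x)])  # 2X serial TransBlocks
        self.cbs5 = CBS(in_ch, hidden)
        self.cbs6 = CBS(out_ch, out_ch)

    def forward(self, x):
        a = self.tb(self.cbs4(x))                    # fourth CBS -> TransBlock chain
        b = self.cbs5(x)                             # fifth CBS
        return self.cbs6(torch.cat([a, b], dim=1))   # second Concat -> sixth CBS
```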
Step 20: and training the FEDNet network model by adopting a training set.
After the FEDNet network model is built, it needs to be trained so that it learns to identify the detected target, such as a mask, a pedestrian or a fire extinguisher.
In this embodiment, the detected target is a fire-fighting tool, specifically a fire extinguisher; for example, the FEDNet network model identifies whether the fire-fighting equipment at each fire-prevention site meets the requirements.
Referring to fig. 7, fig. 7 is a schematic flow chart of step 20 in the embodiment of fig. 1. Specifically, the step of training FEDNet the network model using the training set in step 20 may be performed as follows.
Step 21: and acquiring a picture containing the detected target, and carrying out data marking and dividing on the picture to obtain a training set and a testing set.
For example, if the detected target is a fire extinguisher, acquiring fire extinguisher pictures in various scenes, and marking and dividing data to obtain a training set and a testing set.
Specifically, fire extinguisher pictures in various scenes can be obtained through the network or by self-shooting, and then screened; all the pictures are annotated with the labeling tool labelImg, the annotation content being the position of the fire extinguisher in the picture marked with a rectangular frame, and the labels are stored in xml files in the PASCAL VOC format.
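For illustration, one labelImg annotation in PASCAL VOC format can be read back as in the following sketch; the function name is illustrative and the field names follow the VOC schema.

```python
import xml.etree.ElementTree as ET

def parse_voc_xml(xml_path):
    """Return the rectangular boxes stored in a single labelImg / PASCAL VOC xml file."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        name = obj.find("name").text  # the class label written during annotation
        bb = obj.find("bndbox")
        xmin, ymin = int(bb.find("xmin").text), int(bb.find("ymin").text)
        xmax, ymax = int(bb.find("xmax").text), int(bb.find("ymax").text)
        boxes.append((name, xmin, ymin, xmax, ymax))
    return boxes
```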
Optionally, if the detected object is a vehicle, vehicle pictures in various scenes can be obtained; the marking tool may also be other types of marking tools known in the art, as the application is not limited in this regard.
In this embodiment, the fire extinguisher pictures are divided into a training set and a test set at a ratio of 4:1; the obtained pictures may also be divided at a ratio of 5:1 or 6:1.
Step 22: and training the FEDNet network model by adopting a training set, wherein a SimOTA algorithm is utilized to dynamically allocate positive and negative samples, and a loss function is calculated by utilizing the allocated positive and negative samples so as to adjust the parameters of the FEDNet network model.
The position of the detected target in each picture of the training set is marked with a rectangular frame. After the prediction structure (Prediction) in the FEDNet network model produces its predictions, the SimOTA algorithm performs dynamic positive and negative sample distribution: predictions that correctly match a target frame are positive samples, and incorrect predictions are negative samples. Positive and negative sample distribution is carried out in every training iteration, and the loss function is calculated with the distributed positive and negative samples so as to adjust the parameters of the FEDNet network model.
Specifically, the loss function calculation formula is as follows:
wherein the classification loss and the confidence loss are both obtained by calculating the binary cross-entropy loss; the regression loss of the target frame is calculated with the GIoU function; three hyperparameters balance the three loss terms; IoU denotes the intersection-over-union of two rectangular boxes, and the GIoU term additionally uses the area of the minimum enclosing rectangle of the two boxes and the area of their intersection.
In the present embodiment, the three balancing hyperparameters are set to 2.0, 1.0 and 5.0, respectively. In other embodiments, they may be set as needed.
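The weighted loss described above can be sketched as follows; pairing the weights 2.0, 1.0 and 5.0 with the classification, confidence and box-regression terms follows the order in which they are listed and is an assumption, and torchvision's GIoU loss is used for the regression term.

```python
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def detection_loss(cls_pred, cls_tgt, obj_pred, obj_tgt, box_pred, box_tgt,
                   w_cls=2.0, w_conf=1.0, w_box=5.0):
    """Hedged sketch: binary cross-entropy for classification and confidence,
    GIoU loss for the target-frame regression, combined with balancing weights."""
    l_cls = F.binary_cross_entropy_with_logits(cls_pred, cls_tgt)
    l_conf = F.binary_cross_entropy_with_logits(obj_pred, obj_tgt)
    l_box = generalized_box_iou_loss(box_pred, box_tgt, reduction="mean")  # frames as (x1, y1, x2, y2)
    return w_cls * l_cls + w_conf * l_conf + w_box * l_box
```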
After the loss function calculation is completed, parameters of the FEDNet network model can be guided to be adjusted so as to improve the identification accuracy of the FEDNet network model.
Step 23: and evaluating the performance of the FEDNet network model after training by adopting a testing set.
After the FEDNet network model has been trained on the training set, the performance of the trained FEDNet network model is evaluated on the test set to determine whether it meets the requirements; the performance indicators may include accuracy, computation amount, number of model parameters, picture processing rate, and the like.
Step 30: and detecting the picture to be detected by adopting the trained FEDNet network model so as to identify the detected target in the picture to be detected.
The trained FEDNet network model performs target detection on the picture to be detected; after the prediction structure (Prediction) produces its predictions, non-maximum suppression (NMS) is applied directly to screen the target frames and obtain the final detection frames.
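A minimal post-processing sketch, assuming axis-aligned frames in (x1, y1, x2, y2) form and using the confidence and IoU thresholds reported later for testing (0.001 and 0.6):

```python
from torchvision.ops import nms

def postprocess(boxes, scores, conf_thres=0.001, iou_thres=0.6):
    """Keep frames above the confidence threshold, then apply non-maximum suppression."""
    keep = scores > conf_thres
    boxes, scores = boxes[keep], scores[keep]
    idx = nms(boxes, scores, iou_thres)  # indices of the frames retained by NMS
    return boxes[idx], scores[idx]
```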
When the target detection method is verified, the verification experiment uses a self-built dataset with 5466 samples in total, of which 4376 samples form the training set and 1090 samples form the test set. In the experiment, the model is built, trained and tested under the PyTorch framework, and CUDA and cuDNN are used to accelerate computation.
Model training and testing for the target detection method are completed under the Ubuntu 16.04 operating system; the processor is an Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz with 32 GB of memory, and the graphics card is a 16 GB NVIDIA Tesla V100-PCIE GPU. During training, the number of epochs is set to 1000 and the batch size to 16; the SGD optimizer is used with an initial learning rate of 0.01, a minimum learning rate of 0.00005 and a cosine-annealing learning-rate decay strategy; the weight decay coefficient is 0.937, the confidence threshold is 0.001, and the IoU threshold is 0.6.
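The optimizer and learning-rate schedule described above can be set up roughly as follows; `model` and `train_loader` are assumed to exist, the training-loss interface of `model` is an assumption, and treating 0.937 as the SGD momentum (rather than as weight decay) follows the usual YOLOv5 convention and is likewise an assumption.

```python
import torch

# Hedged sketch of the training schedule: SGD, initial lr 0.01, cosine annealing
# down to 0.00005, 1000 epochs, batch size 16 (set in the DataLoader).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000, eta_min=0.00005)

for epoch in range(1000):
    for images, targets in train_loader:
        loss = model(images, targets)  # assumed to return the training loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```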
To verify the effectiveness of the present target detection method, experiments were performed on different methods using the same dataset on the same hardware: YOLOv3, YOLOv4, YOLOv5-s, YOLOX and the method provided herein.
The experimental results are shown in the table below. According to these results, the target detection method provided by the application can effectively detect the position of the fire extinguisher in the picture: the mAP0.5 of the FEDNet network model reaches 94.0%, which is 1.4% and 0.2% higher than YOLOv5-s and YOLOX, respectively.
Model    | Model parameters | FLOPs  | mAP0.5/% | mAP0.5:0.95/% | FPS
YOLOv3   | 61.49M           | 154.5G | 92.3     | 63.8          | 60
YOLOv4   | 67.8M            | 139.7G | 92.5     | 64.0          | 55
YOLOv5-s | 7.01M            | 15.8G  | 92.6     | 64.4          | 137
YOLOX    | 8.01M            | 21.3G  | 93.8     | 68.0          | 119
FEDNet   | 8.15M            | 18.3G  | 94.0     | 68.0          | 110
Model parameters: the number of all learnable parameters in the model, which mainly reflects the memory occupied by the model; the larger the number of parameters, the more memory is occupied.
FLOPs (floating point operations): the amount of computation in the model's forward propagation, used to measure the complexity of the model.
mAP0.5:0.95: the accuracy of the detection frame position, averaged over IoU thresholds from 0.5 to 0.95.
mAP0.5: the accuracy of the detection frame class, at an IoU threshold of 0.5.
FPS (Frames Per Second): the number of pictures processed per second by the model.
It can be seen that the various indicators of the FEDNet network model, such as the number of model parameters, FLOPs, mAP0.5:0.95, mAP0.5 and FPS, all rank near the front.
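For reference, the model-parameter column above corresponds to the count of learnable parameters, which can be obtained as in this sketch (`model` is assumed to be the constructed FEDNet network):

```python
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{num_params / 1e6:.2f}M learnable parameters")
```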
Referring to fig. 8, fig. 8 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
The storage medium 40 stores program data 41, which program data 41, when executed by a processor, implements the steps of the object detection method as described in fig. 1.
The program data 41 is stored in a storage medium 40 comprising instructions for causing a network device (which may be a router, personal computer, server, etc.) or processor to perform all or part of the steps of the method according to the various embodiments of the application.
Alternatively, the storage medium 40 may be a usb disk, a removable hard disk, a read-only memory (ROM), a random-access memory (RAM), a magnetic disk, or an optical disk, or other various media that can store the program data 41.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
The computer device 50 comprises a processor 52 and a memory 51, which are interconnected, the memory 51 storing a computer program, which, when executed by the processor 52, implements the steps of the object detection method as described above.
The application discloses a target detection method, a storage medium and a computer device, which are different from the prior art. Based on the YOLOv5-s model, the C3 module in the backbone network of the YOLOv5-s model is replaced by the C3LP module and the C3 module in the feature aggregation network of the YOLOv5-s model is replaced by the C3TB module to obtain the FEDNet network model, and the trained FEDNet network model is then used to perform target detection on the picture to be detected. The parallel attention mechanism in the network increases the network's attention to small targets and enriches the feature information of small targets in the features; the lightweight multi-layer perceptron and the Transformer structure can capture long-range dependencies and reduce the loss of target position information during downsampling. The robustness of the FEDNet network is thereby improved, the accuracy of target detection is effectively improved, and the detection accuracy for small targets is improved.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the storage medium embodiments and the electronic device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the above-described device embodiments are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing description is only illustrative of the present application and is not intended to limit the scope of the application, and all equivalent structures or equivalent processes or direct or indirect application in other related technical fields are included in the scope of the present application.

Claims (7)

1. A method of detecting an object, comprising:
Constructing a FEDNet network model, wherein a C3 module in a backbone network of the YOLOv5-s model is replaced by a C3LP module, a C3 module in a feature aggregation network of the YOLOv5-s model is replaced by a C3TB module, and the C3TB module is formed by constructing a TransBlock module using a Transformer model and designing with the TransBlock module, so as to obtain the FEDNet network model;
training the FEDNet network model by adopting a training set;
detecting a picture to be detected by adopting the trained FEDNet network model so as to identify a detected target in the picture to be detected;
The C3LP module comprises a first CBS layer, a residual component, a second CBS layer, a lightweight multi-layer perceptron, a first Concat module, a parallel attention mechanism and a third CBS layer, wherein the output end of the first CBS layer is connected with the input end of the residual component, the output end of the residual component is connected with one input end of the first Concat module, the output end of the second CBS layer is connected with the input end of the lightweight multi-layer perceptron, the output end of the lightweight multi-layer perceptron is connected with the other input end of the first Concat module, the output end of the first Concat module is connected with the input end of the parallel attention mechanism, and the output end of the parallel attention mechanism is connected with the input end of the third CBS layer;
the calculation formula of the lightweight multi-layer perceptron is as follows:
wherein M(X) represents the lightweight multi-layer perceptron, and X represents an input feature; DW represents a depthwise convolution; GN represents Group Normalization; Drop represents a drop operation; Conv1×1 represents a 1×1 convolution; GELU denotes the GELU activation function;
The calculation formula of the parallel attention mechanism is as follows:
wherein PAM (X) represents a parallel attention mechanism; CAM represents channel attention; SAM denotes spatial attention.
2. The method of claim 1, wherein the C3TB module includes a fourth CBS layer, the TransBlock module, a fifth CBS layer, a second Concat module, and a sixth CBS layer, an output of the fourth CBS layer is connected to an input of the TransBlock module, an output of the TransBlock module is connected to an input of the second Concat module, an output of the fifth CBS layer is connected to another input of the second Concat module, and an output of the second Concat module is connected to an input of the sixth CBS layer.
3. The target detection method according to claim 2, wherein the TransBlock module has a calculation formula as follows:
Wherein X represents an input feature; TB(X) represents the TransBlock module; Linear represents a fully connected layer; MHA represents a multi-head attention mechanism; LN represents Layer Normalization; ReLU represents the ReLU activation function;
The calculation formula of the multi-head attention mechanism is as follows:
wherein Sigmoid represents a Sigmoid activation function; ⊗ denotes a matrix product; scale is the scaling factor and drop represents a drop operation.
4. The method of claim 1, wherein training the FEDNet network model using a training set comprises:
acquiring pictures containing the detected target, and labeling and dividing the pictures to obtain the training set and a test set;
training the FEDNet network model by adopting the training set, wherein a SimOTA algorithm is utilized to perform dynamic positive and negative sample distribution, and a loss function is calculated by utilizing the distributed positive and negative samples so as to adjust the parameters of the FEDNet network model;
And evaluating the performance of the FEDNet network model after training by adopting the test set.
5. The method for detecting a target according to claim 1, wherein the detecting the picture to be detected using the trained FEDNet network model includes:
And screening the target frames by using non-maximum suppression (NMS) to obtain the final detection frames.
6. A storage medium having stored thereon program data, which when executed by a processor, implements the steps of the object detection method according to any of claims 1-5.
7. A computer device comprising a processor and a memory connected to each other, the memory storing a computer program, the processor implementing the steps of the object detection method according to any one of claims 1-5 when executing the computer program.
CN202310998984.6A 2023-08-09 2023-08-09 Target detection method, storage medium and computer device Active CN116704317B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310998984.6A CN116704317B (en) 2023-08-09 2023-08-09 Target detection method, storage medium and computer device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310998984.6A CN116704317B (en) 2023-08-09 2023-08-09 Target detection method, storage medium and computer device

Publications (2)

Publication Number Publication Date
CN116704317A CN116704317A (en) 2023-09-05
CN116704317B true CN116704317B (en) 2024-04-19

Family

ID=87841927

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310998984.6A Active CN116704317B (en) 2023-08-09 2023-08-09 Target detection method, storage medium and computer device

Country Status (1)

Country Link
CN (1) CN116704317B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033098A (en) * 2021-03-26 2021-06-25 山东科技大学 Ocean target detection deep learning model training method based on AdaRW algorithm
CN114758255A (en) * 2022-04-02 2022-07-15 桂林电子科技大学 Unmanned aerial vehicle detection method based on YOLOV5 algorithm
CN115240119A (en) * 2022-08-10 2022-10-25 哈尔滨航天恒星数据***科技有限公司 Pedestrian small target detection method in video monitoring based on deep learning
CN115471838A (en) * 2022-09-22 2022-12-13 闽江学院 Cervical squamous lesion cell detection method based on depth self-adaptive feature extraction
CN115953625A (en) * 2022-12-16 2023-04-11 浙江省交通投资集团有限公司智慧交通研究分公司 Vehicle detection method based on characteristic diagram double-axis Transformer module
CN116503709A (en) * 2023-04-13 2023-07-28 长春工业大学 Vehicle detection method based on improved YOLOv5 in haze weather
CN116542932A (en) * 2023-05-08 2023-08-04 江苏大学 Injection molding surface defect detection method based on improved YOLOv5s


Also Published As

Publication number Publication date
CN116704317A (en) 2023-09-05

Similar Documents

Publication Publication Date Title
Li et al. Adaptively constrained dynamic time warping for time series classification and clustering
CN110660052B (en) Hot-rolled strip steel surface defect detection method based on deep learning
WO2021184727A1 (en) Data abnormality detection method and apparatus, electronic device and storage medium
CN110348384B (en) Small target vehicle attribute identification method based on feature fusion
CN111626350A (en) Target detection model training method, target detection method and device
Widiyanto et al. Implementation of convolutional neural network method for classification of diseases in tomato leaves
CN110569738A (en) natural scene text detection method, equipment and medium based on dense connection network
CN109271441B (en) High-dimensional data visual clustering analysis method and system
CN108805833A (en) Miscellaneous minimizing technology of copybook binaryzation ambient noise of network is fought based on condition
CN108428324A (en) The detection device of smog in a kind of fire scenario based on convolutional network
CN110852881A (en) Risk account identification method and device, electronic equipment and medium
CN112464701A (en) Method for detecting whether people wear masks or not based on light weight characteristic fusion SSD
CN112308825B (en) SqueezeNet-based crop leaf disease identification method
CN110472268A (en) A kind of bridge monitoring data modality recognition methods and device
CN113688709A (en) Intelligent detection method, system, terminal and medium for wearing safety helmet
CN114783021A (en) Intelligent detection method, device, equipment and medium for wearing of mask
CN113655479A (en) Small sample SAR target classification method based on deformable convolution and double attention
CN116824335A (en) YOLOv5 improved algorithm-based fire disaster early warning method and system
Guo et al. DF-SSD: a deep convolutional neural network-based embedded lightweight object detection framework for remote sensing imagery
CN113963333B (en) Traffic sign board detection method based on improved YOLOF model
CN114494777A (en) Hyperspectral image classification method and system based on 3D CutMix-transform
CN116704317B (en) Target detection method, storage medium and computer device
CN116503399B (en) Insulator pollution flashover detection method based on YOLO-AFPS
CN110443346B (en) Model interpretation method and device based on importance of input features
CN116188940A (en) Method and device for training model, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant