CN111832513A - Real-time football target detection method based on neural network

Info

Publication number
CN111832513A
Authority
CN
China
Prior art keywords
layer
target detection
network
module
yolov4
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010705052.4A
Other languages
Chinese (zh)
Other versions
CN111832513B (en)
Inventor
段育松
姬红兵
张文博
李晓颖
李林
臧博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Broccoli Education Technology Co ltd
Xidian University
Original Assignee
Xi'an Broccoli Education Technology Co ltd
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Broccoli Education Technology Co ltd, Xidian University filed Critical Xi'an Broccoli Education Technology Co ltd
Priority to CN202010705052.4A priority Critical patent/CN111832513B/en
Publication of CN111832513A publication Critical patent/CN111832513A/en
Application granted granted Critical
Publication of CN111832513B publication Critical patent/CN111832513B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/20 Scenes; Scene-specific elements in augmented reality scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention discloses a real-time football target detection method based on a neural network, which mainly addresses the low speed and low accuracy of existing football target detection. The scheme is as follows: 1) construct the football target detection network YOLOv4; 2) construct a football target training data set; 3) obtain the prior (anchor) box sizes of the constructed training data set and substitute them for the prior box sizes in the target detection network YOLOv4; 4) perform data augmentation on the training data set; 5) train the target detection network YOLOv4 with the augmented data set; 6) input the football video to be detected into the trained YOLOv4 football target detection network for detection and labeling, and output the football target detection result. The invention strengthens the recognition and localization capability of the network, improves the detection speed and accuracy for football targets, guarantees real-time football target detection, and can be used for human-computer interaction, sports events, live broadcasting and motion analysis.

Description

Real-time football target detection method based on neural network
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a football target detection method that can be used for human-computer interaction, sports events and motion analysis.
Background
Football target detection uses computer vision techniques to judge whether a football target is present in an image or video sequence and to give its accurate location. The technique can be applied to human-computer interaction, sports events, live broadcasting and motion analysis. After more than ten years of development, the training libraries of football target detection systems have grown large-scale, their detection accuracy has approached practical levels, and their detection speed has approached real time. Traditional football target detection methods focus mainly on hand-crafted feature extraction, feature-classifier learning and post-processing. Because football match videos are diverse and complex (the ball moves fast, recording quality cannot be guaranteed, the ball undergoes large forces and travels long distances, footballs in the video vary in size, and the ball is easily occluded by players and referees), the traditional methods achieve low accuracy. Moreover, limited by the computer hardware of the past, football target detection was mainly image-based: it only had to detect whether a football was present in an image, and it could hardly address real-time football target detection.
Existing real-time neural-network football target detection methods are divided into one-stage and two-stage approaches. The one-stage target detection algorithm SSD was proposed by Wei Liu in the paper "SSD: Single Shot MultiBox Detector" published at ECCV 2016. The method discretizes the output space of bounding boxes into a set of default boxes with different aspect ratios at each feature map location. During prediction, the network generates a score for each object class in each default box and adjusts the box to better match the object shape. In addition, the network combines predictions from multiple feature maps of different resolutions, naturally handling objects of various sizes. The drawback is that during SSD training, a prior (default) box is put into the network for training only if its IoU with the ground-truth box reaches 0.5. The region of a large target is much bigger, so it contains more prior boxes and can be trained adequately; conversely, a small target yields far fewer prior boxes for training and cannot be trained sufficiently. Therefore the SSD has insufficient detection accuracy and inaccurate localization for small targets.
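For reference, the IoU used in this matching criterion is the ratio of the intersection area to the union area of two boxes. Below is a minimal sketch; the [x1, y1, x2, y2] corner format is an assumption made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prior box is matched to a ground-truth box only when IoU >= 0.5:
print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # ~0.143, so no match
```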
The two-stage target detection algorithm Faster R-CNN was proposed in the IEEE paper "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks" published in 2017. The method comprises two modules: the first is a deep fully convolutional Region Proposal Network (RPN) used to generate region proposals; the second is the Fast R-CNN detector, which uses the region proposals generated by the RPN for detection. The RPN shares full-image convolutional features with the detection network, achieving nearly cost-free region proposals. The drawback is that Faster R-CNN trains in two stages, so its target detection speed is low and the real-time performance of target detection cannot be guaranteed.
Disclosure of Invention
The invention aims, in view of the shortcomings of the prior art, to provide a real-time football target detection method based on a neural network, so as to improve the accuracy and speed of football target detection and guarantee its real-time performance.
To achieve this purpose, the technical scheme of the invention comprises the following steps:
(1) constructing the football target detection network YOLOv4:
(1a) constructing the backbone network CSPDarknet53 of the target detection network YOLOv4;
(1b) constructing the neck network PANet of the target detection network YOLOv4;
(1c) constructing the Head network YOLO Head of the target detection network YOLOv4;
(2) constructing a training data set:
(2a) collecting at least 3000 images that contain a football target and have a resolution of not less than 608 × 608;
(2b) manually marking the football bounding box in each image containing a football, generating annotation files in one-to-one correspondence with the collected images;
(2c) forming the training data set from the collected images and the annotation files;
(3) training the target detection network YOLOv4:
(3a) configuring the target detection network YOLOv4 environment;
(3b) downloading the pre-training weight file YOLOv4.conv.137 of the target detection network YOLOv4;
(3c) obtaining the prior (anchor) box sizes of the constructed data set with the k-means clustering method, and updating the prior box sizes in the target detection network YOLOv4;
(3d) inputting the training data set and augmenting it with the CutMix method;
(3e) loading the pre-training weight file YOLOv4.conv.137 onto the target detection network YOLOv4 constructed in step (1) with a transfer learning method, obtaining the loaded target detection network YOLOv4;
(3f) training on the training data set constructed in step (2) with the loaded target detection network YOLOv4, obtaining the trained YOLOv4 football target detection network;
(4) collecting a video containing a football target, inputting it into the trained YOLOv4 football target detection network for detection and labeling, and outputting the video with the football pixels labeled, obtaining the football target detection result.
Compared with the prior art, the invention has the following advantages:
Firstly, because the invention constructs a YOLOv4 football target detection network that combines the backbone network CSPDarknet53, the neck network PANet and the Head network YOLO Head, and introduces a feature pyramid module into the neck network PANet, it detects targets on three feature maps of different scales by sampling and fusing features from different layers, exploiting the high resolution of low-level features and the semantic information of high-level features, and assigns accurate anchor boxes to the feature maps of each scale, thereby improving the accuracy of football target detection.
Secondly, the invention augments the input data with the CutMix method, which improves network training efficiency, strengthens the recognition and localization capability of the network, and further improves its generalization.
Thirdly, the constructed data set is trained with a transfer learning method, so that the target detection network can quickly learn the high-dimensional features of the data set; this reduces the time complexity of football target detection, further improves the detection speed, and guarantees real-time football target detection.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a schematic diagram of the YOLOv4 network framework in the present invention;
fig. 3 is a schematic structural diagram of the CSPDarknet53 network in the present invention;
FIG. 4 is a graph of the results of training on a constructed data set using the present invention;
FIG. 5 is a graph of the results of an experiment using the present invention to perform soccer detection on a soccer video;
FIG. 6 shows the frames-per-second (FPS) test result for football video detection using the present invention.
Detailed Description
The embodiments and effects of the present invention will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, the implementation steps of this embodiment are as follows.
Step 1, constructing a football target detection network YOLOv4.
Referring to fig. 2, the football target detection network YOLOv4 comprises a backbone network CSPDarknet53, a neck network PANet and a Head network YOLO Head, and is implemented as follows:
1.1) building the backbone network CSPDarknet53 of the target detection network YOLOv4:
Referring to fig. 3, the structural relationship of the backbone network CSPDarknet53 is: input layer → first convolutional layer → second convolutional layer → first combination module → third convolutional layer → fourth convolutional layer → second combination module → fifth convolutional layer → sixth convolutional layer → third combination module → seventh convolutional layer → eighth convolutional layer → fourth combination module → ninth convolutional layer → tenth convolutional layer → fifth combination module → eleventh convolutional layer. Wherein:
the first to eleventh convolutional layers are CBM convolutional layers, i.e. each consists of a Conv convolutional layer, a Bn batch normalization layer and a Mish activation function layer;
the first and second convolutional layers have 32 and 64 channels respectively, 3 × 3 convolution kernels, and strides 1 and 2 respectively;
the third and fourth convolutional layers have 64 and 128 channels respectively, 1 × 1 and 3 × 3 convolution kernels respectively, and strides 1 and 2 respectively;
the fifth and sixth convolutional layers have 128 and 256 channels respectively, 1 × 1 and 3 × 3 convolution kernels respectively, and strides 1 and 2 respectively;
the seventh and eighth convolutional layers have 256 and 512 channels respectively, 1 × 1 and 3 × 3 convolution kernels respectively, and strides 1 and 2 respectively;
the ninth and tenth convolutional layers have 512 and 1024 channels respectively, 1 × 1 and 3 × 3 convolution kernels respectively, and strides 1 and 2 respectively;
the eleventh convolutional layer has 1024 channels, a 3 × 3 convolution kernel, and stride 2;
the first combination module is formed by splicing three CBM convolutional layers and one CSP residual module; each CBM convolutional layer consists of a convolutional layer with 64 channels, a 1 × 1 kernel and stride 1, a batch normalization layer and a Mish activation function layer; the CSP residual module consists of two sequentially connected CBM convolutional layers with 32 and 64 channels respectively, 1 × 1 and 3 × 3 kernels respectively, and stride 1;
the second combination module is formed by splicing three CBM convolutional layers and two CSP residual modules; each CBM convolutional layer consists of a convolutional layer with 64 channels, a 1 × 1 kernel and stride 1, a batch normalization layer and a Mish activation function layer; each CSP residual module consists of two sequentially connected CBM convolutional layers with 64 channels, 1 × 1 and 3 × 3 kernels respectively, and stride 1;
the third combination module is formed by splicing three CBM convolutional layers and eight CSP residual modules; each CBM convolutional layer consists of a convolutional layer with 128 channels, a 1 × 1 kernel and stride 1, a batch normalization layer and a Mish activation function layer; each CSP residual module consists of two sequentially connected CBM convolutional layers with 128 channels, 1 × 1 and 3 × 3 kernels respectively, and stride 1;
the fourth combination module is formed by splicing three CBM convolutional layers and eight CSP residual modules; each CBM convolutional layer consists of a convolutional layer with 256 channels, a 1 × 1 kernel and stride 1, a batch normalization layer and a Mish activation function layer; each CSP residual module consists of two sequentially connected CBM convolutional layers with 256 channels, 1 × 1 and 3 × 3 kernels respectively, and stride 1;
the fifth combination module is formed by splicing three CBM convolutional layers and four CSP residual modules; each CBM convolutional layer consists of a convolutional layer with 512 channels, a 1 × 1 kernel and stride 1, a batch normalization layer and a Mish activation function layer; each CSP residual module consists of two sequentially connected CBM convolutional layers with 512 channels, 1 × 1 and 3 × 3 kernels respectively, and stride 1;
the Mish activation function is expressed as: mix ═ x × tanh (ln (1+ e)x) In which x represents the output of the previous layer, exX to the power of e, ln (1+ e)x) Denotes base e (1+ e)x) Tan h is a hyperbolic tangent function. The Mish activation function allows small negative gradient flow when the value is negative, so that the flow of information is ensured, and the problem of gradient saturation does not exist.
1.2) building the neck network PANet of the target detection network YOLOv4:
the structure of the neck network PANet is, in sequence: 1st combination module → 2nd combination module → 3rd combination module → 4th combination module;
the parameters of each module are set as follows:
the 1st combination module consists of the 1st stacking module and a convolutional layer; the 1st stacking module stacks the 38 × 38 × 256 effective feature layer obtained by convolving and upsampling the output of the spatial pyramid structure with the 38 × 38 × 256 effective feature layer of the backbone network; after stacking, the feature map size is unchanged and the number of channels doubles; the convolutional layer is five alternating 1 × 1 and 3 × 3 convolutions;
the 2nd combination module consists of the 2nd stacking module and a convolutional layer; the 2nd stacking module stacks the 76 × 76 × 128 feature layer obtained by convolving and upsampling the output of the 1st combination module with the 76 × 76 × 128 effective feature layer of the backbone network; after stacking, the feature map size is unchanged and the number of channels doubles; the convolutional layer is five alternating 1 × 1 and 3 × 3 convolutions;
the 3rd combination module consists of the 3rd stacking module and a convolutional layer; the 3rd stacking module stacks the 38 × 38 × 256 feature layer output by the 1st combination module with the downsampled 38 × 38 × 256 feature layer output by the 2nd combination module; after stacking, the feature map size is unchanged and the number of channels doubles; the convolutional layer is five alternating 1 × 1 and 3 × 3 convolutions;
the 4th combination module consists of the 4th stacking module and a convolutional layer; the 4th stacking module stacks the downsampled 19 × 19 × 512 feature layer output by the 3rd combination module with the 19 × 19 × 512 feature layer output by the spatial pyramid structure; after stacking, the feature map size is unchanged and the number of channels doubles; the convolutional layer is five alternating 1 × 1 and 3 × 3 convolutions;
the CBL convolutional layer consists of a Conv convolution, Bn batch normalization and a Leaky_ReLU activation function;
the spatial pyramid structure performs multi-scale fusion by max pooling with 1 × 1, 5 × 5, 9 × 9 and 13 × 13 kernels, as sketched below.
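A sketch of such a spatial pyramid pooling structure follows; it assumes stride-1 max pooling with matching padding, so the feature map size is preserved and only the channel count grows.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Concatenate max-pool results with 1x1, 5x5, 9x9 and 13x13 kernels."""
    def __init__(self, pool_sizes=(1, 5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(k, stride=1, padding=k // 2) for k in pool_sizes])

    def forward(self, x):
        return torch.cat([pool(x) for pool in self.pools], dim=1)

x = torch.randn(1, 512, 19, 19)   # e.g. the deepest 19 x 19 feature map
print(SPP()(x).shape)             # torch.Size([1, 2048, 19, 19])
```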
1.3) building the Head network YOLO Head of the target detection network YOLOv4:
the structure of the Head network YOLO Head is, in sequence: a 3 × 3 convolutional layer and a 1 × 1 convolutional layer;
the 3 × 3 convolutional layer consists of a Conv convolutional layer, a Bn batch normalization layer and a Leaky_ReLU activation function layer;
the 1 × 1 convolutional layer consists of a Conv convolutional layer only.
The 3 × 3 convolutional layer integrates the obtained features, and the 1 × 1 convolutional layer uses those features to produce the final output.
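A sketch of one such YOLO Head, assuming three anchors per scale and a single "football" class (both assumptions made for illustration):

```python
import torch.nn as nn

def yolo_head(in_ch, num_anchors=3, num_classes=1):
    """3x3 CBL (Conv-BN-LeakyReLU) to integrate features, then a bare 1x1
    conv producing (x, y, w, h, objectness, class scores) per anchor at
    every grid cell: num_anchors * (5 + num_classes) output channels."""
    out_ch = num_anchors * (5 + num_classes)   # 18 for one football class
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch * 2, 3, 1, 1, bias=False),
        nn.BatchNorm2d(in_ch * 2),
        nn.LeakyReLU(0.1),
        nn.Conv2d(in_ch * 2, out_ch, 1, 1, 0),
    )
```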
Step 2, constructing the training data set.
2.1) collecting at least 3000 images that contain a football target and have a resolution of not less than 608 × 608;
2.2) manually marking the football bounding box in each image containing a football, generating annotation files in one-to-one correspondence with the collected images;
2.3) forming the training data set from the collected images and the annotation files.
Step 3, training the target detection network YOLOv4:
3.1) configuring the target detection network YOLOv4 environment, including CUDA 10.2, cuDNN 7.6.5, Python 3.7, Visual Studio 2019 and OpenCV 3.4;
3.2) downloading the pre-training weight file YOLOv4.conv.137 of the target detection network YOLOv4 to the local hard disk;
3.3) obtaining the prior (anchor) box sizes of the constructed training data set with the k-means clustering method, and updating the prior box sizes in the target detection network YOLOv4, as sketched below;
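YOLO-style anchor clustering typically uses a 1 - IoU distance on box (width, height) pairs rather than a Euclidean distance; the sketch below makes that assumption, with k = 9 matching the usual three anchors at each of the three scales.

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100):
    """Cluster labelled box sizes into k prior (anchor) boxes.
    wh: (N, 2) array of ground-truth box widths and heights in pixels."""
    def iou_wh(boxes, centroids):
        # IoU of boxes anchored at a common corner: overlap of w/h ranges.
        inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
                 np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
        union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
                (centroids[:, 0] * centroids[:, 1])[None, :] - inter
        return inter / union

    rng = np.random.default_rng(0)
    centroids = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(wh, centroids), axis=1)  # min 1 - IoU
        new = np.array([wh[assign == j].mean(axis=0)
                        if np.any(assign == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids[np.argsort(centroids.prod(axis=1))]   # sort by area
```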
3.4) inputting the constructed training data set and augmenting it with the CutMix method, i.e. randomly cutting a rectangular region from one image of the data set and using its pixels to replace the corresponding rectangular region in another image, forming a new combined image in which no uninformative pixels appear; a sketch follows below;
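A minimal sketch of this CutMix operation on one image pair; label mixing and bounding-box adjustment, which the full method also requires, are omitted here.

```python
import numpy as np

def cutmix(img_a, img_b, beta=1.0):
    """Cut a random rectangle out of img_b and paste it into img_a, so the
    combined image contains no uninformative (e.g. zero-padded) pixels.
    img_a, img_b: HxWx3 arrays of the same size."""
    h, w = img_a.shape[:2]
    lam = np.random.beta(beta, beta)              # area kept from img_a
    cut_w = int(w * np.sqrt(1 - lam))
    cut_h = int(h * np.sqrt(1 - lam))
    cx, cy = np.random.randint(w), np.random.randint(h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    out = img_a.copy()
    out[y1:y2, x1:x2] = img_b[y1:y2, x1:x2]       # replace the rectangle
    return out
```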
3.5) loading the pre-training weight file YOLOv4.conv.137 onto the target detection network YOLOv4 with a transfer learning method, i.e. taking the parameters of shallow layers already trained on a large data set and loading them onto the target detection network YOLOv4, so that the network can already recognize generic low-level features; this saves training time and reduces the risk of under-fitting and over-fitting;
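The embodiment loads the darknet file YOLOv4.conv.137 directly. As an illustration of the same idea, here is a PyTorch-style sketch that copies only the pretrained entries whose names and shapes match, leaving the detection heads randomly initialized; it assumes the pretrained weights have already been converted to a PyTorch state dict.

```python
import torch

def load_pretrained_backbone(model, weight_path):
    """Partially initialize a model from pretrained shallow-layer weights,
    so the backbone starts from generic low-level features while the
    YOLO heads are trained from scratch."""
    pretrained = torch.load(weight_path, map_location="cpu")
    state = model.state_dict()
    matched = {k: v for k, v in pretrained.items()
               if k in state and v.shape == state[k].shape}
    state.update(matched)          # keep random init for unmatched layers
    model.load_state_dict(state)
    return model
```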
3.6) training on the training data set with the loaded target detection network YOLOv4, implemented as follows:
3.6.1) inputting the training data set augmented by the CutMix method into the target detection network YOLOv4;
3.6.2) continuously optimizing the network training parameters in a layer-by-layer training mode until the loss function of the target detection network YOLOv4 converges, as shown in fig. 4, where the lower curve is the training loss of the target detection network YOLOv4 and the upper curve is the mean average precision mAP, obtaining the trained target detection network YOLOv4; a minimal sketch of this loop follows.
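In the sketch below, compute_yolo_loss and train_loader are hypothetical placeholders for the YOLOv4 loss and the CutMix-augmented data pipeline, and the hyperparameters are illustrative.

```python
import torch

def train(model, train_loader, compute_yolo_loss, epochs=100, device="cuda"):
    """Minimal training loop: optimize until the loss converges."""
    model.to(device).train()
    opt = torch.optim.SGD(model.parameters(), lr=1e-3,
                          momentum=0.9, weight_decay=5e-4)
    for epoch in range(epochs):
        total = 0.0
        for images, targets in train_loader:
            opt.zero_grad()
            loss = compute_yolo_loss(model(images.to(device)), targets)
            loss.backward()
            opt.step()
            total += loss.item()
        print(f"epoch {epoch}: mean loss {total / len(train_loader):.4f}")
```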
Step 4, collecting a video containing a football target, inputting it into the trained YOLOv4 football target detection network for detection and labeling, and outputting the video with the football pixels labeled, obtaining the football target detection result.
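A sketch of such a video detection loop using OpenCV's darknet importer. The file names are hypothetical, and cv2.dnn_DetectionModel with YOLOv4 cfg support requires a newer OpenCV (4.4 or later) than the 3.4 listed in the training environment, which is an assumption made here.

```python
import time
import cv2

# Hypothetical file names; the trained network is exported from darknet.
net = cv2.dnn.readNetFromDarknet("yolov4-football.cfg",
                                 "yolov4-football.weights")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(608, 608), scale=1 / 255.0, swapRB=True)

cap = cv2.VideoCapture("match.mp4")
frames, start = 0, time.time()
while True:
    ok, frame = cap.read()
    if not ok:
        break
    _, scores, boxes = model.detect(frame, confThreshold=0.5,
                                    nmsThreshold=0.4)
    for score, (x, y, w, h) in zip(scores, boxes):
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, f"football {float(score):.2f}", (x, y - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
    frames += 1
    cv2.imshow("detection", frame)
    if cv2.waitKey(1) == 27:      # Esc to quit
        break
print(f"FPS: {frames / (time.time() - start):.1f}")
cap.release()
cv2.destroyAllWindows()
```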
The effect of the present invention will be further described with reference to simulation experiments.
1. The experimental conditions are as follows:
the hardware test platform of the simulation experiment of the invention is as follows: the CPU is Intel (R) core (TM) i7-7800X, the main frequency is 3.6GHz, the memory is 80GB, the GPU is double GeForce RTX 2080Ti, and the software platform is as follows: windows10 system.
2. Experimental content and result analysis:
simulation 1, training a target detection network YOLOv4 on a constructed football target training data set by using the method of the present invention, detecting and labeling the input video containing the football target by using the trained target detection network YOLOv4, and outputting the video labeled with football pixels to obtain the detection result of the football target, as shown in fig. 5. Wherein:
fig. 5(a) shows the result of detection in the normal state of the soccer ball, fig. 5(b) shows the result of detection when the soccer ball is blocked by a player, fig. 5(c) shows the result of detection when the soccer ball is blocked by a soccer net, and fig. 5(d) shows the result of detection when the soccer ball is moving at high speed.
As can be seen from FIG. 5, the target detection network YOLOv4 has good robustness to the problems of football occlusion, high-speed motion and blurring in the football video.
Simulation 2: after training the target detection network YOLOv4 on the constructed football target training data set with the method of the present invention, the trained network is used to measure the frames per second (FPS) achieved on an input football video, giving the result shown in fig. 6.
As can be seen from fig. 6, the experiment runs at 31.3 FPS, i.e. the invention can detect football targets in 31.3 frames of images per second; the detection speed is high and real-time detection is guaranteed.
In conclusion, the real-time football target detection method based on a neural network can locate and recognize football targets well under complex conditions, enhances the representation of local detail information, effectively improves the generalization of the network, further increases the speed and accuracy of football target detection, and guarantees its real-time performance.
The above description is only one specific example of the present invention, given to help those skilled in the art understand the invention, and the invention is not limited to the scope of this specific example. It will be apparent to those skilled in the art that various changes may be made without departing from the spirit and scope of the present invention as defined in the appended claims, and everything produced using the inventive concept is protected by the present invention.

Claims (8)

1. A real-time football target detection method based on a neural network is characterized by comprising the following steps:
(1) constructing the football target detection network YOLOv4:
(1a) constructing the backbone network CSPDarknet53 of the target detection network YOLOv4;
(1b) constructing the neck network PANet of the target detection network YOLOv4;
(1c) constructing the Head network YOLO Head of the target detection network YOLOv4;
(2) constructing a training data set:
(2a) collecting at least 3000 images that contain a football target and have a resolution of not less than 608 × 608;
(2b) manually marking the football bounding box in each image containing a football, generating annotation files in one-to-one correspondence with the collected images;
(2c) forming the training data set from the collected images and the annotation files;
(3) training the target detection network YOLOv4:
(3a) configuring the target detection network YOLOv4 environment;
(3b) downloading the pre-training weight file YOLOv4.conv.137 of the target detection network YOLOv4;
(3c) obtaining the prior (anchor) box sizes of the constructed data set with the k-means clustering method, and updating the prior box sizes in the target detection network YOLOv4;
(3d) inputting the training data set and augmenting it with the CutMix method;
(3e) loading the pre-training weight file YOLOv4.conv.137 onto the target detection network YOLOv4 constructed in step (1) with a transfer learning method, obtaining the loaded target detection network YOLOv4;
(3f) training on the training data set constructed in step (2) with the loaded target detection network YOLOv4, obtaining the trained YOLOv4 football target detection network;
(4) collecting a video containing a football target, inputting it into the trained YOLOv4 football target detection network for detection and labeling, and outputting the video with the football pixels labeled, obtaining the football target detection result.
2. The method according to claim 1, wherein the backbone network CSPDarknet53 of the target detection network YOLOv4 built in (1a) has the following structural relationship:
input layer → first convolutional layer → second convolutional layer → first combination module → third convolutional layer → fourth convolutional layer → second combination module → fifth convolutional layer → sixth convolutional layer → third combination module → seventh convolutional layer → eighth convolutional layer → fourth combination module → ninth convolutional layer → tenth convolutional layer → fifth combination module → eleventh convolutional layer, wherein:
the first to eleventh convolutional layers are CBM convolutional layers, i.e. each consists of a Conv convolutional layer, a Bn batch normalization layer and a Mish activation function layer;
the first combination module is formed by splicing three CBM convolutional layers with 64 channels and one CSP residual module;
the second combination module is formed by splicing three CBM convolutional layers with 64 channels and two CSP residual modules;
the third combination module is formed by splicing three CBM convolutional layers with 128 channels and eight CSP residual modules;
the fourth combination module is formed by splicing three CBM convolutional layers with 256 channels and eight CSP residual modules;
the fifth combination module is formed by splicing three CBM convolutional layers with 512 channels and four CSP residual modules;
the CSP residual module consists of two CBM convolutional layers, with the output of the second CBM convolutional layer added to the input of the first CBM convolutional layer;
the Mish activation function is expressed as: Mish(x) = x × tanh(ln(1 + e^x)), where x denotes the output of the previous layer, e^x denotes e raised to the power x, ln(1 + e^x) denotes the natural logarithm of (1 + e^x), and tanh is the hyperbolic tangent function.
3. The method according to claim 1, wherein the neck network PANet of the target detection network YOLOv4 built in (1b) has the structure: 1st combination module → 2nd combination module → 3rd combination module → 4th combination module, wherein:
the 1st combination module consists of the 1st stacking module and a convolutional layer; the 1st stacking module stacks the feature layer obtained by passing the output of the spatial pyramid structure through a CBL layer and upsampling with the 38 × 38 × 512 effective feature layer of the backbone network; the convolutional layer is alternating 1 × 1 and 3 × 3 convolutions;
the 2nd combination module consists of the 2nd stacking module and a convolutional layer; the 2nd stacking module stacks the feature layer obtained by passing the output of the 1st combination module through a CBL layer and upsampling with the 76 × 76 × 256 effective feature layer of the backbone network; the convolutional layer is alternating 1 × 1 and 3 × 3 convolutions;
the 3rd combination module consists of the 3rd stacking module and a convolutional layer; the 3rd stacking module stacks the output feature layer of the 1st combination module with the downsampled output feature layer of the 2nd combination module; the convolutional layer is alternating 1 × 1 and 3 × 3 convolutions;
the 4th combination module consists of the 4th stacking module and a convolutional layer; the 4th stacking module stacks the downsampled output feature layer of the 3rd combination module with the output feature layer of the spatial pyramid structure; the convolutional layer is alternating 1 × 1 and 3 × 3 convolutions;
the CBL convolutional layer consists of a Conv convolution, Bn batch normalization and a Leaky_ReLU activation function;
the spatial pyramid structure performs multi-scale fusion by max pooling with 1 × 1, 5 × 5, 9 × 9 and 13 × 13 kernels.
4. The method of claim 1, wherein the Head network YOLO Head of the target detection network YOLOv4 constructed in (1c) has the structure: a 3 × 3 convolutional layer connected to a 1 × 1 convolutional layer;
the 3 × 3 convolutional layer consists of a Conv convolutional layer, a Bn batch normalization layer and a Leaky_ReLU activation function layer;
the 1 × 1 convolutional layer consists of a Conv convolutional layer only.
5. The method according to claim 1, wherein the target detection network YOLOv4 environment configured in (3a) comprises CUDA 10.2, cuDNN 7.6.5, Python 3.7, Visual Studio 2019 and OpenCV 3.4.
6. The method of claim 1, wherein in (3d) the input training set is augmented with the CutMix method by randomly cutting a rectangular region from one image and using its pixels to replace the corresponding rectangular region in another image, forming a new combined image.
7. The method of claim 1, wherein in (3e) the pre-training weight file YOLOv4.conv.137 is loaded onto the target detection network YOLOv4 with a transfer learning method, i.e. the parameters of shallow layers already trained on a large data set are obtained and loaded onto the target detection network YOLOv4, so that the network can recognize generic low-level features, saving training time and reducing the risk of under-fitting and over-fitting.
8. The method of claim 1, wherein in (3f) the training data set is trained with the loaded target detection network YOLOv4, as follows:
(3f1) inputting the data set augmented by the CutMix method into the target detection network YOLOv4;
(3f2) in a layer-by-layer training mode, first learning to recognize generic low-level features, then quickly learning the high-dimensional features of the augmented data set, and continuously optimizing the network training parameters until the loss function of the target detection network YOLOv4 converges, obtaining the trained target detection network YOLOv4.
CN202010705052.4A 2020-07-21 2020-07-21 Real-time football target detection method based on neural network Active CN111832513B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010705052.4A CN111832513B (en) 2020-07-21 2020-07-21 Real-time football target detection method based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010705052.4A CN111832513B (en) 2020-07-21 2020-07-21 Real-time football target detection method based on neural network

Publications (2)

Publication Number Publication Date
CN111832513A true CN111832513A (en) 2020-10-27
CN111832513B CN111832513B (en) 2024-02-09

Family

ID=72924506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010705052.4A Active CN111832513B (en) 2020-07-21 2020-07-21 Real-time football target detection method based on neural network

Country Status (1)

Country Link
CN (1) CN111832513B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019028725A1 (en) * 2017-08-10 2019-02-14 Intel Corporation Convolutional neural network framework using reverse connections and objectness priors for object detection
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN110348376A (en) * 2019-07-09 2019-10-18 华南理工大学 A kind of pedestrian's real-time detection method neural network based
CN111126359A (en) * 2019-11-15 2020-05-08 西安电子科技大学 High-definition image small target detection method based on self-encoder and YOLO algorithm
AU2020100705A4 (en) * 2020-05-05 2020-06-18 Chang, Jiaying Miss A helmet detection method with lightweight backbone based on yolov3 network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Guan Junlin; Zhi Xin: "Mask-wearing detection method based on the YOLOv4 convolutional neural network", Modern Information Technology (现代信息科技), no. 11 *
Chen Cong; Yang Zhong; Song Jiarong; Han Jiaming: "An improved convolutional neural network method for pedestrian recognition", Applied Science and Technology (应用科技), no. 03 *
Huang Guoxin; Liang Binbin; Zhang Jianwei: "Real-time small-target detection in airport scenes based on maintaining high resolution", Modern Computer (现代计算机), no. 05 *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364734A (en) * 2020-10-30 2021-02-12 福州大学 Abnormal dressing detection method based on yolov4 and CenterNet
CN113205108A (en) * 2020-11-02 2021-08-03 哈尔滨理工大学 YOLOv 4-based multi-target vehicle detection and tracking method
CN112347943A (en) * 2020-11-09 2021-02-09 哈尔滨理工大学 Anchor optimization safety helmet detection method based on YOLOV4
CN112381806A (en) * 2020-11-18 2021-02-19 上海北昂医药科技股份有限公司 Double centromere aberration chromosome analysis and prediction method based on multi-scale fusion method
CN112508001A (en) * 2020-12-03 2021-03-16 安徽理工大学 Coal gangue positioning method based on multispectral waveband screening and improved U-Net
CN112651326A (en) * 2020-12-22 2021-04-13 济南大学 Driver hand detection method and system based on deep learning
CN112907660B (en) * 2021-01-08 2022-10-04 浙江大学 Underwater laser target detector for small sample
CN112907660A (en) * 2021-01-08 2021-06-04 浙江大学 Underwater laser target detector for small sample
CN112766188B (en) * 2021-01-25 2024-05-10 浙江科技学院 Small target pedestrian detection method based on improved YOLO algorithm
CN112766188A (en) * 2021-01-25 2021-05-07 浙江科技学院 Small-target pedestrian detection method based on improved YOLO algorithm
CN112927297A (en) * 2021-02-20 2021-06-08 华南理工大学 Target detection and visual positioning method based on YOLO series
CN113052184A (en) * 2021-03-12 2021-06-29 电子科技大学 Target detection method based on two-stage local feature alignment
CN112781634A (en) * 2021-04-12 2021-05-11 南京信息工程大学 BOTDR distributed optical fiber sensing system based on YOLOv4 convolutional neural network
CN113822844A (en) * 2021-05-21 2021-12-21 国电电力宁夏新能源开发有限公司 Unmanned aerial vehicle inspection defect detection method and device for blades of wind turbine generator system and storage medium
CN113239842A (en) * 2021-05-25 2021-08-10 三门峡崤云信息服务股份有限公司 Image recognition-based swan detection method and device
CN113239845A (en) * 2021-05-26 2021-08-10 青岛以萨数据技术有限公司 Infrared target detection method and system for embedded platform
CN113420607A (en) * 2021-05-31 2021-09-21 西南电子技术研究所(中国电子科技集团公司第十研究所) Multi-scale target detection and identification method for unmanned aerial vehicle
CN113487551A (en) * 2021-06-30 2021-10-08 佛山市南海区广工大数控装备协同创新研究院 Gasket detection method and device for improving performance of dense target based on deep learning
CN113487551B (en) * 2021-06-30 2024-01-16 佛山市南海区广工大数控装备协同创新研究院 Gasket detection method and device for improving dense target performance based on deep learning
CN113592825A (en) * 2021-08-02 2021-11-02 安徽理工大学 YOLO algorithm-based real-time coal gangue detection method
CN113486865A (en) * 2021-09-03 2021-10-08 国网江西省电力有限公司电力科学研究院 Power transmission line suspended foreign object target detection method based on deep learning
CN113763356A (en) * 2021-09-08 2021-12-07 国网江西省电力有限公司电力科学研究院 Target detection method based on visible light and infrared image fusion
CN114492625A (en) * 2022-01-23 2022-05-13 北京工业大学 Solution of target detection network search model based on migration to detection problem of intelligent vehicle marker
CN114495029A (en) * 2022-01-24 2022-05-13 中国矿业大学 Traffic target detection method and system based on improved YOLOv4

Also Published As

Publication number Publication date
CN111832513B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN111832513B (en) Real-time football target detection method based on neural network
Dvornik et al. On the importance of visual context for data augmentation in scene understanding
CN111126472B (en) SSD (solid State disk) -based improved target detection method
Dvornik et al. Modeling visual context is key to augmenting object detection datasets
Huang et al. Mask R-CNN with pyramid attention network for scene text detection
CN109241982B (en) Target detection method based on deep and shallow layer convolutional neural network
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN114202672A (en) Small target detection method based on attention mechanism
CN113591795B (en) Lightweight face detection method and system based on mixed attention characteristic pyramid structure
US8744168B2 (en) Target analysis apparatus, method and computer-readable medium
CN110059558A (en) A kind of orchard barrier real-time detection method based on improvement SSD network
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
CN109711407B (en) License plate recognition method and related device
Zhou et al. Water-Filling: a novel way for image structural feature extraction
CN111191649A (en) Method and equipment for identifying bent multi-line text image
KR101618996B1 (en) Sampling method and image processing apparatus for estimating homography
CN110705566B (en) Multi-mode fusion significance detection method based on spatial pyramid pool
US20210256707A1 (en) Learning to Segment via Cut-and-Paste
Wang et al. Multiscale deep alternative neural network for large-scale video classification
Ma et al. Mdcn: Multi-scale, deep inception convolutional neural networks for efficient object detection
CN112800955A (en) Remote sensing image rotating target detection method and system based on weighted bidirectional feature pyramid
CN112085017A (en) Tea tender shoot image segmentation method based on significance detection and Grabcut algorithm
Yang et al. Real-time pedestrian detection via hierarchical convolutional feature
CN114943729A (en) Cell counting method and system for high-resolution cell image
Zhang et al. Multi-scale salient object detection with pyramid spatial pooling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant