CN111291617B - Badminton event video highlight segment extraction method based on machine learning

Badminton event video highlight segment extraction method based on machine learning

Info

Publication number
CN111291617B
Authority
CN
China
Prior art keywords
badminton
video
frame
highlight
visual angle
Prior art date
Legal status
Active
Application number
CN202010031201.3A
Other languages
Chinese (zh)
Other versions
CN111291617A (en)
Inventor
Wang Meili
Luo Jiankun
Tao Shu
Wang Yihan
Current Assignee
Northwest A&F University
Original Assignee
Northwest A&F University
Priority date
Filing date
Publication date
Application filed by Northwest A&F University filed Critical Northwest A&F University
Priority to CN202010031201.3A
Publication of CN111291617A
Application granted
Publication of CN111291617B


Classifications

    • G06V20/42 — Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items, of sport video content
    • G06F18/23213 — Non-hierarchical clustering techniques using statistics or function optimisation, e.g. modelling of probability density functions, with a fixed number of clusters, e.g. K-means clustering
    • G06V20/46 — Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/49 — Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes


Abstract

The invention discloses a method for extracting highlight segments from badminton event videos, comprising the following steps: acquiring a badminton video; performing transfer learning with the Keras framework to obtain a view-angle classification model; segmenting the badminton video according to the view-angle classification model to obtain broadcast-view badminton video segments; determining each player's center of gravity with a YOLOv3 object detection model; and determining the overall average player speed within each broadcast-view segment from those centers of gravity, and taking the several broadcast-view segments with the highest overall average player speed as the badminton video highlight segments. Through a segment-splitting stage centered on view-angle classification and a highlight-extraction stage centered on judging the overall player speed, the invention realizes a highlight extraction method for badminton event videos, so that a user can directly enjoy the match highlights in a badminton event video, saving the user's time.

Description

Badminton event video highlight segment extraction method based on machine learning
Technical Field
The invention relates to the technical field of sports video processing, and in particular to a method for extracting highlight segments from badminton event videos.
Background
Sports videos have a huge audience and huge commercial prospects. With the popularity of mobile devices and the internet, users' demand for sports video has shifted from directly watching complete videos to diversified needs, for example, directly watching the highlights and unusual moments of a match, which correspond to highlight extraction and anomaly detection in video analysis. In the field of sports video analysis, sports such as basketball, football, volleyball and tennis have been studied extensively, while badminton has received comparatively little attention, leaving a large gap to fill.
Current highlight extraction techniques are based primarily on panoramic key frames. For example, Guangdong Xin-Cheng Electric Technology Co., Ltd. extracts and matches moving targets across multiple preceding and following frames, treats the matched frames as a single frame, repeats this several times, and concatenates the condensed video frames into a highlight. With the worldwide popularization of badminton, research on it has been growing in recent years. Because badminton and tennis share many similarities in venue and style of play, many techniques for badminton video have been borrowed from the study of tennis videos.
Existing techniques for capturing key information in sports video focus mainly on key-frame extraction; no research on directly and effectively extracting highlight segments from videos has appeared. Key frames extracted from a video cannot show the motion trajectory or motion direction of a foreground target; they are static key information. A highlight segment, by contrast, is a multi-frame video sequence cut from the whole video at moments when play is intense; it can show motion characteristics, constitutes dynamic key information, and lets a user watch the dynamic key information of a video without manual processing.
Disclosure of Invention
The embodiment of the invention provides a method for extracting highlight segments from badminton event videos, which solves the problems described in the Background section.
The embodiment of the invention provides a method for extracting highlight segments from badminton event videos, comprising the following steps:
acquiring a badminton video;
performing transfer learning with the Keras framework to obtain a view-angle classification model; segmenting the badminton video according to the view-angle classification model to obtain broadcast-view badminton video segments;
determining each player's center of gravity with a YOLOv3 object detection model; determining the overall average player speed within each broadcast-view badminton video segment from those centers of gravity, and taking the several broadcast-view segments with the highest overall average player speed as the badminton video highlight segments.
Further, the badminton video includes: broadcast-view badminton video, court-side-view badminton video, and useless-view badminton video.
Further, the method for extracting badminton event video highlight segments provided by the embodiment of the invention further comprises: clustering images from the three view angles of badminton video with the K-Means method to verify the feasibility of badminton video classification (a sketch follows the list below); specifically:
reducing the dimensionality of the images with Principal Component Analysis (PCA);
specifying the number of clusters as 3, one for each of the three view angles of badminton video;
clustering the images with the K-Means method;
verifying, from the clustering result, the feasibility of classifying the three view angles of badminton video.
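By way of illustration, the pre-experiment can be sketched with scikit-learn as follows; the folder layout, grayscale resizing and number of PCA components are assumptions for the sketch, not values fixed by the invention.

```python
import glob
import numpy as np
from PIL import Image
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Load the 60 pre-experiment images (20 per view angle) as flattened vectors.
paths = sorted(glob.glob("pre_experiment/*.jpg"))  # hypothetical folder
X = np.array([
    np.asarray(Image.open(p).convert("L").resize((64, 64)), dtype=np.float32).ravel()
    for p in paths
])

X_reduced = PCA(n_components=50).fit_transform(X)  # PCA dimension reduction
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X_reduced)  # 3 view angles
print(labels)  # inspect whether the clusters align with the three view angles
```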
Further, performing transfer learning with the Keras framework to obtain the view-angle classification model specifically comprises:
performing transfer learning with the Keras deep learning framework, using an image classification model pre-trained on the ImageNet image library as the pre-trained model for transfer learning; using MobileNet as the base network structure for the three-view classification; replacing traditional 3D convolution with depthwise separable (Depthwise) convolution;
experimenting with the effect of the pre-trained models, with parameters set as follows: input image size 224×224; an added fully connected layer of size 3; image pixel values scaled to [0, 1]; SGD optimizer with learning rate 0.0001 and momentum 0.9; categorical cross-entropy loss; batch size 32; the learning rate is halved when the validation loss has not decreased for 5 iterations; training stops when the validation loss has not decreased for 30 iterations. A minimal sketch of this setup follows.
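The following is a minimal Keras sketch of the transfer-learning setup with the hyperparameters listed above; the data directory names, the global-average-pooling layer and the epoch ceiling are assumptions not specified by the invention.

```python
from tensorflow.keras.applications import MobileNet
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# MobileNet pre-trained on ImageNet, topped with a fully connected layer of size 3.
base = MobileNet(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
x = GlobalAveragePooling2D()(base.output)   # pooling layer is an assumption
out = Dense(3, activation="softmax")(x)
model = Model(base.input, out)
model.compile(optimizer=SGD(learning_rate=1e-4, momentum=0.9),
              loss="categorical_crossentropy", metrics=["accuracy"])

gen = ImageDataGenerator(rescale=1.0 / 255)  # scale pixel values to [0, 1]
train = gen.flow_from_directory("data/train", target_size=(224, 224), batch_size=32)
val = gen.flow_from_directory("data/val", target_size=(224, 224), batch_size=32)

callbacks = [
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),  # halve LR
    EarlyStopping(monitor="val_loss", patience=30),                 # stop training
]
# epochs=200 is a generous ceiling; early stopping ends training in practice.
model.fit(train, validation_data=val, callbacks=callbacks, epochs=200)
```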
Further, segmenting the badminton video according to the view-angle classification model to obtain broadcast-view badminton video segments specifically comprises:
reading each frame of a complete badminton event video in sequence and, for each frame, doing the following: if the frame's view is predicted as the broadcast view and the previous frame's was not, creating a storage queue for the consecutive frames of a video segment whose first frame is this frame, and writing the frame into that queue; if both the frame and its previous frame are predicted as the broadcast view, writing the frame into the most recently created storage queue. A sketch of this loop follows.
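A sketch of this splitting loop with OpenCV; `predict_view` is a hypothetical wrapper around the trained view-angle classifier, and in-memory lists stand in for the storage queues.

```python
import cv2

def split_broadcast_segments(video_path, predict_view):
    """Split a full match video into broadcast-view segments (lists of frames)."""
    cap = cv2.VideoCapture(video_path)
    segments, prev_is_broadcast = [], False
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        is_broadcast = predict_view(frame) == "broadcast"
        if is_broadcast and not prev_is_broadcast:
            segments.append([frame])    # new storage queue; frame is its first frame
        elif is_broadcast and prev_is_broadcast:
            segments[-1].append(frame)  # append to the most recently created queue
        # other cases are not handled, matching the rule above
        prev_is_broadcast = is_broadcast
    cap.release()
    return segments  # one storage queue per broadcast-view video segment
```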
Further, the method for extracting badminton event video highlight segments provided by the embodiment of the invention further comprises: visualizing the view-angle classification model; specifically:
visualizing the view-angle classification model with the t-SNE dimension-reduction method.
Further, determining each player's center of gravity with the YOLOv3 object detection model specifically comprises:
detecting, with the YOLOv3 object detection model, targets of class 'person' in each frame of the badminton video segments, and defining the center of gravity from the detection box; object detection assumes the person's head is at the top, so the midpoint of the top edge of the target box is taken as the apex of a triangle and the bottom edge of the target box supplies the other two points. With the upper-left corner of the target box at (x₁, y₁) and the lower-right corner at (x₂, y₂), the center of gravity is
(x_c, y_c) = ((x₁ + x₂)/2, (y₁ + 2y₂)/3)
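As an illustration, this center-of-gravity definition reduces to a one-line function; a sketch, with names chosen for clarity rather than taken from the patent.

```python
def center_of_gravity(x1, y1, x2, y2):
    """Centroid of the triangle formed by the top-edge midpoint ((x1+x2)/2, y1)
    and the two bottom corners (x1, y2), (x2, y2) of a detection box."""
    return (x1 + x2) / 2.0, (y1 + 2.0 * y2) / 3.0
```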
further, according to the center of gravity of the athlete, determining the average speed of the whole player in the badminton video segment with the broadcasting visual angle; the method specifically comprises the following steps:
the moving speed of the last player detected in two adjacent frames is approximately the whole speed, and the gravity centers of the last player detected in the ith and (i+1) th adjacent frames are respectively (p) i ,q i ),(p i+1 ,q i+1 ) The overall speed of the two frames of players is:
the average speed of the player for the entire video clip is defined as:
wherein N is the total frame number of the video clips, and T is the total playing duration of the video clips.
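A minimal sketch of these two formulas in Python; the function names and the list-of-centers input format are assumptions for illustration.

```python
import math

def overall_speeds(centers):
    """Frame-to-frame overall speeds v_i from per-frame centers of gravity (p, q)."""
    return [math.hypot(p2 - p1, q2 - q1)
            for (p1, q1), (p2, q2) in zip(centers, centers[1:])]

def average_speed(centers, duration_seconds):
    """Average player speed of a clip: the summed v_i divided by the duration T."""
    return sum(overall_speeds(centers)) / duration_seconds
```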
Further, taking the several broadcast-view badminton video segments with the highest overall average player speed as the badminton video highlight segments specifically comprises:
sorting the broadcast-view badminton video segments by overall average player speed from high to low, and taking the first ten as the badminton video highlight segments.
Further, the method for extracting badminton event video highlight segments provided by the embodiment of the invention further comprises: evaluating the badminton video highlight segments against audio keywords; specifically:
the evaluation criterion is whether a highlight segment contains the audience's gasps of surprise, shouts and intense applause, and positive evaluation from the commentator.
Compared with the prior art, the method for extracting badminton event video highlight segments provided by the embodiment of the invention has the following beneficial effects:
Through a segment-splitting stage centered on view-angle classification and a highlight-extraction stage centered on overall player speed, the invention realizes a highlight extraction method for badminton event videos. Unlike key-frame extraction, it aims at extracting highlight segments from the whole badminton event video, so that a user can directly and selectively enjoy the match highlights in a badminton event video, can to a certain extent watch the key segments of the video without manual processing, and saves the user's time.
Drawings
FIG. 1 is a flowchart of the badminton event video highlight extraction method according to an embodiment of the present invention;
FIG. 2a shows the broadcast view of a badminton video according to an embodiment of the present invention;
FIG. 2b shows the court-side view of a badminton video according to an embodiment of the present invention;
FIG. 2c shows a useless view of a badminton video according to an embodiment of the present invention;
FIG. 3 compares traditional convolution with Depthwise convolution according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of badminton video segment splitting according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the center of gravity of a target box according to an embodiment of the present invention.
Detailed Description
The following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings. The described embodiments are apparently only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Referring to FIG. 1, an embodiment of the present invention provides a method for extracting highlight segments from badminton event videos, the method comprising:
step 1, obtaining a badminton video.
Step 2, performing migration learning by adopting a Keras framework to obtain a visual angle classification model; and dividing the badminton video according to the visual angle classification model to obtain a broadcasting visual angle badminton video segment.
Step 2, determining the gravity center of the athlete through a YOLOv3 target detection model; and determining the average speed of the whole player in the badminton video segments with the broadcasting viewing angles according to the gravity centers of the athletes, and taking the several badminton video segments with the broadcasting viewing angles, the average speed of which is the largest, as the badminton video highlight segments.
The specific processes of the steps 1 to 3 are as follows:
the invention aims to provide a service for saving a great deal of time for directly watching the badminton game video for the audience user of the badminton game video, and enabling the user to selectively and directly enjoy the video highlight of the badminton game.
The first step is to collect badminton videos.
The collected material consists of major BWF (Badminton World Federation) open tournaments of 2018, used to train the classification model, and the first six BWF open tournaments of 2019, used to test segment splitting and highlight extraction. The videos were collected from YouTube and Baidu Netdisk, mainly by manually operating Baidu Netdisk and Gihosoft TubeGet.
The second step is to segment the badminton video.
First, a pre-experiment is carried out through clustering to verify the feasibility of the classification proposed by the invention. Transfer learning is then performed through the Keras framework to generate a classification model, on which various performance measurements and visualizations are made; the t-SNE dimension-reduction method is used to visualize the distribution of the data after the various optimizations. Finally, badminton video segment splitting is implemented in Python with OpenCV.
The badminton video is segmented on the basis of classifying the three view angles shown in FIG. 2: the broadcast view, the court-side view and the useless view. The broadcast view, i.e. the view presented when a match is rebroadcast on television or a network platform, is characterized by a camera looking down on the court; the court-side view, i.e. the view from a camera placed beside the court, is characterized by a camera level with the court; the useless view covers everything other than these two, characterized by the full court or the complete stroke trajectory not being observable.
The K-Means method is used to cluster images of the three view angles to verify the feasibility of the classification. Transfer learning with the Keras framework is used to build the classification model; after comprehensive comparison, MobileNet is selected as the classification model, and performance measurement and visualization are carried out on it. Finally, the trained classification model is used to split the badminton video into segments, laying the basis for the subsequent highlight extraction.
View classification feasibility verification
It is difficult for a computer to understand the content and meaning of a picture, so a pre-experiment is needed to verify whether the three-view classification is reasonable. The more pictures used to fit the PCA, the larger the generated clustering model, so, weighing the size of the generated model, the pre-experiment uses 20 images from each of the three classes, 60 in total. The feasibility of the three classes is verified with the following clustering procedure: (1) reduce the dimensionality of the images with Principal Component Analysis (PCA); (2) specify the number of clusters as 3, one per view angle; (3) cluster the images with the K-Means method. The clustering results indicate that the three-view classification is feasible.
Establishing the view-angle classification model
The invention performs transfer learning with the Keras deep learning framework, using image classification models pre-trained on the ImageNet image library as the pre-trained models for transfer learning, which greatly improves classification performance and accelerates training convergence. The candidate pre-trained models were tested one by one. The main parameters for training the experimental models are: (1) input image size 224×224; (2) an added fully connected layer of size 3; (3) image pixel values scaled to [0, 1]; (4) SGD optimizer, learning rate 0.0001, momentum 0.9; (5) categorical cross-entropy loss; (6) batch size 32; (7) the learning rate is halved when the validation loss has not decreased for 5 iterations; (8) training stops when the validation loss has not decreased for 30 iterations.
The test indicators are mainly model size, validation accuracy, validation loss and FPS; the graphics card used for testing is an NVIDIA GeForce GTX 1080. Because sports video analysis pursues real-time performance and therefore needs fast inference, the system selects MobileNet, which has the highest FPS, as the base network structure for the three-view classification. MobileNet is very fast; its main idea is separable convolution, which greatly reduces computation: Depthwise convolution replaces traditional 3D convolution and reduces the redundant expression of the convolution kernels, as shown in FIG. 3 and sketched below.
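For illustration, the contrast of FIG. 3 can be sketched in Keras as follows; the channel counts are arbitrary examples, not values taken from the patent.

```python
from tensorflow.keras import layers

# Standard 3x3 convolution: 3*3*C_in*C_out weights (e.g. 3*3*32*64 = 18432).
standard = layers.Conv2D(64, (3, 3), padding="same")

# Depthwise-separable factorization used by MobileNet: one 3x3 depthwise filter
# per input channel (3*3*32 = 288 weights) followed by a 1x1 pointwise
# convolution mixing channels (32*64 = 2048 weights) - roughly 8x fewer
# parameters and multiply-adds for this example.
depthwise = layers.DepthwiseConv2D((3, 3), padding="same")
pointwise = layers.Conv2D(64, (1, 1))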
Badminton video segment segmentation
Before badminton highlight extraction, the complete badminton video is divided into useful segments, composed mainly of broadcast-view frames, which show the players moving and hitting. The classification model obtained by transfer learning can split out these useful badminton video segments effectively. FIG. 4 shows the flow of badminton video segment splitting: a complete badminton video is divided into segments composed mainly of the broadcast view; these segments form the effective part of the badminton video and lay the foundation for the later highlight extraction. The splitting process reads each frame of a complete badminton event video in sequence and, for each frame, does the following: if the frame's view is predicted as the broadcast view and the previous frame's was not, a storage queue S is created for the consecutive frames of a video segment whose first frame is this frame, and the frame is written into it; if both the frame and its previous frame are predicted as the broadcast view, the frame is written into the most recently created storage queue; other cases are not handled. The final number of storage queues is the number of segments produced, with each storage queue corresponding to one video segment.
The third step is to extract the highlight segments.
The concepts of the player's center of gravity and the overall player speed are defined; the centers of gravity are computed with a YOLOv3 object detection model, the overall player speed is computed from them, the ten segments with the highest average speed are taken as highlight segments, and finally the highlight segments are evaluated against audio keywords.
The method uses the YOLOv3 object detection model to detect targets of class 'person' in each frame of the badminton segments, defines the center of gravity from the detection box, obtains the players' movement speed from consecutive frames, accumulates the per-frame speeds of each segment and computes their average, and finally orders the highlight segments by sorting the segments' average speeds in descending order.
Target detection
YOLOv3 borrows the structure of residual networks and is currently a leading real-time object detection model in industry; the system uses a pre-trained YOLOv3 network to detect targets of class 'person', as sketched below.
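As an illustrative sketch only (the patent does not prescribe a particular runtime), person detection with a pretrained YOLOv3 can be run through OpenCV's DNN module; the cfg/weights file names are assumptions about locally available files, and non-maximum suppression is omitted for brevity.

```python
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")  # assumed local files

def detect_people(frame, conf_threshold=0.5):
    """Return (x1, y1, x2, y2) boxes for targets of COCO class 0 ('person')."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    boxes = []
    for output in net.forward(net.getUnconnectedOutLayersNames()):
        for det in output:                       # det = [cx, cy, bw, bh, obj, scores...]
            scores = det[5:]
            class_id = int(np.argmax(scores))
            if class_id == 0 and scores[class_id] > conf_threshold:
                cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
                boxes.append((cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2))
    return boxes
```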
Player movement speed
Object detection assumes the person's head is at the top: the midpoint of the top edge of the target box is taken as the apex of a triangle, and the bottom edge of the target box supplies the other two points. With the upper-left corner of the target box at (x₁, y₁) and the lower-right corner at (x₂, y₂), as shown in FIG. 5, the center of gravity is
(x_c, y_c) = ((x₁ + x₂)/2, (y₁ + 2y₂)/3)
Because the players detected in consecutive frames are not necessarily the same person, and because the target most likely to be a player can jump between detections, the players in each frame are treated as a whole and the overall speed is computed approximately: the movement speed of the last player detected in two adjacent frames is taken as the overall speed. Let the centers of gravity of the last player detected in adjacent frames i and i+1 be (pᵢ, qᵢ) and (pᵢ₊₁, qᵢ₊₁); the overall player speed across these two frames is
vᵢ = √((pᵢ₊₁ − pᵢ)² + (qᵢ₊₁ − qᵢ)²)
The average player speed over the entire video segment is defined as
v̄ = (1/T) · Σ vᵢ, summed over i = 1, …, N−1,
where N is the total number of frames in the video segment and T is its total playing duration.
Badminton event video highlight extraction
Highlight extraction lets the user directly watch the essence of the whole video. Among the segments split from a badminton video, the shorter ones are filtered out with a time threshold of 15 seconds to reduce interference. The average player speed of each remaining segment is computed, and the segments are sorted by average player speed from high to low. By the verification criterion, 54 of the finally extracted highlight segments were judged to be true highlights, a proportion of 93.10%, indicating that the highlight extraction method is effective.
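A sketch of this selection step, assuming each segment's duration and average player speed have already been computed; the tuple layout is illustrative.

```python
def select_highlights(clips, top_k=10, min_seconds=15):
    """clips: list of (clip_id, duration_seconds, average_speed) tuples."""
    kept = [c for c in clips if c[1] >= min_seconds]   # drop clips under 15 s
    kept.sort(key=lambda c: c[2], reverse=True)        # average speed, high to low
    return kept[:top_k]                                # first ten as highlights
```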
In summary, the invention belongs to the field of computer vision and relates to a sports-video key-information capture technology combining image classification, video segmentation, target tracking and speed analysis. It designs a system that extracts the highlight video segments of a match from a badminton event video, addressing both the time ordinary users spend watching lengthy badminton event videos and the strong demand of professional users such as athletes and coaches for key statistics. Unlike key-frame extraction, it extracts highlight segments from the whole badminton event video so that the user can directly and selectively enjoy the match highlights. Specifically, the invention realizes the highlight extraction method through a segment-splitting stage centered on view-angle classification and a highlight-extraction stage centered on overall player speed, enabling the user, to a certain extent, to watch the key segments of a video without manual processing and saving the user's time.
The foregoing disclosure is only a few specific embodiments of the present invention and various changes and modifications may be made by those skilled in the art without departing from the spirit and scope of the invention, and it is intended that the invention also includes such changes and modifications as fall within the scope of the claims and their equivalents.

Claims (6)

1. A method for extracting highlight segments from badminton event videos, characterized by comprising the following steps:
acquiring a badminton video;
performing transfer learning with the Keras framework to obtain a view-angle classification model; segmenting the badminton video with the view-angle classification model to obtain broadcast-view badminton video segments;
invoking a YOLOv3 object detection model to determine each player's center of gravity; determining the overall average player speed within each broadcast-view badminton video segment from those centers of gravity, and taking the several broadcast-view segments with the highest overall average player speed as badminton video highlight segments;
wherein performing transfer learning with the Keras framework to obtain the view-angle classification model specifically comprises:
performing transfer learning with the Keras deep learning framework, using a classification model trained on the large image dataset ImageNet as the pre-trained model and MobileNet as the base network structure for the three-view classification; replacing traditional 3D convolution with depthwise separable (Depthwise) convolution;
experimenting with the effect of the pre-trained models, with parameters set as follows: input image size 224×224; an added fully connected layer of size 3; image pixel values scaled to [0, 1]; SGD optimizer with learning rate 0.0001 and momentum 0.9; categorical cross-entropy loss; batch size 32; the learning rate is halved when the validation loss has not decreased for 5 iterations; training stops when the validation loss has not decreased for 30 iterations;
wherein segmenting the badminton video with the view-angle classification model to obtain broadcast-view badminton video segments specifically comprises:
reading each frame of a complete badminton event video in sequence and, for each frame, doing the following: if the frame's view is predicted as the broadcast view and the previous frame's was not, creating a storage queue for the consecutive frames of a video segment whose first frame is this frame, and writing the frame into that queue; if both the frame and its previous frame are predicted as the broadcast view, writing the frame into the most recently created storage queue;
wherein invoking the YOLOv3 object detection model to determine each player's center of gravity specifically comprises:
detecting, with the YOLOv3 object detection model, targets of class 'person' in each frame of the badminton video segments, and defining the center of gravity from the detection box; object detection assumes the person's head is at the top, so the midpoint of the top edge of the target box is taken as the apex of a triangle and the bottom edge of the target box supplies the other two points; with the upper-left corner of the target box at (x₁, y₁) and the lower-right corner at (x₂, y₂), the center of gravity is
(x_c, y_c) = ((x₁ + x₂)/2, (y₁ + 2y₂)/3)
wherein determining the overall average player speed within each broadcast-view badminton video segment from the players' centers of gravity specifically comprises:
approximating the overall speed by the movement speed of the last player detected in two adjacent frames; with the centers of gravity of the last player detected in adjacent frames i and i+1 being (pᵢ, qᵢ) and (pᵢ₊₁, qᵢ₊₁), the overall player speed across these two frames is
vᵢ = √((pᵢ₊₁ − pᵢ)² + (qᵢ₊₁ − qᵢ)²)
the average player speed over the entire video segment is defined as
v̄ = (1/T) · Σ vᵢ, summed over i = 1, …, N−1,
where N is the total number of frames in the video segment and T is its total playing duration.
2. The method for extracting highlight segments from badminton event videos according to claim 1, wherein the badminton video comprises: broadcast-view badminton video, court-side-view badminton video, and useless-view badminton video.
3. The method for extracting highlight segments from badminton event videos according to claim 2, further comprising: clustering images from the three view angles of badminton video with the K-Means method to verify the feasibility of badminton video classification; specifically:
reducing the dimensionality of the images with Principal Component Analysis (PCA);
specifying the number of clusters as 3, one for each of the three view angles of badminton video;
clustering the images with the K-Means method;
verifying, from the clustering result, the feasibility of classifying the three view angles of badminton video.
4. The method for extracting highlight segments from badminton event videos according to claim 1, further comprising: visualizing the view-angle classification model; specifically:
visualizing the view-angle classification model with the t-SNE dimension-reduction method.
5. The method for extracting highlight segments from badminton event videos according to claim 1, wherein taking the several broadcast-view badminton video segments with the highest overall average player speed as badminton video highlight segments specifically comprises:
sorting the broadcast-view badminton video segments by overall average player speed from high to low, and taking the first ten as badminton video highlight segments.
6. The method for extracting highlight segments from badminton event videos according to claim 5, further comprising: evaluating the badminton video highlight segments against audio keywords; specifically:
the evaluation criterion is whether a highlight segment contains the audience's gasps of surprise, shouts and intense applause, and positive evaluation from the commentator.
CN202010031201.3A 2020-01-13 2020-01-13 Badminton event video highlight segment extraction method based on machine learning Active CN111291617B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010031201.3A CN111291617B (en) 2020-01-13 2020-01-13 Badminton event video highlight segment extraction method based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010031201.3A CN111291617B (en) 2020-01-13 2020-01-13 Badminton event video highlight segment extraction method based on machine learning

Publications (2)

Publication Number Publication Date
CN111291617A CN111291617A (en) 2020-06-16
CN111291617B 2023-11-17

Family

ID=71028362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010031201.3A Active CN111291617B (en) 2020-01-13 2020-01-13 Badminton event video highlight segment extraction method based on machine learning

Country Status (1)

Country Link
CN (1) CN111291617B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112668533A (en) * 2021-01-05 2021-04-16 株洲中车时代电气股份有限公司 Video visual angle classification method, device, equipment and storage medium
CN113312840B (en) * 2021-05-25 2023-02-17 广州深灵科技有限公司 Badminton playing method and system based on reinforcement learning
CN113569096B (en) * 2021-07-23 2024-03-29 北京百度网讯科技有限公司 Structured information extraction method, device, equipment and storage medium
CN117333947B (en) * 2023-10-18 2024-05-10 首都体育学院 Badminton action analysis method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102890781A (en) * 2012-07-04 2013-01-23 北京航空航天大学 Method for identifying wonderful shots as to badminton game video
CN110110646A (en) * 2019-04-30 2019-08-09 浙江理工大学 A kind of images of gestures extraction method of key frame based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10572735B2 (en) * 2015-03-31 2020-02-25 Beijing Shunyuan Kaihua Technology Limited Detect sports video highlights for mobile computing devices

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102890781A (en) * 2012-07-04 2013-01-23 北京航空航天大学 Method for identifying wonderful shots as to badminton game video
CN110110646A (en) * 2019-04-30 2019-08-09 浙江理工大学 A kind of images of gestures extraction method of key frame based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Action recognition of badminton players in sports video; Yang Jing; Techniques of Automation and Applications (No. 10); full text *
Automatic analysis and extraction of highlight scenes in football games; Chen Zhongke et al.; Journal of Computer-Aided Design & Computer Graphics (No. 06); full text *

Also Published As

Publication number Publication date
CN111291617A (en) 2020-06-16

Similar Documents

Publication Publication Date Title
CN111291617B (en) Badminton event video highlight segment extraction method based on machine learning
CN109922373A (en) Method for processing video frequency, device and storage medium
Merler et al. Automatic curation of sports highlights using multimodal excitement features
Chen et al. Ball tracking and 3D trajectory approximation with applications to tactics analysis from single-camera volleyball sequences
Bettadapura et al. Leveraging contextual cues for generating basketball highlights
CN110381366B (en) Automatic event reporting method, system, server and storage medium
Chen et al. Physics-based ball tracking and 3D trajectory reconstruction with applications to shooting location estimation in basketball video
Liu et al. Deep learning based basketball video analysis for intelligent arena application
CN108600865B (en) A kind of video abstraction generating method based on super-pixel segmentation
CN103200463A (en) Method and device for generating video summary
CN1750618A (en) Method of viewing audiovisual documents on a receiver, and receiver for viewing such documents
Chen et al. A trajectory-based ball tracking framework with visual enrichment for broadcast baseball videos
Merler et al. Automatic curation of golf highlights using multimodal excitement features
CN109460724B (en) Object detection-based separation method and system for ball-stopping event
Lai et al. Tennis Video 2.0: A new presentation of sports videos with content separation and rendering
Chen et al. Physics-based ball tracking in volleyball videos with its applications to set type recognition and action detection
Draschkowitz et al. Using video analysis and machine learning for predicting shot success in table tennis
Nieto et al. An automatic system for sports analytics in multi-camera tennis videos
Chu et al. Modeling spatiotemporal relationships between moving objects for event tactics analysis in tennis videos
Chen et al. Motion entropy feature and its applications to event-based segmentation of sports video
Jiang et al. Deep learning application in broadcast tennis video annotation
CN110490064B (en) Sports video data processing method and device, computer equipment and computer storage medium
CN110969133B (en) Intelligent data acquisition method for table tennis game video
Tong et al. Shot classification in sports video
Rowlands et al. Using inertial sensors to index into video

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information
Inventors after: Wang Meili, Luo Jiankun, Tao Shu, Wang Yihan
Inventors before: Wang Meili, Luo Jiankun, Tao Shu, Wang Yihan