CN107481270B - Table tennis target tracking and trajectory prediction method, device, storage medium and computer equipment - Google Patents


Info

Publication number
CN107481270B
CN107481270B (application CN201710682442.2A)
Authority
CN
China
Prior art keywords
tracking target
tracking
target
bounding box
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710682442.2A
Other languages
Chinese (zh)
Other versions
CN107481270A (en)
Inventor
任杰 (Ren Jie)
盛斌 (Sheng Bin)
施之皓 (Shi Zhihao)
张本轩 (Zhang Benxuan)
杨靖 (Yang Jing)
侯爽 (Hou Shuang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Shanghai University of Sport
Original Assignee
Shanghai Jiaotong University
Shanghai University of Sport
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University, Shanghai University of Sport filed Critical Shanghai Jiaotong University
Priority to CN201710682442.2A
Publication of CN107481270A
Application granted
Publication of CN107481270B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion
    • G06T 7/292: Multi-camera tracking
    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/248: Analysis of motion using feature-based methods involving reference images or patches

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a table tennis target tracking and trajectory prediction method, device, storage medium and computer equipment. One frame of image is acquired from each of two cameras shooting the tracking target at the same moment, and a candidate region corresponding to the tracking target is extracted from each image. The candidate region is input into a preset tracking model for processing to obtain the bounding box corresponding to the tracking target. The two-dimensional coordinates of the bounding-box centres obtained from the two cameras at the same moment are then combined with the camera projection matrices to calculate the three-dimensional coordinate of the bounding-box centre at that moment. The three-dimensional coordinates of the bounding boxes at consecutive moments form a continuous coordinate sequence, which is input into a recurrent neural network (an LSTM) to generate the subsequent coordinate sequence, and the trajectory of the tracking target is obtained from the two sequences. By tracking the target position with the preset tracking model and exploiting the LSTM's strength in analysing time-series features, accurate prediction of the tracked target's motion trajectory can be achieved.

Description

Table tennis target tracking and trajectory prediction method, device, storage medium and computer equipment
Technical Field
The invention relates to the technical field of image processing, in particular to a method, a device, a storage medium and computer equipment for table tennis target tracking and trajectory prediction.
Background
In the design of table tennis robot systems, two problems need to be solved. The first is target tracking: given the position of the tracked target in the previous frame, predict where it may appear in the next frame. The second is trajectory prediction: given a short sequence of table tennis ball coordinates, automatically generate the subsequent coordinate sequence.
Target tracking, a classic problem in computer vision, has developed substantially over the last few decades: from early trackers based on pure computer-vision methods, such as the Lucas-Kanade tracker and the mean-shift tracker, to more complex trackers that integrate detection and learning ideas, and to today's deep-learning-based tracking algorithms. The dominant deep learning model currently used for tracking is the CNN, i.e. the convolutional neural network. In a typical CNN-based tracking algorithm, the CNN mainly serves as a feature extractor. However, the bounding box produced by current tracking algorithms is not accurate enough; an inaccurate bounding box not only introduces position error, but also causes the whole tracking framework to drift and even lose the target. Errors in target tracking, in turn, directly cause errors in trajectory prediction.
Disclosure of Invention
In view of the above, it is necessary to provide a table tennis target tracking and trajectory prediction method, apparatus, storage medium and computer device.
A table tennis target tracking and trajectory prediction method, the method comprising:
acquiring a frame of image shot by two cameras for a tracking target at the same moment;
extracting a candidate region corresponding to the tracking target from the image;
inputting the candidate area into a preset tracking model for processing to obtain a bounding box corresponding to the tracking target;
acquiring two-dimensional coordinates of the bounding box center corresponding to the tracking target respectively shot by the two cameras at the same moment, and calculating the three-dimensional coordinates of the bounding box center of the tracking target corresponding to the moment according to the camera projection matrix;
acquiring three-dimensional coordinates corresponding to bounding boxes at continuous moments to form a continuous coordinate sequence, inputting the continuous coordinate sequence into a recurrent neural network (LSTM) for calculation, and generating a subsequent coordinate sequence;
and obtaining the track of the tracking target according to the continuous coordinate sequence and the subsequent coordinate sequence.
In one embodiment, the method further comprises:
inputting the calculated three-dimensional coordinates of the bounding-box centre of the tracking target at that moment into the LSTM for calculation, and predicting the three-dimensional coordinates of the tracking target in the next frame of image shot by the two cameras;
and taking the area containing the three-dimensional coordinates as a candidate area of a tracking target in the next frame of image.
In one embodiment, the inputting the candidate region into a preset tracking model for processing to obtain a bounding box corresponding to the tracking target includes:
inputting the candidate region into a preset convolutional neural network model, and processing to obtain a bounding box of a tracking target in the image;
inputting the bounding box of the tracking target in the image into a preset regression layer to perform regression processing, and obtaining a regressed bounding box corresponding to the tracking target, wherein the preset regression layer comprises a low-layer convolution layer of the preset convolution neural network model.
In one embodiment, the extracting, from the image, a candidate region corresponding to the tracking target includes:
and extracting a candidate region corresponding to the tracking target from the image by adopting a background subtraction method.
In one embodiment, the process of creating the camera projection matrix includes:
respectively establishing a world coordinate system and a camera coordinate system;
acquiring an internal parameter matrix and an external parameter matrix of a camera;
and establishing a camera projection matrix according to the internal parameter matrix and the external parameter matrix, wherein the camera projection matrix can convert the two-dimensional coordinates of the camera coordinate system into the three-dimensional coordinates of the world coordinate system.
A table tennis target tracking and trajectory prediction apparatus, the apparatus comprising:
the camera shooting module is used for acquiring a frame of image shot by the two cameras on the tracking target at the same moment;
the candidate region extraction module of the tracking target is used for extracting a candidate region corresponding to the tracking target from the image;
a tracking target bounding box obtaining module, configured to input the candidate region into a preset tracking model for processing to obtain a bounding box corresponding to the tracking target;
the tracking target bounding box three-dimensional coordinate calculation module is used for acquiring two-dimensional coordinates of the bounding box center corresponding to the tracking target respectively shot by the two cameras at the same moment, and then calculating the three-dimensional coordinates of the bounding box center of the tracking target corresponding to the moment according to the camera projection matrix;
the coordinate sequence generation module is used for acquiring three-dimensional coordinates corresponding to bounding boxes at continuous moments to form a continuous coordinate sequence, inputting the continuous coordinate sequence into a recurrent neural network (LSTM) for calculation, and generating a subsequent coordinate sequence;
and the track generation module of the tracking target is used for obtaining the track of the tracking target according to the continuous coordinate sequence and the subsequent coordinate sequence.
In one embodiment, the apparatus further comprises:
the three-dimensional coordinate prediction module of the tracking target is used for inputting the calculated three-dimensional coordinate of the center of the bounding box of the tracking target corresponding to the moment into the LSTM for calculation and predicting the three-dimensional coordinate of the tracking target in the next frame of image shot by the two cameras;
and the candidate region acquisition module of the tracking target is used for taking the region containing the three-dimensional coordinates as a candidate region of the tracking target in the next frame of image.
In one embodiment, the tracked target bounding box obtaining module includes:
the convolutional neural network module is used for inputting the candidate region into a preset convolutional neural network model and processing the candidate region to obtain a bounding box of a tracking target in the image;
and the regression layer module is used for inputting the bounding box of the tracking target in the image into a preset regression layer to perform regression processing to obtain a regressed bounding box corresponding to the tracking target, wherein the preset regression layer comprises a low-layer convolution layer of the preset convolution neural network model.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring a frame of image shot by two cameras for a tracking target at the same moment;
extracting a candidate region corresponding to the tracking target from the image;
inputting the candidate area into a preset tracking model for processing to obtain a bounding box corresponding to the tracking target;
acquiring two-dimensional coordinates of the bounding box center corresponding to the tracking target respectively shot by the two cameras at the same moment, and calculating the three-dimensional coordinates of the bounding box center of the tracking target corresponding to the moment according to the camera projection matrix;
acquiring three-dimensional coordinates corresponding to bounding boxes at continuous moments to form a continuous coordinate sequence, inputting the continuous coordinate sequence into a recurrent neural network (LSTM) for calculation, and generating a subsequent coordinate sequence;
and obtaining the track of the tracking target according to the continuous coordinate sequence and the subsequent coordinate sequence.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring a frame of image shot by two cameras for a tracking target at the same moment;
extracting a candidate region corresponding to the tracking target from the image;
inputting the candidate area into a preset tracking model for processing to obtain a bounding box corresponding to the tracking target;
acquiring two-dimensional coordinates of the bounding box center corresponding to the tracking target respectively shot by the two cameras at the same moment, and calculating the three-dimensional coordinates of the bounding box center of the tracking target corresponding to the moment according to the camera projection matrix;
acquiring three-dimensional coordinates corresponding to bounding boxes at continuous moments to form a continuous coordinate sequence, inputting the continuous coordinate sequence into a recurrent neural network (LSTM) for calculation, and generating a subsequent coordinate sequence;
and obtaining the track of the tracking target according to the continuous coordinate sequence and the subsequent coordinate sequence.
According to the table tennis target tracking and trajectory prediction method, device, storage medium and computer equipment, one frame of image is acquired from each of the two cameras shooting the tracked target at the same moment, the candidate region corresponding to the tracked target is extracted from each image, and the candidate region is input into the CNN model for processing to obtain the bounding box corresponding to the tracked target. Because the two cameras shoot simultaneously, two images, and therefore two bounding boxes, are obtained at each moment; the two-dimensional coordinates of the two bounding-box centres at the same moment are combined with the camera projection matrices to calculate the three-dimensional coordinate of the tracked target's bounding box at that moment. The three-dimensional coordinates at consecutive moments form a continuous coordinate sequence, which is input into the LSTM for calculation to generate the subsequent coordinate sequence, and the complete trajectory of the tracked target is obtained from the two sequences. Accurate target position tracking by the CNN model, combined with the LSTM's strength in analysing time-series features, enables accurate prediction of the tracked target's motion trajectory.
Drawings
FIG. 1 is a diagram of an exemplary table tennis target tracking and trajectory prediction method;
FIG. 2 is an internal block diagram of a server in one embodiment;
FIG. 3 is a flow diagram of a table tennis target tracking and trajectory prediction method in one embodiment;
FIG. 4 is a flow diagram of a table tennis target tracking and trajectory prediction method in one embodiment;
FIG. 5 is a flow chart of the method of FIG. 4 for obtaining bounding boxes;
FIG. 6 is a flow diagram of a method for building a camera projection matrix in one embodiment;
FIG. 7 is a schematic diagram of a table tennis target tracking and trajectory prediction device in one embodiment;
FIG. 8 is a schematic diagram of a table tennis target tracking and trajectory prediction device in one embodiment;
FIG. 9 is a schematic diagram of the tracking target bounding box acquisition module of FIG. 7;
FIG. 10 is a schematic structural diagram of a table tennis target tracking and trajectory prediction device in one embodiment.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments are described in detail below with reference to the accompanying figures. Numerous specific details are set forth in the following description to provide a thorough understanding of the present invention. The invention may, however, be embodied in many different forms and is not limited to the embodiments set forth herein; modifications may be made in various respects without departing from the spirit and scope of the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for describing particular embodiments only and is not intended to limit the invention. The technical features of the above embodiments can be combined arbitrarily; for brevity, not all possible combinations are described, but any combination that contains no contradiction should be considered within the scope of this specification.
In recent years, with the development and gradual maturation of computer vision technology, concrete applications of computers in the field of sports have continually emerged. In table tennis, the ball is tracked in each frame of image shot by the camera so that its position can be recorded and its motion trajectory predicted. Because a table tennis ball is small, has few visual features and moves fast, the tracking and prediction algorithms need to be specially designed to meet these requirements.
The table tennis target tracking and trajectory prediction method provided by the embodiment of the invention needs to be used under a specific actual environment configuration. As shown in fig. 1, assume that a player is playing, and cameras are used to photograph the table tennis ball 110 and the table tennis table 120. Specifically, two high-speed cameras are arranged on one side of the table, connected in parallel through a hardware trigger; they shoot the table area synchronously and remain stationary throughout. To ensure that three-dimensional coordinates can later be calculated accurately, the two cameras must be of the same model and specification. A world coordinate system is established with one corner of the table as the origin: the x-axis runs along the table's base line, the y-axis along the table's side line, and the z-axis perpendicular to the table. The projection matrices of the two cameras must be calculated in advance, so that three-dimensional coordinates can be computed from the two-dimensional coordinates obtained by the two cameras, or three-dimensional coordinates can be projected onto a camera's image plane.
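As a sketch of this coordinate setup, the mapping from the table world coordinate system to a camera image can be written as a 3x4 projection matrix P = K[R|t]. The intrinsic and extrinsic values below are invented examples, not calibration results from the patent.

```python
import numpy as np

# Pinhole projection sketch: P = K [R | t] maps homogeneous world
# coordinates (in the table coordinate system described above) to
# homogeneous pixel coordinates. K, R, t are made-up example values.
K = np.array([[1000.0, 0.0, 640.0],   # fx, skew, cx
              [0.0, 1000.0, 360.0],   # fy, cy
              [0.0, 0.0, 1.0]])
R = np.eye(3)                          # camera axes aligned with world axes
t = np.array([[0.0], [0.0], [3.0]])    # camera 3 m in front of the origin

P = K @ np.hstack([R, t])              # 3x4 projection matrix

def project(P, X):
    """Project a 3-D world point X (metres) to pixel coordinates."""
    Xh = np.append(X, 1.0)             # homogeneous coordinates
    x = P @ Xh
    return x[:2] / x[2]                # perspective divide

uv = project(P, np.array([0.0, 0.0, 0.0]))  # the table corner (origin)
print(uv)  # lands on the principal point: [640. 360.]
```

With two such matrices, one per camera, a 3-D point can be recovered from a pair of simultaneous 2-D observations, as the method requires.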
In one embodiment, as shown in fig. 2, there is also provided a server comprising a processor, a non-volatile storage medium, an internal memory and a network interface connected by a system bus. The non-volatile storage medium stores an operating system and a table tennis target tracking and trajectory prediction apparatus for performing the table tennis target tracking and trajectory prediction method. The processor provides computing and control capability and supports the operation of the whole server. The internal memory provides an environment for the operation of the apparatus in the non-volatile storage medium and may store computer-readable instructions that, when executed by the processor, cause the processor to perform the table tennis target tracking and trajectory prediction method. The network interface receives videos and the like containing the tracking target.
In one embodiment, as shown in fig. 3, a table tennis target tracking and trajectory prediction method is provided, which is described by taking the application scenario in fig. 1 as an example, and includes:
step 310, acquiring a frame of image shot by the two cameras respectively to the tracking target at the same time.
Two high-speed cameras are arranged on one side of the table tennis table to synchronously shoot the table area. The tracking target is a moving table tennis ball, and at each moment, one frame of image is respectively obtained from the two cameras.
In step 320, a candidate region corresponding to the tracking target is extracted from the image.
Before the tracked target can be tracked in the image, an initial bounding box of the target must be obtained; that is, the tracked target, the table tennis ball, must first be detected in the image. The framework first looks for regions where the ball may appear. Because the cameras are fixed and the scene contains few other moving objects, foreground regions can be extracted from the images shot by the two cameras, for example by background subtraction; this narrows the search range, and these regions are used as the candidate regions corresponding to the tracking target.
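A minimal background-subtraction sketch of this step, using synthetic arrays in place of real camera frames; the 0.5 threshold and 10-pixel padding are assumptions, not values from the patent.

```python
import numpy as np

# With fixed cameras, a candidate region can be found by differencing the
# current frame against a stored background image and bounding the changed
# pixels. The frames below are synthetic stand-ins for camera images.
background = np.zeros((120, 160), dtype=np.float32)   # empty table scene
frame = background.copy()
frame[40:48, 70:78] = 1.0                             # a small bright "ball"

diff = np.abs(frame - background)
mask = diff > 0.5                                     # foreground mask

ys, xs = np.nonzero(mask)
# Bounding box of the foreground pixels, padded to form a candidate region.
pad = 10
y0, y1 = max(ys.min() - pad, 0), min(ys.max() + pad, mask.shape[0])
x0, x1 = max(xs.min() - pad, 0), min(xs.max() + pad, mask.shape[1])
candidate = frame[y0:y1, x0:x1]
print(candidate.shape)  # (27, 27)
```

In practice a running background model (rather than a single static frame) would be used, but the narrowing of the search range works the same way.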
And 330, inputting the candidate area into a preset tracking model for processing to obtain a bounding box corresponding to the tracking target.
The preset tracking model comprises a preset convolutional neural network model and a preset regression layer. The Convolutional Neural Network (CNN) is one of the most representative network structures in deep learning. The preset convolutional neural network model is obtained by training a convolutional neural network in advance on a labelled training set, and comprises convolutional layers, pooling layers and fully-connected layers. The preset regression layer comprises a fully-connected layer, a region-of-interest pooling layer and a low-level convolutional layer of the preset convolutional neural network model. The preset regression layer is built by passing another labelled training set through the already-trained preset convolutional neural network model and then training the regression layer on the resulting output.
The candidate regions extracted from each image are input into the preset tracking model to obtain the bounding box corresponding to the tracking target. Specifically, the candidate regions are input into the CNN in turn for target detection, and a probability value is output indicating whether each candidate region contains the target. If no target was found in the previous frame, all candidate regions must be input into the CNN for detection. If the target was found in the previous frame, only the candidate regions near its previous position are input into the CNN, which avoids unnecessary computation and improves efficiency.
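The search-range gating described above can be sketched as plain selection logic in front of the CNN; the (x, y, w, h) box format and the 50-pixel radius are assumptions for illustration.

```python
# Only candidate regions near the previous detection are passed to the CNN;
# when the target was lost, every candidate is tested.

def center(box):
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def select_candidates(candidates, prev_center=None, radius=50.0):
    """Return the candidate boxes worth sending to the CNN detector."""
    if prev_center is None:              # target lost: test everything
        return list(candidates)
    px, py = prev_center
    kept = []
    for box in candidates:
        cx, cy = center(box)
        if (cx - px) ** 2 + (cy - py) ** 2 <= radius ** 2:
            kept.append(box)
    return kept

boxes = [(0, 0, 20, 20), (100, 100, 20, 20), (110, 95, 20, 20)]
print(select_candidates(boxes, prev_center=(112, 108)))
# [(100, 100, 20, 20), (110, 95, 20, 20)]
```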
And 340, acquiring two-dimensional coordinates of the centers of the bounding boxes corresponding to the tracked targets respectively shot by the two cameras at the same moment, and calculating the three-dimensional coordinates of the centers of the bounding boxes of the tracked targets corresponding to the moment according to the camera projection matrix.
After the preset tracking model processing, the bounding boxes corresponding to the tracking targets in the images respectively shot by the two cameras at the same moment are obtained, and then the two-dimensional coordinates of the centers of the bounding boxes are obtained according to the bounding boxes. And then, calculating the three-dimensional coordinate of the center of the bounding box of the tracking target corresponding to the moment in a world coordinate system according to the camera projection matrix. Wherein the camera projection matrix is pre-calculated. Specifically, a world coordinate system and a camera coordinate system are respectively established in advance, and then an internal parameter matrix and an external parameter matrix of the camera are obtained. And establishing a camera projection matrix according to the internal parameter matrix and the external parameter matrix, wherein the camera projection matrix can convert the two-dimensional coordinates of the camera coordinate system into the three-dimensional coordinates of the world coordinate system.
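The two-view computation in step 340 can be sketched with standard linear (DLT) triangulation, assuming the two projection matrices are known from calibration; the matrices below are toy examples, not the patent's calibration results.

```python
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """Least-squares 3-D point from two pixel observations (DLT)."""
    A = np.vstack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                          # null-space vector of A
    return X[:3] / X[3]                 # dehomogenise

K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), [[0.0], [0.0], [2.0]]])    # camera 1
P2 = K @ np.hstack([np.eye(3), [[-0.5], [0.0], [2.0]]])   # camera 2, shifted

def proj(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

X_true = np.array([0.3, 0.1, 0.5])      # a hypothetical ball position (m)
X_hat = triangulate(P1, P2, proj(P1, X_true), proj(P2, X_true))
print(np.round(X_hat, 6))  # recovers [0.3 0.1 0.5]
```

The two `uv` inputs correspond to the bounding-box centres obtained from the two cameras at the same moment.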
And 350, acquiring three-dimensional coordinates corresponding to the bounding boxes at continuous moments to form a continuous coordinate sequence, inputting the continuous coordinate sequence into a recurrent neural network (LSTM) for calculation, and generating a subsequent coordinate sequence.
The images shot at each moment are processed to obtain the bounding boxes corresponding to the tracking target at consecutive moments, and then the three-dimensional coordinates of the bounding-box centres. These three-dimensional coordinates form, in temporal order, a continuous coordinate sequence, which is input into the recurrent neural network (LSTM) to automatically generate the subsequent coordinate sequence. LSTM (Long Short-Term Memory) here refers to a bidirectional long short-term memory network model, a recurrent neural network for time series; the bidirectional model comprises a forward and a backward long short-term memory network model.
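As a minimal illustration of this data flow, the sketch below runs a single unidirectional numpy LSTM cell over an observed coordinate sequence and then rolls it forward autoregressively; the weights are random, so the generated coordinates are meaningless, and the patent's trained bidirectional model is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 3, 16                              # coordinate dim, hidden size
Wx = rng.normal(0, 0.1, (4 * H, D))       # input weights for i, f, g, o gates
Wh = rng.normal(0, 0.1, (4 * H, H))       # recurrent weights
b = np.zeros(4 * H)
Wy = rng.normal(0, 0.1, (D, H))           # hidden -> coordinate readout

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c):
    z = Wx @ x + Wh @ h + b
    i, f, g, o = z[:H], z[H:2*H], z[2*H:3*H], z[3*H:]
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # cell state update
    h = sigmoid(o) * np.tanh(c)                    # hidden state
    return h, c

observed = rng.normal(0, 1, (20, D))      # continuous 3-D coordinate sequence
h, c = np.zeros(H), np.zeros(H)
for x in observed:                        # encode the observed trajectory
    h, c = lstm_step(x, h, c)

generated = []
x = observed[-1]
for _ in range(10):                       # autoregressive rollout
    h, c = lstm_step(x, h, c)
    x = Wy @ h                            # predicted next 3-D coordinate
    generated.append(x)
generated = np.array(generated)
print(generated.shape)  # (10, 3)
```

The generated array plays the role of the "subsequent coordinate sequence" that, together with the observed sequence, forms the predicted trajectory.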
And step 360, obtaining the track of the tracking target according to the continuous coordinate sequence and the subsequent coordinate sequence.
The subsequent coordinate sequence is obtained by inputting the continuous coordinate sequence into the LSTM for calculation. Together, the continuous coordinate sequence and the subsequent coordinate sequence form the trajectory of the tracking target, from which trajectory prediction and landing-point prediction can be carried out for a tracked target such as a table tennis ball.
In this embodiment, one frame of image is acquired from each of the two cameras shooting the tracking target at the same moment, the candidate region corresponding to the tracking target is extracted from each image, and the candidate region is input into the preset tracking model for processing to obtain the bounding box corresponding to the tracking target. Because the two cameras shoot simultaneously, two images, and therefore two bounding boxes, are obtained at each moment; the two-dimensional coordinates of the two bounding-box centres at the same moment are combined with the camera projection matrices to calculate the three-dimensional coordinate of the tracking target's bounding box at that moment. The three-dimensional coordinates at consecutive moments form a continuous coordinate sequence, which is input into the LSTM for calculation to generate the subsequent coordinate sequence, and the complete trajectory of the tracking target is obtained from the two sequences. The method tracks the target position accurately with the preset tracking model and, combined with the LSTM's strength in analysing time-series features, achieves accurate prediction of the tracked target's motion trajectory.
In one embodiment, as shown in fig. 4, a table tennis target tracking and trajectory prediction method further comprises:
step 370, inputting the three-dimensional coordinates of the bounding box center of the tracking target corresponding to the calculated time into the LSTM for calculation, and predicting the three-dimensional coordinates of the tracking target in the next frame of image captured by the two cameras.
The LSTM can also serve as a model for tracking the target: the coordinates obtained by processing the currently captured images are input into the LSTM to predict the coordinates of the tracked target's bounding box in the next frame. Specifically, at each step the obtained coordinates are fed into the LSTM model, which is made to output the parameters of a Gaussian mixture model with a single component; similarly to a Kalman filter, this predicts where the tracking target may appear in the next frame and thus narrows the tracking search range.
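Assuming the model outputs the mean and a diagonal standard deviation of that single Gaussian component over the next position, the predicted distribution can be turned into a reduced search region of mean plus or minus k sigma; every number below is invented for illustration.

```python
import numpy as np

mean = np.array([1.20, 0.40, 0.30])       # predicted next ball position (m)
sigma = np.array([0.05, 0.05, 0.08])      # predicted per-axis uncertainty
k = 3.0                                   # covers ~99.7% of the Gaussian

lower = mean - k * sigma                  # search-region bounds
upper = mean + k * sigma

def in_search_region(p):
    """True if a 3-D point falls inside the predicted search region."""
    return bool(np.all(p >= lower) and np.all(p <= upper))

print(in_search_region(np.array([1.22, 0.38, 0.35])))  # True
print(in_search_region(np.array([2.00, 0.40, 0.30])))  # False
```

Candidate detections outside this region can be discarded cheaply, which is the search-range reduction the passage describes.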
And step 380, taking the area containing the three-dimensional coordinates as a candidate area of the tracking target in the next frame image.
The region containing the three-dimensional coordinates predicted by the LSTM is taken as the candidate region for the tracking target in the next frame image. When occlusion occurs, or the tracking target moves so fast that the preset tracking model may lose it, the LSTM's prediction is used as the tracking result, so the whole tracking framework can continue to work instead of failing because the preset tracking model has lost the target.
In this embodiment, from the continuous coordinate sequence of the tracking target calculated from the previous frames, the LSTM model yields not only the subsequent coordinate sequence, and hence the complete trajectory of the tracking target, but also a prediction of the target's position in the next frame image. This reduces the search range of the preset tracking model and compensates for its tendency to lose the target when the tracked target is occluded or moving too fast.
In an embodiment, as shown in fig. 5, inputting the candidate region into a preset tracking model to process the candidate region to obtain a bounding box corresponding to the tracking target, includes:
and step 331, inputting the candidate region into a preset convolutional neural network model, and processing to obtain a bounding box of the tracking target in the image.
The preset tracking model comprises a preset convolutional neural network model and a preset regression layer. Specifically, a 100 × 100-pixel candidate region extracted from the image is input into the preset convolutional neural network model for convolution operations. The convolutional layers in the model are pre-trained using CaffeNet. Several convolution operations may be applied to extract a feature map of the candidate region.
A pooling layer is arranged above the convolutional layers; the extracted feature map is input into the pooling layer and compressed, yielding a compressed feature map. In particular, the pooling layer may be a spatial pyramid pooling layer, which retains more location information.
The compressed feature map from the pooling layer is passed through two fully connected layers, and the output 2500-dimensional vector is reshaped into a 50 × 50 matrix, i.e., a 50 × 50 probability map. Each element of the matrix is a probability value representing the probability that the pixel at the corresponding location in the input image belongs to the tracked target. For an image containing the tracked target, the output is typically a connected region whose probability values are significantly higher than those outside it. A bounding box can therefore be computed by thresholding: pixels whose probability exceeds a chosen value lie inside the box, and this bounding box serves as the prediction of the target location.
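The thresholding step can be sketched as follows, assuming a NumPy probability map; the threshold value 0.5 is an assumption, since the patent does not specify it.

```python
import numpy as np

def prob_map_to_bbox(prob_map, thresh=0.5):
    """Threshold a probability map (e.g. the 50x50 map described above)
    and return the tight bounding box (x0, y0, x1, y1) enclosing all
    pixels above the threshold, or None if no pixel qualifies."""
    ys, xs = np.nonzero(prob_map > thresh)
    if len(xs) == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# A toy 50x50 map with one high-probability connected blob.
m = np.zeros((50, 50))
m[20:25, 30:36] = 0.9
print(prob_map_to_bbox(m))  # (30, 20, 35, 24)
```

The resulting box is only the coarse prediction; the regression layer described below refines it.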
The preset convolutional neural network model is established as follows: acquire a convolutional neural network training set for modeling, the training set comprising images that contain the target and images that do not, all taken from videos containing the target; label the images, setting the values inside the actual bounding box of the target to a first value and the values outside it to a second value; input the training set into a convolutional neural network with initialized network parameters and train it to obtain the bounding box of the target in each image; compute the network parameters of the trained convolutional neural network from the predicted bounding box, the labeled actual bounding box, and a Softmax loss function; and obtain the preset convolutional neural network model from these network parameters.
The preset regression layer is established as follows: acquire a regression layer training set for modeling, the training set comprising images containing the target, taken from videos containing the target; label each image with the size of the actual bounding box of the target; input the training set into the preset convolutional neural network model to obtain the bounding box of the target in each image; input that bounding box into a regression layer with initialized network parameters for regression processing, obtaining the regressed bounding box corresponding to the target; compute the network parameters of the trained regression layer from the size of the regressed bounding box, the size of the labeled actual bounding box, and a smooth L1 loss function; and obtain the preset regression layer from these network parameters.
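The smooth L1 loss used to train the regression layer can be written out as below. This is the generic formulation (quadratic for small residuals, linear for large ones); the patent names the loss but does not spell out its exact form, so treat the transition point of 1.0 as an assumption.

```python
import numpy as np

def smooth_l1(pred, target):
    """Smooth L1 loss for bounding-box regression, summed over the
    coordinates: 0.5*d^2 where |d| < 1, and |d| - 0.5 otherwise."""
    d = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    per_coord = np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)
    return float(per_coord.sum())

# Small residual (0.5) handled quadratically, large one (3.0) linearly.
print(smooth_l1([0.5, 3.0], [0.0, 0.0]))  # 0.125 + 2.5 = 2.625
```

Compared with a plain L2 loss, the linear tail keeps large localization errors from dominating the gradient early in training.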
And 333, inputting the bounding box of the tracking target in the image into a preset regression layer to perform regression processing to obtain a regressed bounding box corresponding to the tracking target, wherein the preset regression layer comprises a low-layer convolution layer of a preset convolution neural network model.
The regression layer follows the preset convolutional neural network model. From bottom to top, it consists of a low-layer convolution layer of the preset convolutional neural network model, a region-of-interest pooling layer, and a fully connected layer. The candidate region of the target is passed through the preset convolutional neural network model to obtain the bounding box of the target; the bounding box is then projected onto the low-layer convolution layer of the model and convolved to obtain a feature map of the target.
That is, the target feature map obtained in the previous step is input into the region-of-interest pooling layer for feature compression, yielding a compressed feature map. Specifically, the bounding box is cropped out of the feature map of the low-layer convolution layer and scaled to a new 7 × 7 feature map.
The compressed feature map is then input into a fully connected layer, which computes the displacement in the x and y directions and the scaling of the width and height between the bounding box computed by the CNN and the regressed bounding box; the regressed bounding box is obtained by applying this displacement and scaling to the CNN bounding box. Specifically, a fully connected layer is added on top of the feature map, and because the positional accuracy of the chosen convolutional layer must not be too low, this embodiment crops on conv-1 (the first convolutional layer). The output of the fully connected layer is 4 real numbers, representing the x and y displacements and the width and height scalings between the regressed bounding box and the bounding box computed by the CNN. The CNN bounding box is thereby corrected and fine-tuned, and the bounding box corresponding to the tracking target is finally obtained.
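Applying the 4 regression outputs to the CNN bounding box can be sketched as below. The patent only says the outputs are x/y displacements and width/height scalings, so the exact parameterization here, the common Fast R-CNN convention (shift by a fraction of the box size, scale by an exponential), is an assumption.

```python
import math

def apply_regression(box, deltas):
    """Apply regression outputs (dx, dy, sw, sh) to a box given as
    (cx, cy, w, h): center is shifted by a fraction of the box size,
    width/height are scaled by exp() so they stay positive.
    Parameterization is assumed, not specified by the patent."""
    cx, cy, w, h = box
    dx, dy, sw, sh = deltas
    return (cx + dx * w, cy + dy * h, w * math.exp(sw), h * math.exp(sh))

# Zero deltas leave the CNN box unchanged; nonzero deltas fine-tune it.
print(apply_regression((100.0, 80.0, 20.0, 20.0), (0.0, 0.0, 0.0, 0.0)))
```

Under this convention the network learns small, roughly scale-invariant corrections rather than absolute pixel coordinates.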
In this embodiment, the bounding box produced by the preset convolutional neural network model is input into the preset regression layer for regression processing. Because the preset regression layer includes a low-layer convolution layer of the preset convolutional neural network model, the semantic information of the higher convolutional layers (target type, etc.) and the position information of the lower convolutional layers can be taken into account simultaneously, so the target in the input image can be correctly identified and its bounding box accurately given. Finally, the regression layer computes the x and y displacements and the width and height scalings between the regressed bounding box and the bounding box computed by the CNN, correcting the CNN bounding box. The regressed bounding box is therefore more accurate, and the whole tracking framework is effectively prevented from drifting or even losing the target.
In one embodiment, extracting a candidate region corresponding to a tracking target from an image includes: and extracting a candidate region corresponding to the tracking target from the image by adopting a background subtraction method.
In general, in an object tracking task, the tracking algorithm assumes that an initial bounding box of the object to be tracked is given in the first frame image. When the tracking algorithm is actually deployed, therefore, obtaining this initial bounding box, that is, detecting the tracking target in the image, is the first problem to be solved. This embodiment first searches for regions where the target table tennis ball may appear. Because the cameras are fixed and the scene contains few moving objects, the foreground can be extracted by background subtraction or similar methods, narrowing the search range. These regions serve as target candidate regions and are then input into the preset tracking model for processing. Notably, the system does not rely heavily on background subtraction: it is only needed to find the initial position of the target in the first few frames, after which, for example, the LSTM model can be used for target tracking. Background subtraction can of course still be used as an aid in the main part of the algorithm to provide candidate regions for tracking.
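A minimal background-subtraction sketch, assuming grayscale frames and a static background model, is shown below; the threshold is an assumption, and a real system might instead use a learned subtractor such as OpenCV's MOG2.

```python
import numpy as np

def foreground_mask(frame, background, thresh=25):
    """Mark as foreground every pixel whose absolute difference from a
    static background frame exceeds `thresh`.  Frames are assumed to be
    grayscale uint8 arrays of the same shape."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff > thresh

# A fixed camera sees an empty scene; one bright "ball" pixel appears.
bg = np.zeros((4, 4), dtype=np.uint8)
fr = bg.copy()
fr[1, 2] = 200
print(foreground_mask(fr, bg).sum())  # 1 foreground pixel
```

Connected foreground pixels would then be grouped into candidate regions for the preset tracking model.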
In this embodiment, the search range can be quickly narrowed down by using the background subtraction method, so that the bounding box coordinates of the target in the next frame can be predicted according to the bounding box coordinates of the target in the previous frames of images by using the preset tracking model or the preset LSTM model.
In one embodiment, as shown in fig. 6, the process of building a camera projection matrix includes:
and step 610, respectively establishing a world coordinate system and a camera coordinate system.
A world coordinate system is established with one corner of the table as the origin. Wherein the x-axis is along the table bottom line, the y-axis is along the table side line, and the z-axis is perpendicular to the table. And establishing a camera coordinate system by taking the camera as an origin. Of course, the camera coordinate system may be established in other ways.
Step 630, obtain the intrinsic parameter matrix and the extrinsic parameter matrix of the camera.
First, using the chessboard calibration method, chessboard pictures are taken from multiple angles, and the internal parameter matrix M3×3 of the camera, together with its distortion coefficients, is obtained using the built-in calibration function of OpenCV. The internal parameter matrix converts three-dimensional coordinates in the camera coordinate system into two-dimensional coordinates on the camera plane:
Zc [u, v, 1]^T = M3×3 [Xc, Yc, Zc]^T
Then, the table tennis table area in the image is identified by its color features, and the boundary lines of the table area are obtained by Hough transform. The coordinates of the four corners of the table are obtained from the intersections of the boundary lines, and the external parameter matrix from the camera to the table is then computed, comprising a rotation matrix R3×3 and a displacement matrix T3×1. The external parameter matrix transforms between the camera coordinate system and the world coordinate system:
[Xc, Yc, Zc]^T = R3×3 [Xw, Yw, Zw]^T + T3×1
and 650, establishing a camera projection matrix according to the internal parameter matrix and the external parameter matrix, wherein the camera projection matrix can convert the two-dimensional coordinates of the camera coordinate system into the three-dimensional coordinates of the world coordinate system.
Finally, combining the above two equations: given the two-dimensional coordinates of a point on the two camera planes, its three-dimensional coordinates are calculated from the following equation.
Zc [u, v, 1]^T = M3×3 (R3×3 [Xw, Yw, Zw]^T + T3×1)
Here Zc, the Z coordinate of the point in the camera coordinate system, is unknown; u and v are the coordinates of the point on the camera plane; R is the rotation matrix and T the displacement matrix; and Xw, Yw, Zw are the three-dimensional coordinates in the world coordinate system to be solved. There are four unknowns in total, and each of the two cameras provides one equation of the above form, so the system can be solved directly using linear algebra.
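The linear-algebra step can be illustrated with the sketch below: each camera's projection equation is rearranged to eliminate the unknown depth, the rows are stacked, and the world point is recovered by least squares. The projection matrices, intrinsics, and the test point are synthetic values made up for the example, not the patent's calibration data.

```python
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """Recover world coordinates from two 3x4 projection matrices
    P = M [R|T] and one pixel observation per camera.  Multiplying the
    projection equation out and eliminating the depth gives two linear
    rows per camera in (Xw, Yw, Zw); solve the stacked system."""
    rows = []
    for P, (u, v) in ((P1, uv1), (P2, uv2)):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.asarray(rows)                       # 4 equations, cols [x y z 1]
    X, *_ = np.linalg.lstsq(A[:, :3], -A[:, 3], rcond=None)
    return X

def proj(P, X):
    """Project a world point through P and dehomogenize to pixels."""
    h = P @ np.append(X, 1.0)
    return h[:2] / h[2]

# Synthetic check: two cameras with the same (made-up) intrinsics,
# offset by a 1 m baseline; project a known point, then recover it.
M = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
Rt1 = np.hstack([np.eye(3), [[0.0], [0.0], [5.0]]])
Rt2 = np.hstack([np.eye(3), [[-1.0], [0.0], [5.0]]])
P1, P2 = M @ Rt1, M @ Rt2
Xw = np.array([0.3, -0.2, 1.0])
print(np.round(triangulate(P1, P2, proj(P1, Xw), proj(P2, Xw)), 6))
```

With noise-free observations the stacked system is exactly consistent, so the least-squares solution reproduces the original point; with real detections it gives the best-fit intersection of the two viewing rays.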
In this embodiment, the camera projection matrix is computed in advance, so that during table tennis target tracking and trajectory prediction the two-dimensional coordinates of the camera coordinate system can be converted directly into three-dimensional coordinates of the world coordinate system. The whole system then works in a single world coordinate system, which is convenient and fast.
In one embodiment, as shown in FIG. 7, there is provided a table tennis target tracking and trajectory prediction apparatus 700, comprising: a camera shooting module 710, a tracking target candidate region extraction module 720, a tracking target bounding box acquisition module 730, a tracking target bounding box three-dimensional coordinate calculation module 740, a coordinate sequence generation module 750, and a tracking target track generation module 760.
The camera shooting module 710 is configured to obtain one frame of image shot by two cameras at the same time on the tracking target.
And the tracking target candidate region extraction module 720 is configured to extract a candidate region corresponding to the tracking target from the image.
And a tracked target bounding box obtaining module 730, configured to input the candidate region into a preset tracking model to be processed, so as to obtain a bounding box corresponding to the tracked target.
And the tracking target bounding box three-dimensional coordinate calculation module 740 is configured to obtain two-dimensional coordinates of the centers of bounding boxes corresponding to the tracking targets respectively photographed by the two cameras at the same time, and calculate three-dimensional coordinates of the centers of the bounding boxes of the tracking targets corresponding to the time according to the camera projection matrix.
And a coordinate sequence generation module 750, configured to obtain three-dimensional coordinates corresponding to bounding boxes at consecutive times, form a consecutive coordinate sequence, input the consecutive coordinate sequence into the recurrent neural network LSTM, and perform calculation to generate a subsequent coordinate sequence.
And a track generation module 760 for tracking the target, configured to obtain a track of the tracking target according to the continuous coordinate sequence and the subsequent coordinate sequence.
In one embodiment, as shown in fig. 8, a table tennis target tracking and trajectory prediction apparatus 700 further comprises: a three-dimensional coordinate prediction module 770 of the tracked target and a candidate region acquisition module 780 of the tracked target.
And a tracking target three-dimensional coordinate prediction module 770, configured to input the calculated three-dimensional coordinates of the bounding box center of the tracking target corresponding to the time into the LSTM for calculation, and predict the three-dimensional coordinates of the tracking target in the next frame of image captured by the two cameras.
And a candidate region of tracking target obtaining module 780, configured to use a region including three-dimensional coordinates as a candidate region of tracking target in the next frame image.
In one embodiment, as shown in fig. 9, the tracked target bounding box acquisition module 730 includes: a convolutional neural network module 731 and a regression layer module 733.
And the convolutional neural network module 731 is configured to input the candidate region into a preset convolutional neural network model, and process the candidate region to obtain a bounding box of the tracking target in the image.
The regression layer module 733, configured to input the bounding box of the tracking target in the image into a preset regression layer, and perform regression processing on the bounding box to obtain a regressed bounding box corresponding to the tracking target, where the preset regression layer includes a low-layer convolution layer of a preset convolutional neural network model.
In one embodiment, the tracking target candidate region extraction module 720 is further configured to: extract a candidate region corresponding to the tracking target from the image by a background subtraction method.
In one embodiment, as shown in fig. 10, a table tennis target tracking and trajectory predicting apparatus 700 further comprises a camera projection matrix establishing module 790, the camera projection matrix establishing module 790 is used for establishing a world coordinate system and a camera coordinate system respectively; acquiring an internal parameter matrix and an external parameter matrix of a camera; and establishing a camera projection matrix according to the internal parameter matrix and the external parameter matrix, wherein the camera projection matrix can convert the two-dimensional coordinates of the camera coordinate system into the three-dimensional coordinates of the world coordinate system.
In one embodiment, there is also provided a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of: acquiring a frame of image shot by two cameras for a tracking target at the same moment; extracting a candidate region corresponding to a tracking target from the image; inputting the candidate area into a preset tracking model for processing to obtain a bounding box corresponding to a tracking target; acquiring two-dimensional coordinates of the centers of bounding boxes corresponding to the tracked targets respectively shot by the two cameras at the same moment, and calculating the three-dimensional coordinates of the centers of the bounding boxes of the tracked targets corresponding to the moment according to the projection matrix of the cameras; acquiring three-dimensional coordinates corresponding to bounding boxes at continuous moments to form a continuous coordinate sequence, inputting the continuous coordinate sequence into a recurrent neural network (LSTM) for calculation, and generating a subsequent coordinate sequence; and obtaining the track of the tracking target according to the continuous coordinate sequence and the subsequent coordinate sequence.
In one embodiment, the program further implements the following steps when executed by the processor: inputting the calculated three-dimensional coordinates of the center of the bounding box of the tracking target corresponding to the moment into the LSTM (Long Short-Term Memory network) for calculation, and predicting the three-dimensional coordinates of the tracking target in the next frame of image shot by the two cameras; and taking the area containing the three-dimensional coordinates as a candidate area of the tracking target in the next frame of image.
In one embodiment, the program further implements the following steps when executed by the processor: inputting the candidate area into a preset convolutional neural network model, and processing to obtain a bounding box of a tracking target in the image; and inputting the bounding box of the tracking target in the image into a preset regression layer to perform regression processing to obtain a regressed bounding box corresponding to the tracking target, wherein the preset regression layer comprises a low-layer convolution layer of a preset convolution neural network model.
In one embodiment, the program further implements the following steps when executed by the processor: and extracting a candidate region corresponding to the tracking target from the image by adopting a background subtraction method.
In one embodiment, the program further implements the following steps when executed by the processor: respectively establishing a world coordinate system and a camera coordinate system; acquiring an internal parameter matrix and an external parameter matrix of a camera; and establishing a camera projection matrix according to the internal parameter matrix and the external parameter matrix, wherein the camera projection matrix can convert the two-dimensional coordinates of the camera coordinate system into the three-dimensional coordinates of the world coordinate system.
In one embodiment, there is also provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring a frame of image shot by two cameras for a tracking target at the same moment; extracting a candidate region corresponding to a tracking target from the image; inputting the candidate area into a preset tracking model for processing to obtain a bounding box corresponding to a tracking target; acquiring two-dimensional coordinates of the centers of bounding boxes corresponding to the tracked targets respectively shot by the two cameras at the same moment, and calculating the three-dimensional coordinates of the centers of the bounding boxes of the tracked targets corresponding to the moment according to the projection matrix of the cameras; acquiring three-dimensional coordinates corresponding to bounding boxes at continuous moments to form a continuous coordinate sequence, inputting the continuous coordinate sequence into a recurrent neural network (LSTM) for calculation, and generating a subsequent coordinate sequence; and obtaining the track of the tracking target according to the continuous coordinate sequence and the subsequent coordinate sequence.
In one embodiment, the processor further implements the following steps when executing the computer program: inputting the calculated three-dimensional coordinates of the center of the bounding box of the tracking target corresponding to the moment into the LSTM (Long Short-Term Memory network) for calculation, and predicting the three-dimensional coordinates of the tracking target in the next frame of image shot by the two cameras; and taking the area containing the three-dimensional coordinates as a candidate area of the tracking target in the next frame of image.
In one embodiment, the processor further implements the following steps when executing the computer program: inputting the candidate area into a preset convolutional neural network model, and processing to obtain a bounding box of a tracking target in the image; and inputting the bounding box of the tracking target in the image into a preset regression layer to perform regression processing to obtain a regressed bounding box corresponding to the tracking target, wherein the preset regression layer comprises a low-layer convolution layer of a preset convolution neural network model.
In one embodiment, the processor further implements the following steps when executing the computer program: and extracting a candidate region corresponding to the tracking target from the image by adopting a background subtraction method.
In one embodiment, the processor further implements the following steps when executing the computer program: respectively establishing a world coordinate system and a camera coordinate system; acquiring an internal parameter matrix and an external parameter matrix of a camera; and establishing a camera projection matrix according to the internal parameter matrix and the external parameter matrix, wherein the camera projection matrix can convert the two-dimensional coordinates of the camera coordinate system into the three-dimensional coordinates of the world coordinate system.
It will be understood by those skilled in the art that all or part of the processes in the methods of the embodiments described above may be implemented by a computer program instructing the relevant hardware, and the program may be stored in a non-volatile computer-readable storage medium. In the embodiments of the present invention, the program may be stored in a storage medium of a computer system and executed by at least one processor in the computer system, so as to implement the processes of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features of the above embodiments are described; nevertheless, any such combination that involves no contradiction should be considered within the scope of this specification.
The above-mentioned embodiments express only several implementations of the present invention, and while their description is specific and detailed, it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the inventive concept, and these fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (8)

1. A table tennis target tracking and trajectory prediction method, the method comprising:
acquiring a frame of image shot by two cameras for a tracking target at the same moment;
extracting a candidate region corresponding to the tracking target from the image;
inputting the candidate region into a preset convolutional neural network model, and processing to obtain a bounding box of a tracking target in the image;
inputting the bounding box of the tracking target in the image into a preset regression layer to perform regression processing, and obtaining a regressed bounding box corresponding to the tracking target, wherein the preset regression layer comprises a low-layer convolution layer of the preset convolution neural network model;
acquiring two-dimensional coordinates of the bounding box center corresponding to the tracking target respectively shot by the two cameras at the same moment, and calculating the three-dimensional coordinates of the bounding box center of the tracking target corresponding to the moment according to the camera projection matrix;
acquiring three-dimensional coordinates corresponding to bounding boxes at continuous moments to form a continuous coordinate sequence, inputting the continuous coordinate sequence into a recurrent neural network (LSTM) for calculation, and generating a subsequent coordinate sequence;
and obtaining the track of the tracking target according to the continuous coordinate sequence and the subsequent coordinate sequence.
2. The method of claim 1, further comprising:
inputting the calculated three-dimensional coordinates of the center of the bounding box of the tracking target corresponding to the moment into the LSTM (Long Short-Term Memory network) for calculation, and predicting the three-dimensional coordinates of the tracking target in the next frame of image shot by the two cameras;
and taking the area containing the three-dimensional coordinates as a candidate area of a tracking target in the next frame of image.
3. The method according to claim 1, wherein the extracting the candidate region corresponding to the tracking target from the image comprises:
and extracting a candidate region corresponding to the tracking target from the image by adopting a background subtraction method.
4. The method of claim 1, wherein the process of building the camera projection matrix comprises:
respectively establishing a world coordinate system and a camera coordinate system;
acquiring an internal parameter matrix and an external parameter matrix of a camera;
and establishing a camera projection matrix according to the internal parameter matrix and the external parameter matrix, wherein the camera projection matrix can convert the two-dimensional coordinates of the camera coordinate system into the three-dimensional coordinates of the world coordinate system.
5. A table tennis target tracking and trajectory prediction apparatus, the apparatus comprising:
the camera shooting module is used for acquiring a frame of image shot by the two cameras on the tracking target at the same moment;
the tracking target candidate region extraction module is used for extracting a candidate region corresponding to the tracking target from the image;
the tracking target bounding box acquisition module is used for inputting the candidate region into a preset convolutional neural network model and processing the candidate region to obtain a bounding box of a tracking target in the image; inputting the bounding box of the tracking target in the image into a preset regression layer to perform regression processing, and obtaining a regressed bounding box corresponding to the tracking target, wherein the preset regression layer comprises a low-layer convolution layer of the preset convolution neural network model;
the tracking target bounding box three-dimensional coordinate calculation module is used for acquiring two-dimensional coordinates of the bounding box center corresponding to the tracking target respectively shot by the two cameras at the same moment, and then calculating the three-dimensional coordinates of the bounding box center of the tracking target corresponding to the moment according to the camera projection matrix;
the coordinate sequence generation module is used for acquiring three-dimensional coordinates corresponding to bounding boxes at continuous moments to form a continuous coordinate sequence, inputting the continuous coordinate sequence into a recurrent neural network (LSTM) for calculation, and generating a subsequent coordinate sequence;
and the track generation module of the tracking target is used for obtaining the track of the tracking target according to the continuous coordinate sequence and the subsequent coordinate sequence.
6. The apparatus of claim 5, further comprising:
the three-dimensional coordinate prediction module of the tracking target is used for inputting the calculated three-dimensional coordinate of the center of the bounding box of the tracking target corresponding to the moment into the LSTM for calculation and predicting the three-dimensional coordinate of the tracking target in the next frame of image shot by the two cameras;
and the candidate region acquisition module of the tracking target is used for taking the region containing the three-dimensional coordinates as a candidate region of the tracking target in the next frame of image.
7. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a table tennis target tracking and trajectory prediction method according to any one of claims 1 to 4.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the computer program implements a table tennis target tracking and trajectory prediction method according to any one of claims 1 to 4.
CN201710682442.2A 2017-08-10 2017-08-10 Table tennis target tracking and trajectory prediction method, device, storage medium and computer equipment Active CN107481270B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710682442.2A CN107481270B (en) 2017-08-10 2017-08-10 Table tennis target tracking and trajectory prediction method, device, storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710682442.2A CN107481270B (en) 2017-08-10 2017-08-10 Table tennis target tracking and trajectory prediction method, device, storage medium and computer equipment

Publications (2)

Publication Number Publication Date
CN107481270A CN107481270A (en) 2017-12-15
CN107481270B true CN107481270B (en) 2020-05-19

Family

ID=60600283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710682442.2A Active CN107481270B (en) 2017-08-10 2017-08-10 Table tennis target tracking and trajectory prediction method, device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN107481270B (en)

Families Citing this family (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108062763B (en) * 2017-12-29 2020-10-16 纳恩博(北京)科技有限公司 Target tracking method and device and storage medium
CN108053653B (en) * 2018-01-11 2021-03-30 广东蔚海数问大数据科技有限公司 Vehicle behavior prediction method and device based on LSTM
CN110362098B (en) * 2018-03-26 2022-07-05 北京京东尚科信息技术有限公司 Unmanned aerial vehicle visual servo control method and device and unmanned aerial vehicle
CN108465224A (en) * 2018-04-07 2018-08-31 华北理工大学 Table tennis track analysis system
CN108540817B (en) * 2018-05-08 2021-04-20 成都市喜爱科技有限公司 Video data processing method, device, server and computer readable storage medium
CN108986141A (en) * 2018-07-03 2018-12-11 百度在线网络技术(北京)有限公司 Object movement information processing method, device, augmented reality device and storage medium
CN108876821B (en) * 2018-07-05 2019-06-07 北京云视万维科技有限公司 Across camera lens multi-object tracking method and system
CN109241856A (en) * 2018-08-13 2019-01-18 浙江零跑科技有限公司 A monocular vehicle-mounted vision system method for three-dimensional object detection
CN110956644B (en) * 2018-09-27 2023-10-10 杭州海康威视数字技术股份有限公司 Motion trail determination method and system
CN111028287B (en) * 2018-10-09 2023-10-20 杭州海康威视数字技术股份有限公司 Method and device for determining a transformation matrix of radar coordinates and camera coordinates
CN109559332B (en) * 2018-10-31 2021-06-18 浙江工业大学 Sight tracking method combining bidirectional LSTM and Itracker
CN109711274A (en) * 2018-12-05 2019-05-03 斑马网络技术有限公司 Vehicle checking method, device, equipment and storage medium
CN111291585B (en) * 2018-12-06 2023-12-08 杭州海康威视数字技术股份有限公司 GPS-based target tracking system, method and device and ball machine
CN110287764B (en) * 2019-05-06 2022-01-11 深圳大学 Gesture prediction method, gesture prediction device, computer equipment and storage medium
CN110111358B (en) * 2019-05-14 2022-05-24 西南交通大学 Target tracking method based on multilayer time sequence filtering
CN110340901B (en) * 2019-06-28 2022-09-27 深圳盈天下视觉科技有限公司 Control method, control device and terminal equipment
CN110458281B (en) * 2019-08-02 2021-09-03 中科新松有限公司 Method and system for predicting deep reinforcement learning rotation speed of table tennis robot
CN110517292A (en) 2019-08-29 2019-11-29 京东方科技集团股份有限公司 Method for tracking target, device, system and computer readable storage medium
CN110827320B (en) * 2019-09-17 2022-05-20 北京邮电大学 Target tracking method and device based on time sequence prediction
CN110796093A (en) * 2019-10-30 2020-02-14 上海眼控科技股份有限公司 Target tracking method and device, computer equipment and storage medium
CN111369629B (en) * 2019-12-27 2024-05-24 浙江万里学院 Ball return track prediction method based on binocular vision perception of swing and batting actions
CN113128290A (en) * 2019-12-31 2021-07-16 炬星科技(深圳)有限公司 Moving object tracking method, system, device and computer readable storage medium
CN111546332A (en) * 2020-04-23 2020-08-18 上海电机学院 Table tennis robot system based on embedded equipment and application
CN111939541A (en) * 2020-06-23 2020-11-17 北京瑞盖科技股份有限公司 Evaluation method, device, equipment and system for table tennis training
CN113160275B (en) * 2021-04-21 2022-11-08 河南大学 Automatic target tracking and track calculating method based on multiple videos
CN113253755A (en) * 2021-05-08 2021-08-13 广东白云学院 Neural network-based rotor unmanned aerial vehicle tracking algorithm
CN114373021A (en) * 2021-12-06 2022-04-19 晋城鸿智纳米光机电研究院有限公司 Table tennis identification method and device
CN114419098A (en) * 2022-01-18 2022-04-29 长沙慧联智能科技有限公司 Moving target trajectory prediction method and device based on visual transformation
CN114589719B (en) * 2022-04-02 2024-03-08 中国电子科技集团公司第五十八研究所 Real-time calibration and calibration system and method for table tennis service robot
CN114612522B (en) * 2022-05-09 2023-01-17 广东金融学院 Table tennis sport parameter detection method and device and table tennis training auxiliary system
CN115278194B (en) * 2022-09-22 2022-12-23 山东省青东智能科技有限公司 Image data processing method based on 3D industrial camera
CN115965658A (en) * 2023-03-16 2023-04-14 江西工业贸易职业技术学院 Ball motion trajectory prediction method and system, electronic device and storage medium
CN116504068A (en) * 2023-06-26 2023-07-28 创辉达设计股份有限公司江苏分公司 Statistical method, device, computer equipment and storage medium for lane-level traffic flow
CN117237409B (en) * 2023-09-06 2024-06-14 广州飞漫思维数码科技有限公司 Shooting game sight correction method and system based on Internet of things

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101256673A (en) * 2008-03-18 2008-09-03 中国计量学院 Method for tracing arm motion in real time video tracking system
CN101282461A (en) * 2007-04-02 2008-10-08 财团法人工业技术研究院 Image processing methods
CN101458434A (en) * 2009-01-08 2009-06-17 浙江大学 System for precision measuring and predicting table tennis track and system operation method
CN107690840B (en) * 2009-06-24 2013-07-31 中国科学院自动化研究所 Unmanned plane vision auxiliary navigation method and system
CN106022527A (en) * 2016-05-27 2016-10-12 河南明晰信息科技有限公司 Trajectory prediction method and device based on map tiling and LSTM recurrent neural network
CN106485226A (en) * 2016-10-14 2017-03-08 杭州派尼澳电子科技有限公司 A video pedestrian detection method based on a neural network
CN106780620A (en) * 2016-11-28 2017-05-31 长安大学 A table tennis trajectory recognition, positioning and tracking system and method

Also Published As

Publication number Publication date
CN107481270A (en) 2017-12-15

Similar Documents

Publication Publication Date Title
CN107481270B (en) Table tennis target tracking and trajectory prediction method, device, storage medium and computer equipment
CN106780620B (en) Table tennis motion trail identification, positioning and tracking system and method
CN107274433B (en) Target tracking method and device based on deep learning and storage medium
CN109102522B (en) Target tracking method and device
CN110176032B (en) Three-dimensional reconstruction method and device
CN109410316B (en) Method for three-dimensional reconstruction of object, tracking method, related device and storage medium
CN108182695B (en) Target tracking model training method and device, electronic equipment and storage medium
CN110827321B (en) Multi-camera collaborative active target tracking method based on three-dimensional information
CN111199556A (en) Indoor pedestrian detection and tracking method based on camera
CN109902675B (en) Object pose acquisition method and scene reconstruction method and device
CN108519102A (en) A kind of binocular vision speedometer calculation method based on reprojection
CN110516639B (en) Real-time figure three-dimensional position calculation method based on video stream natural scene
CN111899345B (en) Three-dimensional reconstruction method based on 2D visual image
Chen et al. Real-time object tracking via CamShift-based robust framework
CN117036404A (en) Monocular thermal imaging simultaneous positioning and mapping method and system
Sokolova et al. Human identification by gait from event-based camera
CN116740126A (en) Target tracking method, high-speed camera, and storage medium
Tian et al. High-speed tiny tennis ball detection based on deep convolutional neural networks
Zhu et al. PairCon-SLAM: Distributed, online, and real-time RGBD-SLAM in large scenarios
CN116105721B (en) Loop optimization method, device and equipment for map construction and storage medium
Chen et al. Stingray detection of aerial images with region-based convolution neural network
Shere et al. Temporally consistent 3D human pose estimation using dual 360° cameras
CN110910489A (en) Monocular vision based intelligent court sports information acquisition system and method
CN115953471A (en) Indoor scene multi-scale vector image retrieval and positioning method, system and medium
Wang et al. Research and implementation of the sports analysis system based on 3D image technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant