Disclosure of Invention
The invention aims to provide a target tracking method based on a pyramid twin network, which solves the technical problems of low target identification accuracy, low tracking speed and low robustness of the existing tracking algorithm.
A target tracking method based on a pyramid twin network, the tracking method comprising the steps of:
step 1: building a target tracking system platform;
step 2: constructing a pyramid twin network model;
step 3: training a pyramid twin network model;
step 4: testing a pyramid twin network model;
step 5: and transmitting the trained and tested pyramid twin network model to a target tracking system platform to realize target tracking.
The target tracking system platform comprises a model training system and an embedded tracking system, wherein the model training system is connected with the embedded tracking system through a serial port, a network cable, USB (universal serial bus) or WIFI (wireless fidelity); the model training system comprises a server, a camera and a display, the server is connected to the display through a VGA (video graphics array) data line, the server is connected to the camera through a USB data line, and the server is connected to the Internet through the network cable or WIFI to acquire training-set data and data tags; after the pyramid twin network model is trained and tested, the trained pyramid twin network model and its parameters are transmitted to the embedded tracking system through the serial port, the network cable, the USB or the WIFI;
the embedded tracking system comprises a power supply module, a controller module, a camera, a data storage module, a motor module, a key module, a remote controller, a wireless module, a voice module, an LCD module, an LED module and a USB module, wherein the remote controller is connected with the embedded tracking system through the wireless module; the power supply module consists of an AC-DC circuit and a voltage stabilizer and outputs power supplies of different voltages; the controller module is used for data processing; the data storage module is used for storing an operating system, model data and parameter data; the motor module comprises 3 motors and 3 motor driving modules and is used for rotating, pitching and rolling the camera; the camera is used for tracking the target in real time; the key module is used for setting the working mode of the tracking system, one press switching into the target tracking mode and another press cancelling it; the remote controller is used for wirelessly controlling the embedded tracking system, switching the tracking mode and performing recording and shutdown operations; the voice module comprises a voice acquisition module and a voice playing module and is used for broadcasting voice; the LCD module is used for displaying data processed by the controller module and for parameter setting; the LED module is used for indicating the running state of the system, the LED being steady on in normal operation, flashing slowly in tracking mode and flashing quickly when an abnormality occurs; and the USB module is used for data transmission.
Further, the pyramid twin network model in the step 2 is composed of a twin network, a feature pyramid network and a classification positioning parallel network; the twin network is composed of two VGG-based subnets which share the same parameters and are used for respectively extracting features of a target image and a search image; after the twin network finishes feature extraction of the target image and the search image, target feature layers and search feature layers with different scales are respectively obtained, and 6 layers of features are extracted from the feature layers with different levels and different scales to construct the pyramid network;
after the feature pyramid network is constructed, the feature pyramid network is combined with a classification positioning parallel network for real-time positioning and tracking of a target, the classification positioning parallel network is composed of a candidate frame subnet, a classifier subnet and a positioning regression subnet, the candidate frame subnet, the classifier subnet and the positioning regression subnet respectively generate a candidate frame, a confidence coefficient and a coordinate offset, and the classifier subnet and the positioning regression subnet are executed in parallel.
Further, the two subnets formed by VGGs are a target subnet and a search subnet, feature extraction is performed on the target image and the search image respectively, the target subnet and the search subnet share the same weight and bias, each of the target subnet and the search subnet is formed by eleven layers of convolution layers, and the eleven layers of convolution layers are respectively: the first layer is composed of 2 convolution units, the second layer is composed of 2 convolution units, the third layer is composed of 3 convolution units, the fourth layer is composed of 3 convolution units, the fifth layer is composed of 3 convolution units, the sixth layer is composed of 1 convolution unit, the seventh layer is composed of 1 convolution unit, the eighth layer is composed of 2 convolution units, the ninth layer is composed of 2 convolution units, the tenth layer is composed of 2 convolution units, and the eleventh layer is composed of 2 convolution units;
the feature pyramid network is composed of a target subnet and feature layers obtained by searching the subnet, wherein the total number of layers is 6, and the feature pyramid network is respectively: the first layer is formed by an eleventh characteristic layer in the target subnet; the second layer is formed by a tenth layer of characteristic layer in the searching sub-network; the third layer is formed by a seventh characteristic layer in the target subnet; the fourth layer is formed by a sixth layer of characteristic layers in the searching sub-network; the fifth layer is composed of a fourth characteristic layer in the target subnet; the sixth layer is formed by the third feature layer in the search sub-network,
the classification positioning parallel network consists of a candidate frame sub-network, a classifier sub-network and a positioning regression sub-network, wherein the candidate frame sub-network consists of candidate frames capable of predicting the target, the classifier sub-network consists of a normalized exponential function (softmax) classifier capable of distinguishing the target from non-targets, and the positioning regression sub-network consists of a 3x3 convolution kernel for positioning the target; the candidate frame sub-network divides each layer of image features into n x n grids, n being a positive integer, each grid generates 6 candidate frames with fixed sizes, and the candidate frames respectively generate confidence and coordinate offset through the classifier sub-network and the positioning regression sub-network.
Further, the specific process of the step 3 is as follows:
acquiring an original video sequence from a video database, performing image preprocessing on the video sequence to acquire a target training set and a search training set, wherein targets of the training set are all at the center position of an image;
after the training set image processing is completed, paired target training set and search training set pictures are input to a subnet corresponding to a twin network to obtain a target feature layer and a search feature layer, feature layers with different layers and different scales are extracted to construct a pyramid network, and meanwhile, candidate frames with different positions and different sizes are constructed in each feature layer of the pyramid network based on a candidate frame size formula and a position formula;
inputting each layer of feature layer of the pyramid network into a classified positioning parallel network to obtain an output result of the parallel network, and performing similarity matching on the output result and a tag true value to obtain a positive sample and a negative sample;
calculating an error between a matching result and a tag true value by using a target loss function, reversely transmitting the error to an input layer by layer, adjusting weights and offsets in a network based on a small-batch random gradient descent optimization algorithm, acquiring an optimal error value, and completing one-time network model training;
repeating the steps until the error value of the target loss function converges to the minimum value.
Further, the size formula of the candidate frame is shown as formula (1):

$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \qquad k \in [1, m] \tag{1}$$

wherein $s_k$ represents the size of the k-th layer image feature candidate box, $s_{min} = 0.2$ represents the size of the first-layer image feature candidate frame, $s_{max} = 0.95$ represents the size of the sixth-layer image feature candidate frame, and m represents the number of feature layers;

the position formula is shown as formula (2):

$$w_k^a = s_k \sqrt{a_r}, \qquad h_k^a = \frac{s_k}{\sqrt{a_r}} \tag{2}$$

wherein $w_k^a$ represents the width of the candidate box, $h_k^a$ represents the height of the candidate box, and $a_r$ represents a proportionality coefficient, $a_r \in \{1, 2, 3, 1/2, 1/3\}$; if $a_r = 1$, an additional candidate box of size $s_k' = \sqrt{s_k s_{k+1}}$ is added, and the center of each candidate box is set to

$$\left( \frac{i + 0.5}{|f_k|}, \frac{j + 0.5}{|f_k|} \right)$$

wherein $|f_k|$ represents the size of the k-th layer feature map and $i, j \in [0, |f_k|)$. By combining the image features, candidate frames with different sizes and different proportionality coefficients predict targets with different sizes and different shapes.
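The size and position formulas above can be sketched in a few lines of Python. The aspect-ratio set {1, 2, 3, 1/2, 1/3} and the extra square box for a_r = 1 (which together give the 6 boxes per grid cell mentioned above) follow the common single-shot-detector convention and are assumptions, not values stated explicitly in the text:

```python
import math

def candidate_box_sizes(m=6, s_min=0.2, s_max=0.95):
    """Size s_k of the candidate boxes on each of the m pyramid layers (formula 1)."""
    return [s_min + (s_max - s_min) / (m - 1) * (k - 1) for k in range(1, m + 1)]

def candidate_box_shapes(s_k, s_next, ratios=(1, 2, 3, 1 / 2, 1 / 3)):
    """(width, height) pairs for one layer: one box per aspect ratio a_r (formula 2),
    plus an extra square box of size sqrt(s_k * s_{k+1}) for a_r = 1 -> 6 per cell."""
    shapes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in ratios]
    shapes.append((math.sqrt(s_k * s_next),) * 2)
    return shapes

def box_centers(f_k):
    """Normalized centers ((i+0.5)/|f_k|, (j+0.5)/|f_k|) over an |f_k| x |f_k| grid."""
    return [((i + 0.5) / f_k, (j + 0.5) / f_k)
            for i in range(f_k) for j in range(f_k)]
```

With m = 6 the endpoints reproduce the stated layer sizes 0.2 and 0.95, and each grid cell carries exactly 6 boxes.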
The similarity matching refers to computing, for the true value (O) and a candidate frame (I), the result of the formula

$$\frac{|O \cap I|}{|O \cup I|}$$

wherein a candidate frame is regarded as a positive sample if the result is greater than or equal to the threshold value 0.5, and as a negative sample if the result is less than the threshold value 0.5; $x_i$ represents the coefficient of matching the i-th candidate frame with the true value, and when $\sum_i x_i \geq 1$, a plurality of candidate frames are matched with the true value;
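The intersection-over-union matching above can be illustrated with a small sketch; the corner-format box representation (x1, y1, x2, y2) is an assumed convention, not one stated in the text:

```python
def iou(box_o, box_i):
    """Intersection-over-union |O ∩ I| / |O ∪ I| of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_o[0], box_i[0]), max(box_o[1], box_i[1])
    ix2, iy2 = min(box_o[2], box_i[2]), min(box_o[3], box_i[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_o) + area(box_i) - inter
    return inter / union if union > 0 else 0.0

def match_candidates(truth, candidates, threshold=0.5):
    """x_i = 1 for each candidate whose IoU with the truth reaches the 0.5 threshold."""
    return [1 if iou(truth, c) >= threshold else 0 for c in candidates]
```

When the returned coefficients sum to 1 or more, several candidate frames match the same true value, as described above.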
the target loss function L (x, c, p, t) refers to a confidence loss function L conf (x, c) and a coordinate loss function L loc The weighted summation of (x, p, t) is shown in the formula (3):
confidence loss function L in the above (3) conf (x, c) formula such as (4)The following is shown:
the coordinate loss function L in the above (3) loc The formula (x, p, t) is shown as formula (5):
wherein x represents a matching coefficient, c represents a confidence level, p represents a prediction frame, t represents a real frame, N represents a positive sample number, i represents an ith candidate frame, alpha represents a scale parameter between a confidence loss function and a coordinate loss function,
target representing foreground, ++>
Target representing background,/->
K in (c) represents a k-th layer feature map, m represents m-th layer feature layers for prediction,
representing the offset value of the real frame relative to the candidate frame, (cx, cy) representing the center coordinates of the real frame, w and h representing the width and height, respectively, and d representing the candidate frame;
the small batch random gradient descent method is an algorithm for optimizing the loss function of the original model, and the formula is shown as the formula (6):
where θ represents the parameter that needs to be updated, b represents the number of samples needed for each iteration, i=1, 2, …, b represents the number of samples,j=0, 1 denotes a feature number, α denotes a learning rate, y θ (x (i) ) Representing a hypothetical function, x (i) Representing the ith sample feature, y (i) Representing the output corresponding to the i-th sample.
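Formula (6) can be exercised on a toy linear hypothesis h_θ(x) = θ₀ + θ₁x, matching the j = 0, 1 feature numbering above; the data, learning rate and iteration count are illustrative only, not values from the disclosure:

```python
import random

def minibatch_sgd(samples, alpha=0.1, b=4, iterations=2000, seed=0):
    """Fit h_theta(x) = theta0 + theta1 * x with the update of formula (6):
    theta_j := theta_j - alpha * (1/b) * sum_i (h(x_i) - y_i) * x_i_j."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(iterations):
        batch = rng.sample(samples, b)          # b samples drawn per iteration
        grad = [0.0, 0.0]
        for x, y in batch:
            err = theta[0] + theta[1] * x - y   # h_theta(x) - y
            grad[0] += err                      # feature x_0 = 1 (bias term)
            grad[1] += err * x                  # feature x_1 = x
        theta = [theta[j] - alpha * grad[j] / b for j in range(2)]
    return theta
```

On noiseless data generated from y = 1 + 2x the iterates settle near (1, 2), illustrating convergence of the loss toward its minimum as in the training step above.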
Further, the specific process of the step 4 is as follows:
step 4.1: respectively inputting a first frame and a second frame in an original video sequence into the target subnet and the search subnet of the trained twin network;
step 4.2: inputting output results of the target subnetwork and the twin subnetwork into a pyramid network which is obtained through training so as to obtain candidate frames of each feature layer in the pyramid network;
step 4.3: inputting the feature layer and the candidate box of the pyramid network into a classified positioning parallel network to obtain a test result;
step 4.4: inputting the test result into a target subnet of the twin network, and inputting the next frame in the original video sequence into a search subnet of the twin network;
step 4.5: repeating the steps 4.2 to 4.4, thereby realizing real-time positioning and tracking of the target.
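The test loop of steps 4.1 to 4.5 can be outlined as a sketch in which four callables stand in for the trained sub-networks; their names and the frame representation are placeholders, not the patent's API:

```python
def track(frames, extract_target, extract_search, pyramid, parallel_head):
    """Sketch of steps 4.1-4.5: the first frame seeds the target branch, every
    later frame goes through the search branch, and each test result re-seeds
    the target branch for the next frame."""
    positions = []
    target_feat = extract_target(frames[0])             # step 4.1: target subnet
    for frame in frames[1:]:
        search_feat = extract_search(frame)             # step 4.1: search subnet
        candidates = pyramid(target_feat, search_feat)  # step 4.2: pyramid features
        result = parallel_head(candidates)              # step 4.3: classify + locate
        positions.append(result)
        target_feat = extract_target(result)            # step 4.4: feed result back
    return positions                                    # step 4.5: repeat per frame
```

Substituting the trained subnets for the placeholder callables yields one position estimate per frame after the first.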
Further, the specific process of the step 5 is as follows:
step 5.1: the trained pyramid twin network model and its parameters are transmitted and stored into a data storage module in the embedded tracking system through a network cable or USB;
step 5.2: starting a camera, collecting a test video, and storing the test video in a data storage module;
step 5.3: operating a pyramid twin network model algorithm, automatically calling model parameters stored in a data storage module, and carrying out real-time processing on video data streams stored in the data storage module;
step 5.4: according to the processing result, the target position is automatically framed in the video stream, and a PID control algorithm (an existing control algorithm whose parameters can be tuned by conventional methods) is combined to control the 3 motors to perform corresponding rotation, pitching and rolling motions, so that the camera stably tracks and positions the target in real time.
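A textbook discrete PID loop of the kind referred to above, one instance per motor axis (rotation, pitch, roll), can be sketched as follows; the gains and sampling period are illustrative, not values from the disclosure:

```python
class PID:
    """Discrete PID controller: u = kp*e + ki*sum(e)*dt + kd*(de/dt)."""

    def __init__(self, kp, ki, kd, dt=0.02):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement):
        """One control step driving the motor toward the framed target position."""
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

The setpoint would be the target's framed position and the measurement the current camera attitude, with the output driving the corresponding motor.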
The invention adopts the above technical scheme and has the following technical effects:
(1) The invention provides a novel feature pyramid twin network model, which has the characteristics of weight sharing and feature parallel extraction, so that the real-time performance of the pyramid twin network model in target tracking can be improved.
(2) The invention adopts the characteristics of different subnets, different levels and different scales to construct a brand new characteristic pyramid network, so that the invention has stronger robustness in complex environments.
(3) The invention constructs a classified positioning parallel network, which respectively generates a candidate frame, confidence coefficient and coordinate offset, and the target tracking speed and precision of the invention can be improved because the 3 sub-networks of the network are executed in parallel and the output result contains the coordinate correction value.
(4) The invention applies the target tracking algorithm to the embedded tracking system platform on the basis of combining the PID control algorithm, can stably control 3 motors to perform corresponding rotation, pitching and rolling motions, realizes 3-dimensional tracking and positioning of the target by the camera, solves the defects of poor motion stability, less motion dimension and the like of the current target tracking platform, and has higher practical value and market prospect.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings and preferred embodiments. It should be noted, however, that many of the details set forth in the description are provided merely to give a thorough understanding of one or more aspects of the invention, and that these aspects may be practiced without these specific details.
Referring to fig. 1-4, the invention provides a target tracking method based on pyramid twin network, comprising the following steps:
1. construction of target tracking system platform
The target tracking system platform mainly comprises a model training system and an embedded tracking system. The model training system is connected with the embedded tracking system through serial ports, network cables, USB and WIFI to realize wired and wireless data communication, and the specific connection mode is shown in figure 4.
wherein :
(1) The model training system comprises a server, a camera and a display. Firstly, the server is connected to the display through a VGA data line to realize man-machine interaction and algorithm debugging; then, development software such as Python, cuDNN and TensorFlow is installed on the server on top of the Ubuntu operating system to build the development environment of the pyramid-twin-network target tracking algorithm; the server is connected to the camera through a USB data line to acquire the test data set required by the target tracking algorithm; meanwhile, the server is connected to the Internet through a network cable or WIFI to acquire data such as the training set and data tags; finally, the target tracking algorithm is run in this development environment, and after training and testing, the trained pyramid twin network model and parameters are transmitted to the embedded tracking system through a serial port, a network cable, USB or WIFI.
(2) The embedded tracking system comprises a power module, an MCU, a camera, a data storage module, a motor module, a key module, a remote controller, a wireless module, a voice module, an LCD, an LED and a USB module. Specific examples will now be described in detail. Firstly, pressing a start button of a remote controller, starting an embedded tracking system, starting a main program to run, reading each parameter and model, and performing corresponding processing according to the parameters; initializing each module of the system; after initialization is completed, the LCD displays related information, detects whether the system and each module have abnormal conditions, and if not, the LED is always on, otherwise, the LED is flashed; next, the main program waits for the arrival of a task, and can execute a target tracking algorithm, a PID control algorithm, a camera processing, a data storage processing, a motor processing, a key processing, a remote controller processing, a wireless communication processing, a voice processing, an LCD processing, an LED processing, and a USB processing according to different tasks. If the remote controller or the key presses the tracking mode key, the main program executes LED slow flashing and runs target tracking processing, namely the target tracking algorithm based on the pyramid twin network provided by the invention is run, and in the execution of the target tracking processing, the video stream acquired by the camera is processed in real time to acquire the position of the target; after the target position is obtained, triggering the main program to execute motor control processing, namely controlling 3 motors to perform corresponding rotation, pitching and rolling motions based on a PID control algorithm, so that the camera can stably and real-timely track and position the target.
Fig. 3 is a main program workflow diagram of the embedded tracking system of the present invention. The process starts in step S301; step S302 is then executed to read and process the parameters; step S303 initializes the system and each module. After initialization is completed, step S304 determines whether the system has an abnormal condition; if so, step S305 determines which conditions are abnormal, records them and handles them accordingly. If there is no abnormality, step S306 waits for a trigger task instruction to arrive. Since the program runs under the embedded Linux operating system and is designed in a modularized manner, the main program only needs to wait for a task and execute the corresponding operation when it arrives, which keeps the program concise and improves its real-time performance; according to the task, it can execute the target tracking algorithm, the PID control algorithm, camera processing, data storage processing, motor processing, key processing, remote controller processing, wireless communication processing, voice processing, LCD processing, LED processing and USB processing.
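The modular wait-for-task design of step S306 can be illustrated with a minimal dispatcher sketch; the handler names are hypothetical, not identifiers from the disclosure:

```python
def make_dispatcher():
    """Task dispatch in the spirit of step S306: the main loop blocks until a
    task name arrives, then runs the handler registered for that task."""
    handlers = {}

    def register(name, fn):
        handlers[name] = fn          # one handler per modular task

    def dispatch(task, *args):
        if task not in handlers:
            return ("unknown task", task)
        return handlers[task](*args)

    return register, dispatch
```

Each module (LED, motor, tracking, and so on) registers its own handler, so the main program stays a simple wait-and-dispatch loop.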
2. Model construction and training
1. Constructing a pyramid twin network model:
(1) The pyramid twin network model constructed by the invention mainly comprises a twin network, a feature pyramid and a classification positioning parallel network, as shown in figure 1, and the model runs in a server of the platform. The twin network consists of two sub-networks formed by VGGs, the twin network shares the same parameters, and the two sub-networks respectively extract the characteristics of the target image and the search image.
(2) After feature extraction is completed by the twin network, target feature layers and search feature layers with different scales are respectively obtained, and 6 layers of features are extracted from the feature layers with different levels and different scales to construct a pyramid network. The pyramid network is fused with the feature layers with different levels, so that the pyramid network has translation and scale invariance, and the target tracking capability of the invention in complex application occasions is improved.
(3) After the feature pyramid network is constructed, the feature pyramid network is combined with the classified positioning parallel network, so that real-time positioning and tracking of targets are realized. The classification positioning parallel network consists of a candidate frame subnet, a classifier subnet and a positioning regression subnet, wherein the 3 subnets respectively generate candidate frames, confidence coefficient and coordinate offset, and the classifier subnet and the positioning regression subnet are executed in parallel, so that the execution speed of an algorithm is further improved. Finally, through the output results of the 3 subnets, the real-time positioning and tracking of the target are realized.
Wherein: the twin network is composed of two subnets formed by VGGs, which are respectively called a target subnet and a search subnet, and respectively extract characteristics of a target image and a search image and share the same weight and bias. Where VGG is the base network of the twin network, which is made up of eleven convolutional layers. The eleven convolution layers are respectively: the first layer conv1 is made up of 2 convolutions conv1_1 and conv1_2 of 224x224x 64; the second convolution layer conv2 is formed by 2 convolutions conv2_1 and conv2_2 of 112x112x 128; the third layer conv3 is composed of 3 convolutions conv3_1, conv3_2 and conv3_3 of size 56x56x 256; the fourth convolution layer conv4 is composed of 3 convolutions conv4_1, conv4_2 and conv4_3 of size 28x28x 512; the fifth layer of convolutions conv5 is made up of 3 convolutions conv5_1, conv5_2 and conv5_3 of size 14x14x 512; the sixth convolution layer is formed by convolution conv6 with the size of 3x3x 1024; the seventh convolution layer is formed by convolution conv7 with the size of 1x1x 1024; the eighth convolution layer conv8 is made up of 2 convolutions conv8_1 of size 1x1x256 and 3x3x512 conv8_2; the ninth convolution layer conv9 is made up of 2 convolutions conv9_1 of size 1x1x128 and 3x3x256 conv9_2; the tenth convolution layer conv10 is made up of 2 convolutions conv10_1 of size 1x1x128 and 3x3x256 conv10_2; the eleventh convolution layer conv11 is made up of 2 convolutions conv11_1 of size 1x1x128 and 3x3x256 conv11_2. The specific parameters are shown in table 1:
TABLE 1
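As a sanity check, the eleven-layer enumeration above can be tallied in a short sketch (layer names and per-layer convolution counts as given in the text):

```python
# (layer name, number of convolution units) exactly as enumerated above
VGG_LAYERS = [
    ("conv1", 2), ("conv2", 2), ("conv3", 3), ("conv4", 3), ("conv5", 3),
    ("conv6", 1), ("conv7", 1), ("conv8", 2), ("conv9", 2),
    ("conv10", 2), ("conv11", 2),
]

def total_conv_units(layers):
    """Total convolution units across the backbone's eleven layers."""
    return sum(n for _, n in layers)
```

The tally confirms eleven layers totalling 23 convolution units in each of the two weight-sharing subnets.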
The feature pyramid network is composed of the feature layers obtained by the target sub-network and the search sub-network, the total layer number is 6, and the feature pyramid network comprises the following layers: the first layer is composed of Conv11_2 in the target subnet; the second layer is composed of Conv10_2 in the search subnet; the third layer is composed of Conv7 in the target subnet; the fourth layer is composed of Conv6 in the search subnet; the fifth layer is composed of Conv4_3 in the target subnet; the sixth layer consists of conv3_3 in the search subnet.
The classification positioning parallel network consists of a candidate frame subnet, a classifier subnet and a positioning regression subnet. The candidate frame subnet is composed of candidate boxes, which have the ability to predict the target. The classifier subnet is composed of softmax classifiers, which have the ability to distinguish between target and non-target. The positioning regression subnet is composed of a 3x3 convolution kernel that is capable of localizing objects. The candidate frame subnet divides each layer of feature map into n x n grids, each grid generates 6 candidate boxes with fixed size, and confidence and coordinate offset are respectively generated from the candidate boxes through the classifier subnet and the positioning regression subnet. Here the confidence is the score of the target, and the coordinate offset is the relative distance between the predicted bounding box and the candidate box position on the feature map.
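A hedged sketch of the candidate-box budget implied by the n x n grids with 6 boxes per cell; the per-layer grid sizes used in the example are illustrative, since the text does not state them:

```python
def total_candidate_boxes(grid_sizes, boxes_per_cell=6):
    """Each n x n grid cell carries 6 fixed-size candidate boxes, so a pyramid
    with the given per-layer grid sizes produces sum(n * n * 6) boxes."""
    return sum(n * n * boxes_per_cell for n in grid_sizes)
```

For instance, hypothetical grids of 38, 19, 10, 5, 3 and 1 cells per side over the six pyramid layers yield 11,640 candidate boxes to classify and regress per frame.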
2. Training a pyramid twin network model:
(1) The method comprises the steps of obtaining an original video sequence from a video database in the ILSVRC, and performing image processing on the video sequence to obtain a target training set and a search training set. The size of the target training set image is 127x127x3, the size of the searching training set image is 255x255x3, and the targets of the training set are all at the center of the image;
(2) After the training set image processing is completed, inputting paired training set pictures into a subnet corresponding to a twin network to obtain a target feature layer and a search feature layer, extracting the feature layers with different layers and different scales to construct the pyramid network, and simultaneously, constructing candidate frames with different positions and different sizes in each feature layer of the pyramid network based on a candidate frame size formula and a position formula;
(3) And inputting each layer of characteristic layer of the pyramid network into the classification positioning parallel network to obtain an output result of the parallel network, and performing similarity matching on the output result and a tag true value to obtain a positive sample and a negative sample.
(4) Calculating the error between the matching result and the tag true value by using a target loss function, reversely spreading the error to an input layer by layer, and simultaneously adjusting the weight and bias in the network based on a small batch of random gradient descent optimization algorithm to obtain an optimal error value, thereby completing one-time network model training;
(5) Repeating the steps until the error value of the target loss function converges to the minimum value.
wherein :
the size formula of the candidate frame is shown as formula (1):

$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \qquad k \in [1, m] \tag{1}$$

wherein $s_k$ represents the size of the k-th layer feature map candidate box, $s_{min} = 0.2$ represents the size of the first-layer feature map candidate box, $s_{max} = 0.95$ represents the size of the sixth-layer feature map candidate box, and m represents the number of feature layers.

The position formula is shown as formula (2):

$$w_k^a = s_k \sqrt{a_r}, \qquad h_k^a = \frac{s_k}{\sqrt{a_r}} \tag{2}$$

wherein $w_k^a$ represents the width of the candidate box, $h_k^a$ represents the height of the candidate box, and $a_r$ represents a scale factor, $a_r \in \{1, 2, 3, 1/2, 1/3\}$. If $a_r = 1$, an additional candidate box of size $s_k' = \sqrt{s_k s_{k+1}}$ is added, and the center of each candidate box is set to

$$\left( \frac{i + 0.5}{|f_k|}, \frac{j + 0.5}{|f_k|} \right)$$

wherein $|f_k|$ represents the size of the k-th layer feature map and $i, j \in [0, |f_k|)$. Combining the feature maps, candidate boxes with different sizes and different scale factors predict targets with different sizes and different shapes. In the prediction results, the negative-sample candidate boxes are sorted according to their confidence, and the negatives with the highest confidence are selected so that the ratio of negative to positive samples is 3:1.
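The negative-sample selection described above (keep all positives and only the highest-confidence negatives, at a 3:1 negative-to-positive ratio) can be sketched as:

```python
def hard_negative_mining(confidences, labels, neg_pos_ratio=3):
    """Return indices of all positive samples and of the highest-confidence
    negatives, capped at neg_pos_ratio times the number of positives."""
    pos = [i for i, label in enumerate(labels) if label == 1]
    neg = sorted((i for i, label in enumerate(labels) if label == 0),
                 key=lambda i: confidences[i], reverse=True)
    return pos, neg[: neg_pos_ratio * len(pos)]
```

Selecting the hardest (most confidently wrong) negatives keeps the class balance stable during loss computation.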
The similarity matching refers to computing, for the ground truth (O) and a candidate box (I), the result of the formula

$$\frac{|O \cap I|}{|O \cup I|}$$

wherein a candidate box is considered a positive sample if the result is greater than or equal to the threshold value 0.5, and a negative sample if the result is less than the threshold value 0.5. Meanwhile, the invention uses $x_i$ to represent the coefficient of matching the i-th candidate box with the ground truth; when $\sum_i x_i \geq 1$, a plurality of candidate boxes are matched with the ground truth.
The target loss function L(x, c, p, t) refers to the weighted summation of a confidence loss function $L_{conf}(x, c)$ (confidence loss) and a coordinate loss function $L_{loc}(x, p, t)$ (localization loss), as shown in formula (3):

$$L(x, c, p, t) = \frac{1}{N}\bigl(L_{conf}(x, c) + \alpha L_{loc}(x, p, t)\bigr) \tag{3}$$

the confidence loss function $L_{conf}(x, c)$ in formula (3) is shown in formula (4):

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_i \log \hat{c}_i^{p} - \sum_{i \in Neg} \log \hat{c}_i^{0} \tag{4}$$

the coordinate loss function $L_{loc}(x, p, t)$ in formula (3) is shown in formula (5):

$$L_{loc}(x, p, t) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_i\, \mathrm{smooth}_{L1}\bigl(p_i^{m} - \hat{t}^{m}\bigr) \tag{5}$$

where x denotes the matching coefficient, c denotes the confidence, p denotes the prediction box, t denotes the true box, N denotes the number of positive samples, i denotes the i-th candidate box, and $\alpha$ denotes the scale parameter between the confidence loss and the localization loss; $\hat{c}_i^{p}$ represents the confidence of the foreground (object) and $\hat{c}_i^{0}$ the confidence of the background; k denotes the k-th layer feature map and m feature layers are used for prediction; $\hat{t} = (\hat{t}^{cx}, \hat{t}^{cy}, \hat{t}^{w}, \hat{t}^{h})$ represents the offset value of the true box relative to the candidate box:

$$\hat{t}^{cx} = \frac{t^{cx} - d^{cx}}{d^{w}}, \quad \hat{t}^{cy} = \frac{t^{cy} - d^{cy}}{d^{h}}, \quad \hat{t}^{w} = \log\frac{t^{w}}{d^{w}}, \quad \hat{t}^{h} = \log\frac{t^{h}}{d^{h}}$$

where (cx, cy) represents the center coordinates of the true box, w and h represent the width and height respectively, and d represents the candidate box. The small-batch random gradient descent method is an algorithm for optimizing the loss function of the model, and its formula is shown as formula (6):

$$\theta_j := \theta_j - \alpha \frac{1}{b} \sum_{i=1}^{b} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr) x_j^{(i)} \tag{6}$$

where $\theta$ represents a parameter to be updated, b represents the number of samples required for each iteration, $i = 1, 2, \ldots, b$ indexes the samples, $j = 0, 1$ represents the feature number, $\alpha$ represents the learning rate, $h_\theta(x^{(i)})$ represents the hypothesis function, $x^{(i)}$ represents the i-th sample feature, and $y^{(i)}$ represents the output corresponding to the i-th sample.
3. Testing pyramid twin network model:
the trained pyramid twin network model is utilized to carry out target tracking test, as shown in fig. 2, and the specific steps are as follows:
(1) Respectively inputting a first frame and a second frame of an original video sequence in the ILSVRC into the target subnet and the search subnet of the twin network obtained through training by the invention;
(2) Inputting the output results of the two sub-networks into the pyramid network which is trained by the method to obtain candidate frames of each feature layer in the pyramid network;
(3) Inputting each layer of characteristic layer of the pyramid network into the classification positioning parallel network to obtain a test result;
(4) Inputting the test result into a target subnet of the twin network, and inputting the next frame in the original video sequence into a search subnet of the twin network;
(5) Repeating the steps (2) to (4), thereby realizing the real-time positioning and tracking of the target.
3. Target tracking based on embedded tracking system
The trained pyramid twin network model parameters are transmitted and stored into a data storage module in an embedded tracking system through a network cable or USB, and then the actual application of target tracking is carried out based on the model parameters, and the specific steps are as follows:
(1) Transplanting the embedded Linux operating system into an embedded tracking system hardware platform;
(2) The trained model parameters and the pyramid twin network model are transmitted and stored into a data storage module in the embedded tracking system through a network cable or USB;
(3) Starting a camera, collecting a test video, and storing the test video in a data storage module;
(4) The pyramid twin network model algorithm provided by the invention is operated, model parameters stored in the data storage module are automatically called, and video data streams stored in the data storage module are processed in real time;
(5) According to the processing result of the algorithm, the target position is automatically framed in the video stream, and 3 motors are controlled by combining with the PID control algorithm to perform corresponding rotation, pitching and rolling motions, so that the camera can stably track and position the target in real time.
The foregoing is merely a preferred embodiment of the present invention and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present invention, which are intended to be comprehended within the scope of the present invention.