CN110796093A - Target tracking method and device, computer equipment and storage medium - Google Patents

Target tracking method and device, computer equipment and storage medium

Info

Publication number
CN110796093A
CN110796093A
Authority
CN
China
Prior art keywords
frame, frames, target, image, continuous
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911044271.6A
Other languages
Chinese (zh)
Inventor
周康明
朱月萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Eye Control Technology Co Ltd
Original Assignee
Shanghai Eye Control Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Eye Control Technology Co Ltd filed Critical Shanghai Eye Control Technology Co Ltd
Priority to CN201911044271.6A priority Critical patent/CN110796093A/en
Publication of CN110796093A publication Critical patent/CN110796093A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/48 Matching video sequences
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V 20/584 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads, of vehicle lights or traffic lights

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a target tracking method and apparatus, a computer device, and a storage medium. The method comprises the following steps: acquiring a video source, wherein the video source comprises at least K+1 frames of images, and the positions of target frames are marked on the continuous K frames of images of the video source; inputting the positions of the target frames marked on the continuous K frames of images into a preset prediction network, and determining the position of a prediction frame of the target on the next frame image of the continuous K frames, wherein the preset prediction network is obtained by training on the positions of sample target frames in sample images; and, according to the position of the prediction frame, intercepting the image corresponding to the position of the prediction frame on the next frame image of the continuous K frames, and inputting the intercepted image into a preset tracking model to obtain the tracking position of the target on the next frame image of the continuous K frames. By adopting the method, the calculation efficiency and the tracking accuracy of the tracking process can be improved.

Description

Target tracking method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a target tracking method and apparatus, a computer device, and a storage medium.
Background
With the rapid development of society and the economy, people's living standards continue to improve, and vehicles, as one of the most convenient means of transportation, have become increasingly popular. However, the number of vehicle traffic violations has grown accordingly, and some violations even seriously affect the normal operation of road traffic. Vehicles therefore need to be tracked accurately so that it can be judged more accurately whether a vehicle has committed a violation.
Vehicle tracking means predicting the size and position of a target vehicle in subsequent frames given its size and position in the initial frame of a video sequence. During tracking, the initial frame image is usually used as a template image, and the template image is used to perform full-range detection on each subsequent frame image to complete the tracking of the target vehicle.
However, this tracking process is excessively computationally intensive, resulting in low calculation efficiency.
Disclosure of Invention
In view of the above, it is necessary to provide a target tracking method, an apparatus, a computer device, and a storage medium capable of improving the calculation efficiency.
A method of target tracking, the method comprising:
acquiring a video source; the video source comprises at least K+1 frames of images, and the positions of target frames are marked on the continuous K frames of images of the video source;
respectively inputting the positions of the target frames marked on the continuous K frames of images into a preset prediction network, and determining the position of a prediction frame of the target on the next frame image of the continuous K frames; the preset prediction network is obtained by training on the positions of sample target frames in sample images;
and according to the position of the prediction frame, intercepting an image corresponding to the position of the prediction frame on the next frame image of the continuous K frames, and inputting the intercepted image into a preset tracking model to obtain the tracking position of the target on the next frame image of the continuous K frames.
In one embodiment, the capturing, according to the position of the prediction frame, an image corresponding to the position of the prediction frame on a frame image subsequent to the consecutive K frames, and inputting the captured image into a preset tracking model includes:
amplifying the predicted frame position according to a preset proportion to obtain an amplified frame position;
and intercepting an image corresponding to the position of the amplification frame on the next frame image of the continuous K frames, and inputting the image corresponding to the position of the amplification frame into a preset tracking model.
In one embodiment, the predetermined ratio is a ratio greater than 1 and less than 2.
In one embodiment, the step of inputting the captured image into the preset tracking model to obtain the tracking position of the target on the next frame image of the consecutive K frames includes:
according to the position of a target frame marked on a Kth frame image in continuous K frames of a video source, capturing an image corresponding to the marked position of the target frame on the Kth frame image in the continuous K frames, and inputting the image corresponding to the marked position of the target frame on the Kth frame image in the continuous K frames into a template branch network for feature extraction to obtain the feature of the target;
and inputting an image corresponding to the position of the intercepted prediction frame on the next frame image of the continuous K frames into the detection branch network, and obtaining the tracking position of the target on the next frame image of the continuous K frames by combining the characteristics of the target output by the template branch network.
In one embodiment, the method for training the preset predictive network includes:
acquiring a sample video source; the sample video source comprises at least N+1 frames of sample images, and the positions of target frames are marked on the sample images of the sample video source;
and taking the first N frames of images of the sample video source as the input of the initial prediction network, taking the position of a prediction frame of the target on the (N+1)-th frame of image as the output of the initial prediction network, and training the initial prediction network to obtain the prediction network.
In one embodiment, the training of the initial prediction network with the N consecutive frame images of the sample video source as the input of the initial prediction network and the prediction frame position of the target on the next frame image of the N consecutive frame images as the output of the initial prediction network to obtain the prediction network includes:
inputting the continuous N frames of images of the sample video source into an initial prediction network to obtain the position of a prediction frame of a target on a next frame of image of the continuous N frames of images;
inputting the position of a prediction frame of a target on a frame image next to the continuous N frame images and the position of a marked target frame on the frame image next to the continuous N frame images into a preset loss function to obtain a value of the loss function;
and adjusting parameters of the initial prediction network according to the value of the loss function until the value of the loss function reaches a preset standard value, so as to obtain the prediction network.
In one embodiment, the respectively inputting the positions of the target frames marked on the consecutive K frame images into the preset prediction network and determining the position of the prediction frame of the target on the next frame image of the consecutive K frames includes:
respectively inputting the positions of the target frames marked on the continuous K frames of images into an activation layer for processing to obtain activation information corresponding to each frame of image;
and respectively inputting the activation information corresponding to each frame of image into a prediction layer to carry out full-connection convolution processing, so as to obtain the position of a prediction frame of the target on the next frame of image of the continuous K frames.
An object tracking apparatus, the apparatus comprising:
the acquisition module is used for acquiring a video source; the video source comprises at least K+1 frames of images, and the positions of target frames are marked on the continuous K frames of images of the video source;
the prediction module is used for respectively inputting the positions of the target frames marked on the continuous K frames of images into a preset prediction network and determining the position of a prediction frame of the target on the next frame image of the continuous K frames; the preset prediction network is obtained by training on the positions of sample target frames in sample images;
and the determining module is used for intercepting an image corresponding to the position of the prediction frame on the next frame image of the continuous K frames according to the position of the prediction frame, inputting the intercepted image to a preset tracking model, and obtaining the tracking position of the target on the next frame image of the continuous K frames.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a video source; the video source comprises at least K+1 frames of images, and the positions of target frames are marked on the continuous K frames of images of the video source;
respectively inputting the positions of the target frames marked on the continuous K frames of images into a preset prediction network, and determining the position of a prediction frame of the target on the next frame image of the continuous K frames; the preset prediction network is obtained by training on the positions of sample target frames in sample images;
and according to the position of the prediction frame, intercepting an image corresponding to the position of the prediction frame on the next frame image of the continuous K frames, and inputting the intercepted image into a preset tracking model to obtain the tracking position of the target on the next frame image of the continuous K frames.
A readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring a video source; the video source comprises at least K+1 frames of images, and the positions of target frames are marked on the continuous K frames of images of the video source;
respectively inputting the positions of the target frames marked on the continuous K frames of images into a preset prediction network, and determining the position of a prediction frame of the target on the next frame image of the continuous K frames; the preset prediction network is obtained by training on the positions of sample target frames in sample images;
and according to the position of the prediction frame, intercepting an image corresponding to the position of the prediction frame on the next frame image of the continuous K frames, and inputting the intercepted image into a preset tracking model to obtain the tracking position of the target on the next frame image of the continuous K frames.
According to the target tracking method and apparatus, the computer device, and the storage medium, a video source is obtained, where the video source comprises at least K+1 frames of images and the positions of target frames are marked on the continuous K frames of the video source; the positions of the target frames marked on the continuous K frames of images are input into a preset prediction network to determine the position of a prediction frame of the target on the next frame image of the continuous K frames; according to the position of the prediction frame, the image corresponding to the position of the prediction frame is intercepted from the next frame image of the continuous K frames; and the intercepted image is input into a preset tracking model to obtain the tracking position of the target on the next frame image of the continuous K frames. In this method, the image finally input into the tracking model for tracking is intercepted from the next frame image of the continuous K frames according to the position of the prediction frame, rather than being the whole next frame image, so the intercepted image is smaller than the whole frame. Therefore, when the intercepted image is used for target tracking, the amount of calculation in the tracking process is reduced, the time consumed by the tracking model is reduced, the calculation efficiency of the tracking process is improved, and the performance of the tracking model is improved. In addition, because the position of the prediction frame is predicted by the prediction network from the target frame positions of the continuous K frames of images, the obtained position of the prediction frame is more accurate, which improves the target tracking accuracy.
Drawings
FIG. 1 is a diagram illustrating an internal structure of a computer device according to an embodiment;
FIG. 2 is a schematic flow chart diagram of a target tracking method in one embodiment;
FIG. 3 is a schematic flow chart diagram of a target tracking method in another embodiment;
FIG. 4a is a schematic flow chart diagram illustrating a target tracking method in accordance with another embodiment;
FIG. 4b is a schematic diagram of a tracking network in another embodiment;
FIG. 5 is a flowchart illustrating a method for training a predictive network according to another embodiment;
FIG. 6 is a diagram illustrating a specific training flow of a training method for a predictive network according to another embodiment;
FIG. 7a is a schematic flow chart diagram illustrating a target tracking method in accordance with another embodiment;
FIG. 7b is a schematic structural diagram of a prediction network in the target tracking method according to another embodiment;
FIG. 7c is a schematic diagram illustrating the calculation of a prediction network in the target tracking method according to another embodiment;
FIG. 8 is a block diagram of a target tracking device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Target tracking refers to predicting the size and position of a target object in subsequent frames given its size and position in the initial frame of a video sequence. In recent years, with the rapid development of deep learning, convolutional neural networks (CNNs) have been applied to target tracking; their extremely strong ability to extract and express target features is of great significance for improving the accuracy and robustness of target tracking. SiameseRPN (Siamese Region Proposal Network) is one of the CNN-based target tracking methods. It mainly comprises a template branch and a detection branch. The template branch extracts the features of the first frame, which are divided into two kinds: one kind is used to classify foreground and background, and the other is used to regress the target position. The detection branch passes the detection frame (a subsequent frame image) through the same convolutional neural network as the template branch, and the resulting feature map is correlated with the feature map of the template frame for matching. Under normal conditions, the target displacement between adjacent frames is not too large; if the detection range of the detection frame were the whole frame image, the amount of useless calculation would increase greatly and more computing resources would be consumed. Narrowing the detection range of the detection frame can therefore effectively improve the calculation efficiency of the model, which is very important for improving model performance and reducing the time the model consumes. In practical applications of the SiameseRPN network structure, the input of the detection branch is usually obtained by intercepting, around the tracking frame, a region 2-3 times the size of the template frame. Although this reduces the amount of calculation of the detection branch to a certain extent, the motion direction and speed of the target are unknown, so this way of choosing the detection range is highly subjective: if the intercepted range is large, the performance gain from reducing the calculation amount is limited; if the intercepted range is small, the target may already have moved out of the detection range because of its fast motion, so that the tracked target is incomplete or disappears, which affects tracking performance. This approach therefore also has the problem of low tracking accuracy. The embodiments of the present application provide a target tracking method and apparatus, a computer device, and a storage medium, which aim to solve these technical problems.
The target tracking method provided by the embodiment of the application can be applied to computer equipment, and the internal structure diagram of the computer equipment can be shown in fig. 1. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of object tracking. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 1 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, or combine certain components, or have a different arrangement of components.
The execution subject of the embodiments of the present application may be a target tracking apparatus or a computer device, and the following embodiments will be described with the computer device as the execution subject.
In an embodiment, a target tracking method is provided. This embodiment relates to the specific process of predicting, from the continuous K frame images, the position of the prediction frame where the target is located on the next frame image after the continuous K frames, intercepting an image using the position of the prediction frame, and tracking the target with the intercepted image. As shown in fig. 2, the method may include the following steps:
s202, acquiring a video source; the video source comprises at least K +1 frame images, and the positions of target frames are marked on the continuous K frame images of the video source.
Here, the size of K may be determined according to actual conditions and may be any integer such as 1, 2, or 3. The frame images in the video source are frame images with a temporal relationship, and the next frame image after the continuous K frames is an image acquired after the K-th frame image of the continuous K frames. For example, suppose one frame image is acquired per second, so that the images within 10 seconds are frames 1 to 10. If K is 10, the continuous K frames may be frames 1-10, frames 2-11, frames 3-12, and so on, and the corresponding next frame after the continuous K frames is frame 11, frame 12, frame 13, and so on. In addition, the position of the target frame marked on the continuous K frame images of the video source is the frame position information of the target on those frame images. The target frame position information may include the center point coordinates of the target and the width and height of the frame where the target is located, and may also include other information; the center point coordinates of the target may be one-dimensional, two-dimensional, or three-dimensional coordinates, and the width and height may represent the size of the target. The target here may be a vehicle, a human face, an animal, or the like.
Specifically, when a vehicle passes a camera at a monitoring checkpoint, the camera acquires video images of the target in real time; these multiple frames of video images may be called a video source. The acquired video images are transmitted in real time to the computer device connected to the camera, so that the computer device obtains the acquired video source in real time and can process it. During processing, the computer device first performs target detection on the continuous K frame images in the video source using a target detection algorithm, thereby obtaining the frame position information of the target on the continuous K frame images, recorded as the positions of the target frames, and marks the target frame positions detected on each of the continuous K frame images on the corresponding images. The target detection algorithm here may be the YOLO algorithm, the Faster R-CNN algorithm, the SSD algorithm, or the like.
S204, respectively inputting the positions of the target frames marked on the continuous K frames of images into a preset prediction network, and determining the position of a prediction frame of the target on the next frame image of the continuous K frames; the preset prediction network is obtained by training on the positions of sample target frames in sample images.
The preset prediction network may be an LSTM (Long Short-Term Memory) network. An LSTM network is a recurrent neural network suitable for processing and predicting events with relatively long intervals and delays in a time sequence, so the LSTM is used here to estimate the motion trend of the target. In addition, the prediction network may be trained with the positions of the target frames marked on the sample images, and the information included in the marked target frame positions may be the same for all sample images. The information included in the position of the prediction frame may be the same as or different from the information included in the position of the target frame; for example, it may include the predicted center point coordinates of the target on the next frame image after the continuous K frames, and the predicted width and height of the frame where the target is located on that image, or it may include other prediction information, which is not specifically limited in this embodiment.
Specifically, after obtaining the continuous K frame images and the target frame position information marked on them, the computer device may input the target frame position information of the continuous K frame images into the prediction network in time order. The prediction network processes the target frame position information of each of the continuous K frame images and then performs convolution, classification, or other processing on all the outputs, thereby obtaining the position of the prediction frame of the target on the next frame image after the continuous K frames.
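As a minimal illustration of this step, the following sketch assumes a PyTorch LSTM over the K box positions with a linear layer mapping the last hidden state to the predicted box; the layer sizes and names are assumptions for illustration, not details taken from the patent:

```python
# Hypothetical sketch of S204: the K marked box positions are fed to an LSTM
# in time order and a linear head maps the final hidden state to the
# prediction frame on frame K+1. Sizes are assumed.
import torch
import torch.nn as nn

class BoxPredictor(nn.Module):
    def __init__(self, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=4, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 4)  # (x, y, w, h) of the prediction frame

    def forward(self, boxes):        # boxes: (batch, K, 4)
        h, _ = self.lstm(boxes)      # h: (batch, K, hidden_size)
        return self.head(h[:, -1])   # predicted box on the next frame

predictor = BoxPredictor()
k_boxes = torch.rand(1, 10, 4)       # K = 10 marked (x, y, w, h) positions
pred_box = predictor(k_boxes)        # position of the prediction frame
```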
And S206, according to the position of the prediction frame, intercepting an image corresponding to the position of the prediction frame on the next frame image of the continuous K frames, and inputting the intercepted image into a preset tracking model to obtain the tracking position of the target on the next frame image of the continuous K frames.
When the image is intercepted from the next frame image of the continuous K frames using the position of the prediction frame, an image of exactly the size of the prediction frame may be intercepted, or an image larger than the prediction frame may be intercepted, or an image smaller than the prediction frame may be intercepted. If the size of the intercepted image is not equal to the size of the prediction frame, the interception may scale the prediction frame proportionally or non-proportionally, which is not specifically limited in this embodiment. For example, suppose the position of the prediction frame includes the center point coordinates and the width and height of the prediction frame, with the center point coordinates being (3,2), the width being 5, and the height being 10. When intercepting from the next frame image of the continuous K frames, the position of coordinate (3,2) on that image is found first; then, a frame with width 5 and height 10 centered on the found position is determined, which is the position of the prediction frame; finally, the image at the position of the prediction frame can be intercepted.
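The following is an illustrative sketch of this interception, assuming the prediction frame is given as a pixel-coordinate center (cx, cy) with width w and height h; the boundary clamping is an added safeguard not spelled out above:

```python
import numpy as np

def crop_prediction_box(image, cx, cy, w, h):
    # Clamp the box to the image bounds so a frame near the border
    # still yields a valid (possibly truncated) crop.
    x1 = max(int(round(cx - w / 2)), 0)
    y1 = max(int(round(cy - h / 2)), 0)
    x2 = min(int(round(cx + w / 2)), image.shape[1])
    y2 = min(int(round(cy + h / 2)), image.shape[0])
    return image[y1:y2, x1:x2]

frame = np.zeros((720, 1280, 3), dtype=np.uint8)           # next frame after the K frames
patch = crop_prediction_box(frame, cx=3, cy=2, w=5, h=10)  # example box from the text
```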
The tracking position here refers to the specific tracking position of the target, obtained by using the tracking model to perform detection and other related processing on the next frame image of the continuous K frames. The tracking model may be a CNN (Convolutional Neural Network) model, for example the SiameseRPN (Siamese Region Proposal Network) model, but may also be another model.
Specifically, after obtaining the position information of the prediction frame of the target on the next frame image of the continuous K frames, the computer device may locate the prediction frame on that image according to its size and position, intercept the image at the located position with the size of the prediction frame, and input the intercepted image into the tracking model for processing, so as to obtain the specific tracking position of the target on the next frame image of the continuous K frames. After the specific tracking position is obtained, the target can be tracked accurately. Since the size of the intercepted image is generally smaller than that of the original image, when the tracking model processes the intercepted image, the amount of input data is smaller than when processing the whole original image, so the amount of calculation is small and the calculation efficiency is high.
In the target tracking method, a video source is obtained, where the video source comprises at least K+1 frames of images and the positions of target frames are marked on the continuous K frames of the video source; the positions of the target frames marked on the continuous K frames of images are input into a preset prediction network to determine the position of a prediction frame of the target on the next frame image of the continuous K frames; according to the position of the prediction frame, the image corresponding to the position of the prediction frame is intercepted from the next frame image of the continuous K frames; and the intercepted image is input into a preset tracking model to obtain the tracking position of the target on the next frame image of the continuous K frames. Because the image finally input into the tracking model for tracking is intercepted from the next frame image of the continuous K frames according to the position of the prediction frame, rather than being the whole next frame image, the intercepted image is smaller than the whole frame. Therefore, when the intercepted image is used for target tracking, the amount of calculation in the tracking process is reduced, the time consumed by the tracking model is reduced, the calculation efficiency of the tracking process is improved, and the performance of the tracking model is improved. In addition, because the position of the prediction frame is predicted by the prediction network from the target frame positions of the continuous K frames of images, the obtained position of the prediction frame is more accurate, which improves the target tracking accuracy.
In another embodiment, another target tracking method is provided. This embodiment relates to the specific process of amplifying the position of the prediction frame and intercepting the image according to the position of the amplification frame. On the basis of the above embodiment, as shown in fig. 3, the above S206 may include the following steps:
s302, amplifying the predicted frame position according to a preset proportion to obtain an amplified frame position.
For example, assuming the predicted frame position is two-dimensional, including width and height, the preset ratio for the width may be set to 1.1 and the ratio for the height to 1.2; of course, the preset ratio for both width and height may be 1.1 or 1.2, which is not specifically limited in this embodiment. In addition, in this step, optionally, the preset ratio may be a ratio greater than 1 and less than 2; that is, whether the prediction frame is amplified proportionally or non-proportionally, the amplification scale factor is greater than 1 and less than 2.
Specifically, after obtaining the position information of the prediction frame of the target on the next frame image of the continuous K frames, the computer device may amplify the prediction frame in place according to the preset ratio; that is, the position of the center point of the prediction frame is unchanged, and the width and height are enlarged according to the preset ratio to obtain a new frame and its position information. The new frame is denoted the amplification frame, and its position information is denoted the position of the amplification frame.
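A minimal sketch of this amplification, assuming the ratio is applied per dimension with the center fixed; the ratio values 1.1 and 1.2 follow the example above:

```python
def amplify_box(cx, cy, w, h, ratio_w=1.1, ratio_h=1.2):
    # The center point stays fixed; only the width and height grow.
    return cx, cy, w * ratio_w, h * ratio_h

cx, cy, w, h = amplify_box(3, 2, 5, 10)   # -> (3, 2, 5.5, 12.0)
```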
S304, capturing an image corresponding to the position of the amplification frame on the next frame image of the continuous K frames, and inputting the image corresponding to the position of the amplification frame into a preset tracking model.
Specifically, after obtaining the position information of the amplification frame of the target on the subsequent frame image of the continuous K frames, the computer device may find the position of the amplification frame on the subsequent frame image of the continuous K frames according to the size and the position of the amplification frame, intercept the position found by the size of the amplification frame to obtain an intercepted image, input the intercepted image to the tracking model for processing to obtain a specific tracking position of the target on the subsequent frame image of the continuous K frames, and after obtaining the specific tracking position, the tracking model may accurately track the target.
It should be noted that, the computer device may also find the position of the prediction frame on the next frame image of the consecutive K frames, then perform amplification on the found position to obtain the position of the amplification frame, and then perform truncation on the position of the amplification frame by the size of the amplification frame to obtain the truncated image.
In the target tracking method provided in this embodiment, the position of the prediction frame is amplified according to a preset ratio to obtain the position of the amplification frame, the image corresponding to the position of the amplification frame is intercepted from the next frame image of the continuous K frames, and the image corresponding to the position of the amplification frame is input into the preset tracking model. Because the obtained prediction frame is expanded slightly at its edges and the image is then intercepted with the expanded size, calculation errors caused by an intercepted image that is too small can be avoided; at the same time, the detection range of the tracking model is still effectively reduced, useless calculation in the tracking process is effectively reduced, and the calculation efficiency is therefore improved.
In another embodiment, another target tracking method is provided, and this embodiment relates to a specific process of how to input the captured image into the preset tracking model to obtain a tracking position of the target on a frame of image subsequent to the consecutive K frames if the preset tracking model includes a template branch network and a detection branch network. On the basis of the above embodiment, as shown in fig. 4a, the above S206 may include the following steps:
s402, according to the position of a target frame marked on the Kth frame image in the continuous K frames of the video source, capturing an image corresponding to the marked position of the target frame on the Kth frame image in the continuous K frames, and inputting the image corresponding to the marked position of the target frame on the Kth frame image in the continuous K frames into a template branch network for feature extraction to obtain the feature of the target.
S404, inputting the image corresponding to the position of the intercepted prediction frame on the next frame image of the continuous K frames into the detection branch network, and combining the characteristics of the target output by the template branch network to obtain the tracking position of the target on the next frame image of the continuous K frames.
In this embodiment, the preset tracking model may be a CNN model, and as shown in fig. 4b, the preset tracking model may include a template branch network and a detection branch network, the template branch network is configured to perform feature extraction on a target region in a template frame image (feature extraction may be performed in an intermediate layer of the template branch network) to obtain a feature of the target, and the detection branch network is configured to extract a feature of the detection region in a subsequent frame image according to the feature extracted by the template branch network, match the feature with the feature extracted by the template branch network, and track the target according to a matching result.
Specifically, the frame positions where the target is located (i.e., the target frame positions) are marked on the continuous K frame images of the video source; that is, every one of the continuous K frame images includes a marked target frame position. The computer device can therefore intercept the image at the target frame position on the K-th frame image of the continuous K frames according to the target frame position marked on it, and input the intercepted image into the template branch network for feature extraction to obtain features related to the target. In addition, the computer device can intercept an image from the next frame image of the continuous K frames according to the position of the prediction frame or the position of the amplification frame and input it into the detection branch network, which likewise extracts features to obtain the detection target features. The detection target features are matched with the features extracted by the template branch network; when the matching succeeds, the position corresponding to the detection target features on the next frame image of the continuous K frames is the specific tracking position of the target.
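The following sketch illustrates the two-branch matching idea with a toy shared backbone and a cross-correlation, in the style of Siamese trackers; the backbone, crop sizes, and channel counts are assumptions for illustration rather than the patent's actual network:

```python
# Simplified sketch: both crops pass through the same convolutional backbone,
# and the template feature is slid over the detection feature as a
# correlation kernel; the response peak indicates the best match.
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(                   # shared weights for both branches
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
)

template = torch.rand(1, 3, 64, 64)         # crop at the target frame on frame K
search = torch.rand(1, 3, 128, 128)         # crop at the prediction frame on frame K+1

z = backbone(template)                      # (1, 64, 16, 16) template features
x = backbone(search)                        # (1, 64, 32, 32) detection features
response = F.conv2d(x, z)                   # (1, 1, 17, 17) cross-correlation map
peak = response.flatten().argmax()          # strongest match -> tracking position
```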
In the target tracking method provided by this embodiment, according to a target frame position labeled on a kth frame image in consecutive K frames of a video source, an image corresponding to the labeled target frame position is captured from the kth frame image in the consecutive K frames, and the image corresponding to the target frame position labeled on the kth frame image in the consecutive K frames is input to a template branch network for feature extraction, so as to obtain a feature of a target; and inputting an image corresponding to the position of the intercepted prediction frame on the next frame image of the continuous K frames into the detection branch network, and obtaining the tracking position of the target on the next frame image of the continuous K frames by combining the characteristics of the target output by the template branch network. In the embodiment, the target can be tracked by using the template branch network and the detection branch network of the tracking model, and the tracking model has extremely high target feature extraction and expression capability, so that the accuracy and robustness of target tracking can be improved by using the tracking model.
In another embodiment, another target tracking method is provided, and the embodiment relates to a specific process for training a prediction network. On the basis of the above embodiment, as shown in fig. 5, the training method of the prediction network may include the following steps:
s502, acquiring a sample video source; the sample video source comprises at least N +1 frames of sample images, and the positions of target frames of the sample images of the sample video source are marked.
The size of N here may be determined according to actual conditions; it may be the same as K and may be any integer such as 1, 2, or 3. The frame images in the sample video source are also frame images with a temporal relationship, and the (N+1)-th frame image is an image acquired after the N-th frame image. In addition, the position of the target frame marked on the first N frames of sample images of the sample video source is the frame position information of the target on those sample images. The target frame position information may include the center point coordinates of the target and the width and height of the frame where the target is located, and may also include other information; the center point coordinates may be one-dimensional, two-dimensional, or three-dimensional coordinates, and the width and height may represent the size of the target.
Specifically, the sample video source may be the same as the above-mentioned acquisition method in S202, or may be another acquisition method such as acquiring a history sample video source stored in advance, and the present embodiment is not particularly limited. After the computer device obtains the sample video source, the video source can be processed, when the processing is performed, the computer device can perform target detection on each frame of sample image in the sample video source by using a target detection algorithm, so that frame position information of a target on each frame of sample image can be obtained and recorded as a target frame position, and the target frame positions detected on each frame of sample image are marked on the corresponding image.
S504, taking continuous N frames of images of the sample video source as input of the initial prediction network, taking the position of a prediction frame of a target on a frame of image behind the continuous N frames of images as output of the initial prediction network, and training the initial prediction network to obtain the prediction network.
In this step, N and K may be the same or different; that is, how many frame images are used to predict the next frame image may be determined according to actual conditions both when the prediction network is trained and when it is actually used, and the two numbers may be the same or different. For example, N and K may both be 10, and of course each may take other values.
In addition, when the prediction network is trained, not every frame image is used only once; some frame images are used repeatedly during training. For example, frames 1-10 can be used to predict frame 11, frames 2-11 to predict frame 12, frames 3-12 to predict frame 13, and so on; in this case the images of frames 2 and 3, among others, are each used more than once. Next, during training, when predicting the frame position on the next frame image of each group of N consecutive frame images, the target frames marked on those N consecutive frame images are used. For example, when predicting frame 12, the inputs for frames 2 to 11 are all the target frames marked on the images of frames 2 to 11; it is not the case that the input for frame 11 is the target frame obtained by the previous prediction.
It should be noted that the prediction network behaves differently in actual use and in training. Continuing the example of predicting frame 12 from frames 2-11: when the prediction network is used to predict frame 12 in actual use, the inputs for frames 2 to 10 are the marked target frames, while the input for frame 11 is the target frame obtained by the previous prediction, which is exactly the opposite of training.
Optionally, when training is specifically performed, the training may be performed by using specific steps shown in fig. 6, and as shown in fig. 6, the specific training steps may include the following steps S602 to S606:
s602, inputting the continuous N frames of images of the sample video source into the initial prediction network to obtain the prediction frame position of the target on the next frame of images of the continuous N frames of images.
S604, inputting the position of the target prediction frame on the next frame image of the continuous N frame images and the position of the target frame marked on the next frame image of the continuous N frame images into a preset loss function to obtain the value of the loss function.
And S606, adjusting parameters of the initial prediction network according to the value of the loss function until the value of the loss function reaches a preset standard value, so as to obtain the prediction network.
Specifically, after obtaining the target frame positions marked on the sample images, the computer device may input the target frame positions of the continuous N frame images into the initial prediction network in time order. The prediction network processes the target frame position information of each of the continuous N frame images and then performs convolution, classification, or other processing on all the outputs, thereby obtaining the position of the prediction frame of the target on the next frame image after the continuous N frame images. The marked target frame position on that next frame image and the obtained prediction frame position are then input into the preset loss function to calculate the loss between them; this loss is taken as the value of the loss function, and the relevant parameters of the initial prediction network are adjusted according to the value of the loss function until it reaches the standard value, thereby obtaining the prediction network. The standard value here may be set according to actual conditions, and the loss may be an error, a variance, a norm, or the like between the predicted frame position and the marked target frame position. Illustratively, the continuous N frames may be frames 1-10, in which case the next frame image after the continuous N frame images is frame 11; the continuous N frames may also be frames 2-11, in which case the next frame image is frame 12; and so on.
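A minimal training-loop sketch of steps S602-S606, reusing the BoxPredictor sketch above; the smooth L1 loss, the Adam optimizer, and the stopping threshold are illustrative assumptions, since the patent only requires a preset loss function and a preset standard value:

```python
import torch
import torch.nn as nn

model = BoxPredictor()                      # LSTM prediction network (sketched earlier)
criterion = nn.SmoothL1Loss()               # assumed choice of "preset loss function"
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
target_loss = 1e-3                          # assumed "preset standard value"

for step in range(10000):
    seq = torch.rand(8, 10, 4)              # N=10 consecutive marked boxes (dummy data)
    gt_next = torch.rand(8, 4)              # marked box on frame N+1 (dummy data)
    pred_next = model(seq)                  # predicted box on frame N+1
    loss = criterion(pred_next, gt_next)    # value of the loss function
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                        # adjust parameters per the loss value
    if loss.item() <= target_loss:          # stop once the preset value is reached
        break
```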
It should be noted that the prediction frame position here is not a final prediction value, but is only a process quantity, and after the prediction network is trained, the obtained prediction frame position is a true prediction frame position.
In the target tracking method provided by this embodiment, a sample video source is obtained, where the sample video source includes at least N +1 frames of sample images with target frame positions already labeled, and consecutive N frames of images of the sample video source are used as input of an initial prediction network, and a prediction frame position of a target on a next frame of image of the consecutive N frames of images is used as output of the initial prediction network, and the initial prediction network is trained to obtain a prediction network. In this embodiment, since the final prediction network is obtained by training the sample frame image labeled with the position of the target frame, the obtained prediction network is relatively accurate, and further, when the target trajectory is predicted by using the accurate prediction network, the obtained prediction position is relatively accurate.
In another embodiment, another target tracking method is provided, and this embodiment relates to a specific process of respectively inputting the positions of target frames marked on consecutive K frame images to a preset prediction network and determining the positions of the target frames on the subsequent frame image of the consecutive K frames if the preset prediction network includes an active layer and a prediction layer. On the basis of the above embodiment, as shown in fig. 7a, the above S206 may include the following steps:
and S702, respectively inputting the positions of the target frames marked on the continuous K frames of images into an activation layer for processing to obtain activation information corresponding to each frame of image.
And S704, respectively inputting the activation information corresponding to each frame of image into a prediction layer to carry out full-connection convolution processing, so as to obtain the position of a prediction frame of the target on the next frame of image of the continuous K frames.
In this embodiment, take tracking a vehicle as an example, with the target frame position given by the center point coordinates and the width and height of the target frame. Before testing or training starts, for a specific marked target (one target vehicle) in each frame sequence of a video source, the center point coordinates of the target in each video frame are taken as (x, y), the width of the target frame as w, and the height of the target frame as h, and (x, y, w, h) is written line by line into a txt document. The txt document thus contains training data or test data in the form [(x1, y1, w1, h1), (x2, y2, w2, h2), ..., (xn, yn, wn, hn)], where n is the total number of frames of the video; one txt document stores the data of one target vehicle.
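A small sketch of loading one such document into an n × 4 matrix, assuming each line stores the four values whitespace-separated; the file name is hypothetical:

```python
import numpy as np

def load_track(path):
    rows = []
    with open(path) as f:
        for line in f:                      # one line per video frame
            x, y, w, h = map(float, line.split())
            rows.append((x, y, w, h))
    return np.asarray(rows)                 # shape (n, 4), n = total frame count

# track = load_track("target_vehicle_001.txt")  # hypothetical file name
```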
After the txt documents are built, the computer device may read the marked frame information line by line from each txt document, where each line of data is represented by a 1 × 4 matrix L, i.e., Li = [xi, yi, wi, hi], with i denoting the line number of the data. A sequence of k such L (k lines) is input into the prediction network (here, an LSTM network) as shown in fig. 7b, where Lpre is the predicted target frame position information, i.e., the position of the prediction frame. At each time t (i.e., each frame), for each input Lt in the input sequence, the prediction network updates the activation state ht of the hidden layer once, as shown in fig. 7c, finally obtaining h = (h1, h2, ..., ht, ..., hk). The specific calculation formulas of the activation layer of the prediction network are equations (1)-(6) below:
i_t = σ(W_i · L_t + U_i · h_{t-1} + b_i)  (1)
f_t = σ(W_f · L_t + U_f · h_{t-1} + b_f)  (2)
o_t = σ(W_o · L_t + U_o · h_{t-1} + b_o)  (3)
c̃_t = tanh(W_c · L_t + U_c · h_{t-1} + b_c)  (4)
c_t = i_t ⊙ c̃_t + f_t ⊙ c_{t-1}  (5)
h_t = o_t ⊙ tanh(c_t)  (6)
In equations (1)-(6), i_t, f_t, and o_t denote the input gate, the forget gate, and the output gate, respectively, σ denotes the sigmoid function, and ⊙ denotes elementwise multiplication. W denotes the input-to-hidden weight matrices, U denotes the state-to-state recurrent weight matrices, and b denotes the bias vectors. The hidden state of the prediction network is (c_t, h_t), where long-term memory is preserved in c_t; c_t is controlled by the input gate and the forget gate, and the output gate controls the update of h_t.
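The following numpy transcription performs one time step of equations (1)-(6); the hidden size, the random parameters, and the dictionary layout of the weights are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(L_t, h_prev, c_prev, W, U, b):
    # W, U, b each map a gate name ('i', 'f', 'o') or the candidate ('g')
    # to its weight matrix / bias vector.
    i = sigmoid(W["i"] @ L_t + U["i"] @ h_prev + b["i"])  # input gate, eq. (1)
    f = sigmoid(W["f"] @ L_t + U["f"] @ h_prev + b["f"])  # forget gate, eq. (2)
    o = sigmoid(W["o"] @ L_t + U["o"] @ h_prev + b["o"])  # output gate, eq. (3)
    g = np.tanh(W["g"] @ L_t + U["g"] @ h_prev + b["g"])  # candidate cell, eq. (4)
    c = i * g + f * c_prev                                # eq. (5)
    h = o * np.tanh(c)                                    # eq. (6)
    return h, c

d = 8                                    # illustrative hidden size
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((d, 4)) for k in "ifog"}
U = {k: rng.standard_normal((d, d)) for k in "ifog"}
b = {k: np.zeros(d) for k in "ifog"}
h, c = lstm_step(np.array([3.0, 2.0, 5.0, 10.0]),  # L_t = (x, y, w, h)
                 np.zeros(d), np.zeros(d), W, U, b)
```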
After each h is obtained, the h may be input into the prediction layer: for example, after all the h are connected by a fully connected layer, the result is input into a softmax layer, and the final prediction result, i.e., the position of the prediction frame, is obtained, which gives the center point position of the target on the (k+1)-th frame image together with the width and height of the target.
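One plausible reading of this prediction layer, given as an assumption: the hidden states h_1, ..., h_k are concatenated and passed through a fully connected layer that outputs the four box values for frame k+1 (the softmax step mentioned above is omitted here, since this sketch covers only the box regression):

```python
import torch
import torch.nn as nn

k, d = 10, 64                       # illustrative sequence length and hidden size
hs = torch.rand(1, k, d)            # h = (h_1, ..., h_k) from the activation layer
fc = nn.Linear(k * d, 4)            # fully connected prediction layer
Lpre = fc(hs.flatten(1))            # predicted (x, y, w, h) on frame k+1
```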
In the target tracking method provided by this embodiment, the prediction network includes an activation layer and a prediction layer. The positions of the target frames marked on the continuous K frame images are respectively input into the activation layer for processing to obtain the activation information corresponding to each frame image, and the activation information corresponding to each frame image is respectively input into the prediction layer for fully connected convolution processing to obtain the position of the prediction frame of the target on the next frame image of the continuous K frames. In this embodiment, the motion trend of the target in subsequent frames can be predicted by the prediction network from the time-ordered target frame position information of each frame image, based on how the position of the target moves across the frame images in time order, so the accuracy of the predicted motion trend is high.
It should be understood that although the individual steps in the flowcharts of figs. 2, 3, 4a, 5, 6, and 7a are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in figs. 2, 3, 4a, 5, 6, and 7a may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 8, there is provided a target tracking apparatus including: an obtaining module 10, a predicting module 11 and a determining module 12, wherein:
an obtaining module 10, configured to obtain a video source; the video source comprises at least K +1 frames of images, and the positions of target frames are marked on the continuous K frames of images of the video source;
the prediction module 11 is configured to input the positions of the target frames marked on the consecutive K frame images into a preset prediction network, and determine the position of the prediction frame of the target on the next frame image of the consecutive K frames; the preset prediction network is obtained by training on the positions of sample target frames in sample images;
and the determining module 12 is configured to intercept, according to the position of the prediction frame, an image corresponding to the position of the prediction frame on the next frame image of the consecutive K frames, and input the intercepted image to a preset tracking model to obtain a tracking position of the target on the next frame image of the consecutive K frames.
For specific limitations of the target tracking device, reference may be made to the above limitations of the target tracking method, which are not described herein again.
In another embodiment, another target tracking apparatus is provided, and on the basis of the above embodiment, the determining module 12 may include: an amplification unit and a truncation unit, wherein:
the amplification unit is used for amplifying the predicted frame position according to a preset proportion to obtain an amplification frame position;
and the interception unit is used for intercepting the image corresponding to the amplification frame position on the next frame image of the continuous K frames, and inputting the image corresponding to the amplification frame position into a preset tracking model.
Optionally, the preset ratio is greater than 1 and less than 2.
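As an illustration of the amplification and interception units, a short sketch follows; the (x, y, w, h) box format, the clipping to image bounds and the example ratio of 1.5 are assumptions rather than details disclosed by the patent:

```python
def enlarge_and_crop(image, box, ratio=1.5):
    """Enlarge an (x, y, w, h) prediction box about its center by `ratio`
    (greater than 1 and less than 2) and intercept that region from the image."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0           # center of the prediction box
    ew, eh = w * ratio, h * ratio               # enlarged width and height
    img_h, img_w = image.shape[:2]              # image is an H x W x C array
    x1, y1 = max(0, int(cx - ew / 2)), max(0, int(cy - eh / 2))
    x2, y2 = min(img_w, int(cx + ew / 2)), min(img_h, int(cy + eh / 2))
    return image[y1:y2, x1:x2]                  # region fed to the tracking model
```

Enlarging the crop keeps the target inside the search region even when the prediction is slightly off, while a ratio below 2 keeps the region small enough for efficient matching.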
In another embodiment, another target tracking apparatus is provided, where the preset tracking model includes a template branch network and a detection branch network, and on the basis of the foregoing embodiment, the determining module 12 may include: template unit and detecting element, wherein:
the template unit is used for intercepting, according to the target frame position marked on the K-th frame image in the consecutive K frames of the video source, the image corresponding to that marked target frame position, and inputting the intercepted image to the template branch network for feature extraction to obtain the features of the target;
and the detection unit is used for inputting the image corresponding to the prediction frame position intercepted from the next frame image of the consecutive K frames to the detection branch network, and obtaining the tracking position of the target on the next frame image of the consecutive K frames by combining the features of the target output by the template branch network.
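A hedged sketch of such a two-branch model is given below, in the spirit of Siamese trackers; the shared backbone, its layer sizes and the cross-correlation step are assumptions, since the patent does not disclose a concrete architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchTracker(nn.Module):
    """Template branch extracts target features; detection branch matches them
    against the region cropped at the predicted position (batch size 1)."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(           # shared feature extractor
            nn.Conv2d(3, 32, kernel_size=7, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3), nn.ReLU(),
        )

    def forward(self, template_patch, search_patch):
        z = self.backbone(template_patch)        # features of the K-th frame's box
        x = self.backbone(search_patch)          # features of the predicted region
        # Cross-correlate the template features over the search features; the
        # peak of the response map gives the tracking position in the region.
        return F.conv2d(x, z)                    # (1, 1, H', W') response map
```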
In another embodiment, another target tracking apparatus is provided, and on the basis of the above embodiment, the apparatus may further include a training module, which may include an obtaining unit and a training unit, wherein:
the acquisition unit is used for acquiring a sample video source; the sample video source comprises at least N +1 frames of sample images, and the positions of target frames of the sample images of the sample video source are marked;
and the training unit is used for taking the continuous N frame images of the sample video source as the input of an initial prediction network, taking the prediction frame position of the target on the frame image following the continuous N frame images as the output of the initial prediction network, and training the initial prediction network to obtain the prediction network.
Optionally, the training unit is further configured to input the continuous N frame images of the sample video source into the initial prediction network to obtain the prediction frame position of the target on the frame image following the continuous N frame images; input the prediction frame position of the target on that frame image, together with the target frame position marked on that frame image, into a preset loss function to obtain the value of the loss function; and adjust the parameters of the initial prediction network according to the value of the loss function until the value of the loss function reaches a preset standard value, so as to obtain the prediction network.
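A minimal sketch of this training loop follows; the Adam optimizer, the smooth-L1 loss and the stopping threshold are assumptions standing in for the "preset loss function" and "preset standard value", and `net` is assumed to map the N annotated boxes to one predicted box:

```python
import torch

def train_prediction_network(net, samples, lr=1e-3, threshold=1e-3):
    """samples yields (boxes_n, gt_box): the target frame positions of N
    consecutive frames and the marked box on the following frame."""
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    criterion = torch.nn.SmoothL1Loss()          # stand-in preset loss function
    for boxes_n, gt_box in samples:
        pred_box = net(boxes_n)                  # predicted box on frame N+1
        loss = criterion(pred_box, gt_box)       # compare against the annotation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                         # adjust the network parameters
        if loss.item() <= threshold:             # preset standard value reached
            break
    return net
```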
In another embodiment, another target tracking apparatus is provided, where the preset prediction network includes an activation layer and a prediction layer; on the basis of the above embodiment, the prediction module 11 may include a first processing unit and a second processing unit, wherein:
the first processing unit is used for respectively inputting the positions of the target frames marked on the continuous K frames of images to the activation layer for processing to obtain activation information corresponding to each frame of image;
and the second processing unit is used for respectively inputting the activation information corresponding to each frame of image into the prediction layer to carry out full-connection convolution processing so as to obtain the position of a prediction frame of the target on the next frame of image of the continuous K frames.
The modules in the target tracking apparatus described above may be implemented wholly or partially by software, hardware or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring a video source; the video source comprises at least K +1 frames of images, and the positions of target frames are marked on the continuous K frames of images of the video source;
respectively inputting the positions of the target frames marked on the continuous K frames of images into a preset prediction network, and determining the prediction frame position of the target on the next frame image of the continuous K frames; the preset prediction network is obtained by training on the positions of sample target frames in sample images;
and according to the position of the prediction frame, intercepting an image corresponding to the position of the prediction frame on the next frame image of the continuous K frames, and inputting the intercepted image into a preset tracking model to obtain the tracking position of the target on the next frame image of the continuous K frames.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
amplifying the predicted frame position according to a preset proportion to obtain an amplified frame position;
and intercepting an image corresponding to the position of the amplification frame on the next frame image of the continuous K frames, and inputting the image corresponding to the position of the amplification frame into a preset tracking model.
In one embodiment, the preset ratio is greater than 1 and less than 2.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
according to the target frame position marked on the K-th frame image in the continuous K frames of the video source, intercepting the image corresponding to that marked target frame position on the K-th frame image, and inputting the intercepted image into the template branch network for feature extraction to obtain the features of the target;
and inputting the image corresponding to the prediction frame position intercepted from the next frame image of the continuous K frames into the detection branch network, and obtaining the tracking position of the target on the next frame image of the continuous K frames by combining the features of the target output by the template branch network.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
acquiring a sample video source; the sample video source comprises at least N +1 frames of sample images, and the positions of target frames of the sample images of the sample video source are marked;
and taking the continuous N frame images of the sample video source as the input of an initial prediction network, taking the prediction frame position of the target on the frame image following the continuous N frame images as the output of the initial prediction network, and training the initial prediction network to obtain the prediction network.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
inputting the continuous N frame images of the sample video source into an initial prediction network to obtain the prediction frame position of the target on the frame image following the continuous N frame images;
inputting the prediction frame position of the target on the frame image following the continuous N frame images, together with the target frame position marked on that frame image, into a preset loss function to obtain the value of the loss function;
and adjusting parameters of the initial prediction network according to the value of the loss function until the value of the loss function reaches a preset standard value, so as to obtain the prediction network.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
respectively inputting the positions of the target frames marked on the continuous K frames of images into the activation layer for processing to obtain activation information corresponding to each frame of image;
and respectively inputting the activation information corresponding to each frame of image into the prediction layer to carry out full-connection convolution processing, so as to obtain the position of a prediction frame of the target on the next frame of image of the continuous K frames.
In one embodiment, a readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring a video source; the video source comprises at least K +1 frames of images, and the positions of target frames are marked on the continuous K frames of images of the video source;
respectively inputting the positions of the target frames marked on the continuous K frames of images into a preset prediction network, and determining the prediction frame position of the target on the next frame image of the continuous K frames; the preset prediction network is obtained by training on the positions of sample target frames in sample images;
and according to the position of the prediction frame, intercepting an image corresponding to the position of the prediction frame on the next frame image of the continuous K frames, and inputting the intercepted image into a preset tracking model to obtain the tracking position of the target on the next frame image of the continuous K frames.
In one embodiment, the computer program when executed by the processor further performs the steps of:
amplifying the predicted frame position according to a preset proportion to obtain an amplified frame position;
and intercepting an image corresponding to the position of the amplification frame on the next frame image of the continuous K frames, and inputting the image corresponding to the position of the amplification frame into a preset tracking model.
In one embodiment, the preset ratio is greater than 1 and less than 2.
In one embodiment, the computer program when executed by the processor further performs the steps of:
according to the target frame position marked on the K-th frame image in the continuous K frames of the video source, intercepting the image corresponding to that marked target frame position on the K-th frame image, and inputting the intercepted image into the template branch network for feature extraction to obtain the features of the target;
and inputting the image corresponding to the prediction frame position intercepted from the next frame image of the continuous K frames into the detection branch network, and obtaining the tracking position of the target on the next frame image of the continuous K frames by combining the features of the target output by the template branch network.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring a sample video source; the sample video source comprises at least N +1 frames of sample images, and the positions of target frames of the sample images of the sample video source are marked;
and taking the continuous N frame images of the sample video source as the input of an initial prediction network, taking the prediction frame position of the target on the frame image following the continuous N frame images as the output of the initial prediction network, and training the initial prediction network to obtain the prediction network.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting the continuous N frame images of the sample video source into an initial prediction network to obtain the prediction frame position of the target on the frame image following the continuous N frame images;
inputting the prediction frame position of the target on the frame image following the continuous N frame images, together with the target frame position marked on that frame image, into a preset loss function to obtain the value of the loss function;
and adjusting parameters of the initial prediction network according to the value of the loss function until the value of the loss function reaches a preset standard value, so as to obtain the prediction network.
In one embodiment, the computer program when executed by the processor further performs the steps of:
respectively inputting the positions of the target frames marked on the continuous K frames of images into the activation layer for processing to obtain activation information corresponding to each frame of image;
and respectively inputting the activation information corresponding to each frame of image into the prediction layer to carry out full-connection convolution processing, so as to obtain the position of a prediction frame of the target on the next frame of image of the continuous K frames.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the invention patent. It should be noted that, for a person of ordinary skill in the art, several variations and improvements can be made without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method of target tracking, the method comprising:
acquiring a video source; the video source comprises at least K +1 frames of images, and the positions of target frames are marked on the continuous K frames of images of the video source;
respectively inputting the positions of the target frames marked on the continuous K frames of images into a preset prediction network, and determining the prediction frame position of the target on the next frame image of the continuous K frames; the preset prediction network is obtained by training on the positions of sample target frames in sample images;
and according to the position of the prediction frame, intercepting an image corresponding to the position of the prediction frame on the next frame image of the continuous K frames, and inputting the intercepted image into a preset tracking model to obtain the tracking position of the target on the next frame image of the continuous K frames.
2. The method according to claim 1, wherein the intercepting, according to the position of the prediction frame, an image corresponding to the position of the prediction frame on an image of a frame subsequent to the consecutive K frames, and inputting the intercepted image to a preset tracking model, includes:
amplifying the predicted frame position according to a preset proportion to obtain an amplified frame position;
and intercepting an image corresponding to the position of the amplification frame on the next frame image of the continuous K frames, and inputting the image corresponding to the position of the amplification frame into a preset tracking model.
3. The method according to claim 2, wherein the preset ratio is greater than 1 and less than 2.
4. The method according to claim 1, wherein the preset tracking model comprises a template branch network and a detection branch network, and the inputting the intercepted image into the preset tracking model to obtain the tracking position of the target on the image of the next frame of the consecutive K frames comprises:
according to the target frame position marked on the K-th frame image in the continuous K frames of the video source, intercepting the image corresponding to that marked target frame position on the K-th frame image, and inputting the intercepted image into the template branch network for feature extraction to obtain the features of the target;
and inputting the image corresponding to the prediction frame position intercepted from the next frame image of the continuous K frames into the detection branch network, and obtaining the tracking position of the target on the next frame image of the continuous K frames by combining the features of the target output by the template branch network.
5. The method according to any one of claims 1 to 4, wherein the method for training the preset prediction network comprises:
acquiring a sample video source; the sample video source comprises at least N +1 frames of sample images, and the positions of target frames of the sample images of the sample video source are marked;
and taking the continuous N frame images of the sample video source as the input of an initial prediction network, taking the prediction frame position of the target on the frame image following the continuous N frame images as the output of the initial prediction network, and training the initial prediction network to obtain the prediction network.
6. The method according to claim 5, wherein the training the initial prediction network with the N consecutive frame images of the sample video source as input of the initial prediction network and the predicted frame position of the target on the frame image subsequent to the N consecutive frame images as output of the initial prediction network to obtain the prediction network comprises:
inputting the continuous N frame images of the sample video source into an initial prediction network to obtain the prediction frame position of the target on the frame image following the continuous N frame images;
inputting the prediction frame position of the target on the frame image following the continuous N frame images, together with the target frame position marked on that frame image, into a preset loss function to obtain the value of the loss function;
and adjusting parameters of the initial prediction network according to the value of the loss function until the value of the loss function reaches a preset standard value, so as to obtain the prediction network.
7. The method according to claim 1, wherein the preset prediction network comprises an activation layer and a prediction layer, and the respectively inputting the target frame positions marked on the continuous K frame images into the preset prediction network and determining the prediction frame position of the target on the next frame image of the continuous K frames comprises:
respectively inputting the positions of the target frames marked on the continuous K frames of images into the activation layer for processing to obtain activation information corresponding to each frame of image;
and respectively inputting the activation information corresponding to each frame of image into the prediction layer to carry out full-connection convolution processing, so as to obtain the position of a prediction frame of the target on the next frame of image of the continuous K frames.
8. An object tracking apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a video source; the video source comprises at least K +1 frames of images, and the positions of target frames are marked on the continuous K frames of images of the video source;
the prediction module is used for respectively inputting the positions of the target frames marked on the continuous K frames of images into a preset prediction network and determining the prediction frame position of the target on the next frame image of the continuous K frames; the preset prediction network is obtained by training on the positions of sample target frames in sample images;
and the determining module is used for intercepting an image corresponding to the position of the prediction frame on the next frame image of the continuous K frames according to the position of the prediction frame, inputting the intercepted image to a preset tracking model, and obtaining the tracking position of the target on the next frame image of the continuous K frames.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN201911044271.6A, filed 2019-10-30 (priority date 2019-10-30): Target tracking method and device, computer equipment and storage medium. Status: Pending. Published as CN110796093A.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911044271.6A CN110796093A (en) 2019-10-30 2019-10-30 Target tracking method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110796093A (en) 2020-02-14

Family

ID=69442170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911044271.6A Pending CN110796093A (en) 2019-10-30 2019-10-30 Target tracking method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110796093A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107481270A (en) * 2017-08-10 2017-12-15 上海体育学院 Table tennis target following and trajectory predictions method, apparatus, storage medium and computer equipment
CN107767405A (en) * 2017-09-29 2018-03-06 华中科技大学 A kind of nuclear phase for merging convolutional neural networks closes filtered target tracking
CN109446889A (en) * 2018-09-10 2019-03-08 北京飞搜科技有限公司 Object tracking method and device based on twin matching network
CN109740742A (en) * 2019-01-14 2019-05-10 哈尔滨工程大学 A kind of method for tracking target based on LSTM neural network
CN110135314A (en) * 2019-05-07 2019-08-16 电子科技大学 A kind of multi-object tracking method based on depth Trajectory prediction

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
(Germany) MD. REZAUL KARIM: "Scala Machine Learning: Build real-world machine learning and deep learning projects", 1 October 2019 *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709301B (en) * 2020-05-21 2023-04-28 哈尔滨工业大学 Curling ball motion state estimation method
CN111709301A (en) * 2020-05-21 2020-09-25 哈尔滨工业大学 Method for estimating motion state of curling ball
CN111612827B (en) * 2020-05-21 2023-12-15 广州海格通信集团股份有限公司 Target position determining method and device based on multiple cameras and computer equipment
CN111612827A (en) * 2020-05-21 2020-09-01 广州海格通信集团股份有限公司 Target position determining method and device based on multiple cameras and computer equipment
CN111695737B (en) * 2020-06-15 2023-08-11 中国电子科技集团公司第二十八研究所 LSTM neural network-based group target traveling trend prediction method
CN111666922A (en) * 2020-07-02 2020-09-15 上海眼控科技股份有限公司 Video matching method and device, computer equipment and storage medium
CN111754541A (en) * 2020-07-29 2020-10-09 腾讯科技(深圳)有限公司 Target tracking method, device, equipment and readable storage medium
CN111754541B (en) * 2020-07-29 2023-09-19 腾讯科技(深圳)有限公司 Target tracking method, device, equipment and readable storage medium
CN111914809A (en) * 2020-08-19 2020-11-10 腾讯科技(深圳)有限公司 Target object positioning method, image processing method, device and computer equipment
CN112233171A (en) * 2020-09-03 2021-01-15 上海眼控科技股份有限公司 Target labeling quality inspection method and device, computer equipment and storage medium
CN112200827B (en) * 2020-09-09 2023-06-09 天津津航技术物理研究所 Far and near scene-based infrared image tracking algorithm evaluation method and platform
CN112200827A (en) * 2020-09-09 2021-01-08 天津津航技术物理研究所 Infrared image tracking algorithm evaluation method and platform based on far-distance and near-distance scenes
CN112257659B (en) * 2020-11-11 2024-04-05 四川云从天府人工智能科技有限公司 Detection tracking method, device and medium
CN112257659A (en) * 2020-11-11 2021-01-22 四川云从天府人工智能科技有限公司 Detection tracking method, apparatus and medium
CN112857746A (en) * 2020-12-29 2021-05-28 上海眼控科技股份有限公司 Tracking method and device of lamplight detector, electronic equipment and storage medium
CN112991397B (en) * 2021-04-19 2021-08-13 深圳佑驾创新科技有限公司 Traffic sign tracking method, apparatus, device and storage medium
CN112991397A (en) * 2021-04-19 2021-06-18 深圳佑驾创新科技有限公司 Traffic sign tracking method, apparatus, device and storage medium
CN113362371A (en) * 2021-05-18 2021-09-07 北京迈格威科技有限公司 Target tracking method and device, electronic equipment and storage medium
CN113283509A (en) * 2021-05-28 2021-08-20 深圳一清创新科技有限公司 Method for automatically labeling label, electronic equipment and storage medium
CN113283509B (en) * 2021-05-28 2024-03-29 深圳一清创新科技有限公司 Method for automatically labeling labels, electronic equipment and storage medium
CN113420707A (en) * 2021-07-05 2021-09-21 神思电子技术股份有限公司 Video target detection method based on weak supervised learning
CN113723375A (en) * 2021-11-02 2021-11-30 杭州魔点科技有限公司 Double-frame face tracking method and system based on feature extraction
CN113920172B (en) * 2021-12-14 2022-03-01 成都睿沿芯创科技有限公司 Target tracking method, device, equipment and storage medium
CN113920172A (en) * 2021-12-14 2022-01-11 成都睿沿芯创科技有限公司 Target tracking method, device, equipment and storage medium
CN117437635A (en) * 2023-12-21 2024-01-23 杭州海康慧影科技有限公司 Pre-labeling method and device for biological tissue image
CN117437635B (en) * 2023-12-21 2024-04-05 杭州海康慧影科技有限公司 Pre-labeling method and device for biological tissue image

Legal Events

Code Title / Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200214)