CN111428567A

CN111428567A - Pedestrian tracking system and method based on affine multi-task regression

Info

Publication number: CN111428567A
Application number: CN202010118387.6A
Authority: CN
Inventors: 谢英红; 韩晓微; 刘天惠; 涂斌斌; 唐璐
Original assignee: Shenyang University
Current assignee: Shenyang University
Priority date: 2020-02-26
Filing date: 2020-02-26
Publication date: 2020-07-17
Anticipated expiration: 2040-02-26
Also published as: CN111428567B

Abstract

The invention provides a pedestrian tracking system and method based on affine multi-task regression, and relates to the technical field of computer vision. The method determines that the last frame of a plurality of video frames comprises a target frame of a target object; determining a current target frame including the target object in the current frame according to the determined target frame; inputting the current target frame into a pre-trained first neural network, and acquiring a candidate feature map of the target frame in the image; inputting the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions; performing pooling operation on the characteristics of the target candidate regions to obtain a plurality of interested regions aiming at the target object; performing full-link operation on the features of the multiple regions of interest to distinguish a target from a background, thereby obtaining multiple tracking affine frames of the target object; and performing non-maximum suppression on the tracking affine frames to obtain a tracking result of the target object of the current frame.

Description

Pedestrian tracking system and method based on affine multi-task regression

Technical Field

The invention relates to the technical field of computer vision, in particular to a pedestrian tracking system and method based on affine multi-task regression.

Background

The pedestrian tracking technology is to recognize and track the pedestrian target on the picture in the video and the image through the computer vision technology. The pedestrian identification tracking project is regarded as a key research project by many countries, and the project is emphasized because the technology is advanced and has wide hunting, namely the technology can be used for battlefield detection, target tracking, accurate guidance and the like in the field of national defense and military, the technology can be used for intelligent traffic, violation detection, unmanned driving and the like in the field of urban traffic, and the technology can be used for people flow monitoring and the like in the field of social security.

The prior patent application CN108629791A provides a pedestrian tracking method and device and a cross-camera pedestrian tracking method and device. The pedestrian tracking method comprises the following steps: acquiring a video; performing pedestrian detection on at least part of video frames in the video to obtain a pedestrian frame in each of the at least part of video frames; for each pedestrian frame in all the obtained pedestrian frames, processing image blocks contained in the pedestrian frame by using a trained convolutional neural network to obtain a feature vector of the pedestrian frame; and matching all pedestrian frames based on the feature vector of each pedestrian frame in the all pedestrian frames to obtain a pedestrian tracking result, wherein the pedestrian tracking result comprises at least one pedestrian track. The method and the device are not limited by position information, have good robustness, can realize accurate and efficient pedestrian tracking, and can easily realize the pedestrian tracking across the cameras.

The method is characterized in that deformation of geometric and optical properties of an image can be well kept without deformation, when the image is processed under the existing Gamma normalization condition, the attitude floating range of pedestrians is large, most of fine actions do not affect the detection effect, and a pedestrian detection method of HOG and SVM is selected.

CN110414439A discloses an anti-blocking pedestrian tracking method based on multi-peak detection, which comprises the steps of firstly, carrying out pedestrian detection to obtain an initial position, carrying out initialization of tracker parameters and a pedestrian template, taking the position of a characteristic fusion response peak value as a pedestrian predicted position center in each subsequent frame, carrying out calculation of a target response peak value Fmax, an average peak value correlation energy APCE and a threshold value thereof, carrying out filter response multi-peak detection through a formed combined confidence coefficient, thereby realizing pedestrian blocking judgment, suspending updating of filter parameters and a pedestrian target template in a blocking frame, and realizing an anti-blocking pedestrian tracking task. The method selects the FHOG characteristic and the Color Naming characteristic to perform self-adaptive fusion as the characteristic descriptor, so that the robustness of the pedestrian tracking method on pedestrian deformation and illumination is improved; and the updating of the pedestrian template and the filter parameters is suspended in the pedestrian shielding frame, so that the problem of easy tracking position drift is solved.

CN108509859A discloses a pedestrian tracking method based on a deep neural network in a non-overlapping area, which comprises the following steps of (1) detecting a current pedestrian target in a monitoring video image by adopting a YO L O algorithm and segmenting a pedestrian target picture, (2) tracking and predicting a detection result by using a Kalman algorithm, (3) extracting a depth characteristic of a picture by using a convolutional neural network, wherein the picture comprises a candidate pedestrian picture and a target pedestrian picture in the step (2), and storing the picture of the candidate pedestrian and the characteristic thereof, and (4) calculating the similarity between the target pedestrian characteristic and the candidate pedestrian characteristic and sequencing the candidate pedestrian characteristic and identifying the target pedestrian.

However, the deep learning networks described above or other popular deep learning networks currently have no special solution for accurate positioning of deformed objects.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a pedestrian tracking system and method based on affine multi-task regression. By applying affine transformation to a deep learning network, accurate tracking of a deformed target is obtained.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows:

in one aspect, the invention provides a pedestrian tracking system based on affine multi-task regression, comprising a memory and a processor;

the memory is used for storing computer executable instructions;

the processor is operable to execute the executable instructions by determining that a previous frame of the plurality of video frames includes a target frame of a target object; determining a current target frame including the target object in the current frame according to the determined target frame; inputting the current target frame into a pre-trained first neural network, and acquiring a candidate feature map of the target frame in the image; inputting the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions; performing pooling operation on the characteristics of the target candidate regions to obtain a plurality of interested regions aiming at the target object; performing full-link operation on the features of the multiple regions of interest to distinguish a target from a background, thereby obtaining multiple tracking affine frames of the target object; and performing non-maximum suppression on the tracking affine frames to obtain a tracking result of the target object of the current frame.

On the other hand, the invention also provides a pedestrian tracking method based on affine multitask regression, which is realized by adopting the pedestrian tracking system based on affine multitask regression, and the method comprises the following steps:

step 1: determining that a first frame of the plurality of video frames includes a target frame of a target object;

step 2: determining a current target frame including the target object in the current frame according to the determined target frame;

and step 3: and adjusting the determined target frame into a fixed size, inputting the fixed size into a pre-trained first neural network, acquiring a candidate characteristic diagram of the target frame in the current frame, and designing a loss function.

The first neural network is a VGG-16 network;

the loss function of the VGG-16 network is expressed as:

wherein, α₁And α₂Is the learning rate.pIs a categorytcA logarithmic loss of whereinL _c（p,tc）=-logp _tc；

iThe number of the regression box indicating the loss being calculated;

tcthe representation is a category label, for example:tc=1 is a representation of the target,tc=0 represents background;

x，y，w，hand other variables, respectively, in abscissa/ordinate/width/height.

Parameter(s)v _i=（v _x， v _y， v _w， v _h) Is a real rectangular bounding box tuple comprising a central point abscissa, an ordinate, a width and a height;

the predicted target frame tuple comprises a central point abscissa, an ordinate, a width and a height;

u _i=（r1,r2,r3,r4,r5,r6) An affine parameter tuple of the real target area;

predicting an affine parameter tuple of the target area;

r1，r2，r3，r4，r5，r6) fixing values of six components of the structure for affine transformation of the real target region;

r1^*，r2^*，r3^*，r4^*，r5^*，r6^*) Predicting values of six components of the affine transformation fixed structure of the target area;

representing an affine bounding box parameter loss function;

representing a rectangular bounding box parametric loss function;

let (w，w*) To represent

Or

,

Definition ofComprises the following steps:

whereinxAre real numbers.

And 4, step 4: inputting the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions;

the second neural network is an RPN network.

And 5: performing pooling operation on the characteristics of the target candidate regions to obtain a plurality of interested regions aiming at the target object;

step 6: performing full-link operation on the features of the multiple regions of interest to distinguish a target from a background, thereby obtaining multiple tracking affine frames of the target object;

and 7: and carrying out non-maximum suppression on the tracking affine frames to obtain a tracking result of the target object of the current frame.

Step 7.1: scoring and comparing the characteristics corresponding to the tracking affine frames to obtain scores of target/background results of the compared target areas;

step 7.2: judging the score result larger than a certain threshold value as a target area, otherwise, judging the score result as a background area;

step 7.3: performing non-maximum suppression on the features of the target area to obtain a tracking result of the target object of the current frame;

and 8: and (3) judging whether the number of the next frame of the current image is less than the total frame number of the video, if not, directly finishing, if so, returning to the step (2), and tracking the next frame of the image until all the frames of the video are tracked.

Adopt the produced beneficial effect of above-mentioned technical scheme to lie in:

the method and the device utilize the affine transformation parameter information of the previous frame image to cut the current target image, reduce the search range and improve the algorithm efficiency. In addition, the cut image is input into the VGG-16 network to calculate the feature and then input into the RPN network, so that the repeated calculation of feature extraction is avoided, and the algorithm efficiency is improved. In the present application, the features output by the highest layer of the network are used as semantic models, and affine transformation results are used as spatial models, which form complementary advantages, because the features of the highest layer contain more semantic information and less spatial information. Furthermore, the above-described multitask loss function including affine transformation parameter regression optimizes network performance.

Drawings

FIG. 1 is a block diagram of an implementation of an embodiment of the invention using a computer architecture.

Fig. 2 is a flowchart of a pedestrian tracking algorithm according to an embodiment of the present invention.

FIG. 3 is a schematic block diagram of a process flow of an embodiment of the present invention.

Fig. 4 is a comparison graph of the effects of the horizontal NMS and the affine transformation NMS of the embodiment of the present invention.

FIG. 5 is a graph of the tracking results of the embodiment of the present invention.

Fig. 6 shows a network structure of VGG-16 according to an embodiment of the present invention.

Detailed Description

The following detailed description of embodiments of the invention refers to the accompanying drawings.

the memory is used for storing computer executable instructions;

As shown in fig. 1, a schematic diagram of an electronic system 600 suitable for use in implementing embodiments of the present disclosure is shown. The electronic system shown in fig. 1 is only an example, and should not bring any limitation to the function and the scope of use of the embodiments of the present disclosure.

As shown in fig. 1, electronic system 600 may include a processing device (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage device 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.

In general, input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc., output devices 607 including, for example, a liquid crystal display (L CD), speaker, vibrator, etc., storage devices 608 including, for example, magnetic tape, hard disk, etc., and communication devices 609 may allow electronic system 600 to communicate wirelessly or wiredly with other devices to exchange data.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of embodiments of the present disclosure.

It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

The computer-readable medium may be embodied in the electronic system (also referred to herein as an "affine multi-task regression-based pedestrian tracking system"); or may exist separately and not be assembled into the electronic system. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic system to: 1) determining that a previous frame of the plurality of video frames comprises a target frame of a target object; 2) determining a current target frame including the target object in the current frame according to the determined target frame; 3) inputting the current target frame into a pre-trained first neural network, and acquiring a candidate feature map of the target frame in the current frame; 4) inputting the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions; 5) Pooling features of the target candidate regions to obtain a plurality of regions of interest for the target object; 6) performing full-link operation on the features of the multiple regions of interest to distinguish a target from a background so as to obtain multiple tracking affine frames of the target object; and 7) carrying out non-maximum suppression on the plurality of tracking affine frames to obtain a tracking result of the target object of the current frame.

On the other hand, the invention also provides a pedestrian tracking method based on affine multitask regression, as shown in fig. 2, which is implemented by adopting the pedestrian tracking system based on affine multitask regression, and the method comprises the following steps:

the method comprises the steps of initializing the size of an original image, setting the size of the original image to be m × n (unit: pixel), manually marking the position of a target frame of the frame when t =1, marking the central position of the target frame as (cx, cy), wherein t represents the image of the t-th frame, t is a positive integer, cx and cy are the horizontal and vertical coordinates of the central position of the target frame respectively, and the target frame comprises an object to be tracked, such as the object 301 in FIG. 3.

Initializing affine transformation parameters:U ₁=[r1，r2，r3，r4，r5，r6]^T。

for example, assuming that two side lengths of a circumscribed rectangle of the target frame in the t-1 frame are marked as a, b, on the t-1 frame image, a picture of size (2 a) × (2 b), such as a rectangular frame marked as 302 in fig. 3, is cut out centering on the target center point (cx, cy) of the t-1 frame, in the present application, the purpose of centering on the center point of the target of the previous frame is to make the cut-out picture include target information, because the coordinates of the center point of the target of two adjacent frames do not change much, and as long as the coordinates of the center point of the target of two adjacent frames do not change much, the target to be tracked can be included in the cut-out sub-picture as long as the target of a sufficient size is cut out at a position near the target center point.

And adjusting the cut target frame into a fixed size, sending the fixed size into a pre-trained neural network, for example, into the VGG-16 network, and acquiring a feature map of the image after fifth layer convolution in the network, namely acquiring a candidate feature map of the target frame in the image. Such as indicated by reference numeral 303 in fig. 3.

The first neural network is a VGG-16 network, fig. 6 shows an exemplary VGG-16 network structure, fig. 6 shows that the network structure comprises 13 convolutional layers (201) and 3 fully connected layers (203), specifically, as shown in fig. 6, the convolutional layers are first constructed with a filter with 3 × and step size 1, assuming that the network input size is m × n × (m and n are positive integers), in order to ensure that the first two dimensions of the feature matrix after convolution are the same as the first two dimensions of the input matrix, i.e., m × n, one additional 0 turn is added to the input matrix, the dimension of the input matrix is changed to (m +2) × (n +2), and then 3 × is convolved, the first two dimensions of the feature matrix after convolution are still m × n, then one filter with 2 × 2, the largest pooling layer 202 is constructed with a filter with step 2, then three convolutions with 256 same filters, then three convolutions with three times, then three times of pooling, activates the pool functions with more than the current functions, i.e., ×, and finally activates the resulting functions of the object from the activation of the operation of the convolutional layers, i.e., the result of the activation of the operation of the corresponding to obtain 1000, i.e., the activation of the filter 357, i.e., the activation of.

The method includes constructing the VGG-16 network, training the VGG-16 network using an ImageNet dataset, the ImageNet dataset being divided into a training set and a testing set, the dataset corresponding to, for example, 1000 classes, each data having a corresponding label vector, each label vector corresponding to a different class, such as a target object or a background, the application not being concerned with the specific classification of the input image, but using the dataset to train the weights of the VGG-16 network, specifically adjusting the ImageNet training set to 224 × 224 × 3, then inputting the VGG-16 network to train the network to obtain information on the weight parameters of the layers or cells of the network, then inputting a predetermined testing dataset and label vectors of the corresponding class into the trained VGG-16 network structure, the size of the testing dataset can, for example, again be 224 × 224 × 3, the output of the VGG-16 network can be detected by inputting the testing dataset and label vectors of the corresponding class into the VGG-16 network, the results of the testing dataset can be adjusted to achieve, for example, the accuracy of the above test parameters can be adjusted by comparing the error of the predetermined G-16 network (e.g., error rate 98).

the feature map obtained from the neural network is input into an rpn (region pro-social network) network, and candidate regions for obtaining a plurality of targets, for example, 2000 candidate regions, are extracted. Such as that indicated by reference numeral 304 in fig. 3. The RPN is a network that generates a plurality of candidate areas of different sizes, unlike the VGG-16 network. The candidate region is a region of a plurality of shapes and positions where the target of the current frame may exist. According to the method, a plurality of regions which may exist in the algorithm are estimated in advance, and then optimization regression is carried out on the regions, so that more accurate tracking regions are screened out.

The second neural network is an RPN network.

the features of these candidate regions of different sizes are pooled to obtain a plurality of regions of interest (ROI) for the target object, here, a plurality of convolution kernels of different sizes, for example, three convolution kernels, respectively 7 × 7, 5 × 9 and 9 × 5, are designed in the pooling layer in consideration of the deformation of the target, for example, as shown by reference numeral 305 in fig. 3. a plurality of different pooling kernels may primarily describe the deformation of the target, for example, 7 × 7, 5 × 9 may describe the person standing under different cameras, 9 × 5 may describe the action of bending the person, etc., of course, different size pooling kernels may be designed according to different application scenarios.

the result of the pooling, i.e. the features of the multiple regions of interest (ROIs), is subjected to a full join operation. Here, the full linking operation is to concatenate a plurality of ROI features in sequence. Such as that indicated by reference numeral 306 in fig. 3. Then, the series-connected features are subjected to score comparison by using a softmax function, and the score of the target/background result of the compared target area is obtained. For example, a region with a score greater than a certain threshold is determined as a target region, otherwise, the region is a background region.

step 7.2: judging the score result larger than a certain threshold value as a target area, otherwise, judging the score result as a background area; and

step 7.3: and performing non-maximum suppression on the features of the determined target area to obtain the tracking result of the target object of the current frame.

The obtained affine area determined as the target area is subjected to non-local maximum suppression (for example, as indicated by reference numeral 308 in fig. 3), and a tracking result of the t-frame image, that is, a corresponding affine parameter and a frame are obtained. Such as that shown at reference numeral 309 in fig. 3. In one embodiment, the multiple tracked affine frames may be compared with a reference target frame (i.e., a target frame tracked in a previous frame), and an affine tracking frame with a largest overlapping area is obtained as a final tracking result. The specific algorithm is described below.

Alternatively, the loss and regression need to be calculated first, optimizing the affine transformation parameters. The loss function design for the entire network of VGG-16 described above can be expressed, for example, as:

the loss function of the VGG-16 network is expressed as:

（1）

wherein, α₁And α₂Is the learning rate.pIs a categorytcThe logarithmic loss of (c) is shown in equation (2).

L _c（p,tc）=-logp _tc（2）

iThe number of the regression box indicating the loss being calculated;

u _i=（r1,r2,r3,r4,r5,r6) An affine parameter tuple of the real target area;

predicting an affine parameter tuple of the target area;

representing an affine bounding box parameter loss function;

representing a rectangular bounding box parametric loss function;

let (w，w*) To represent

Or

,

Is defined as:

(3)

(4)

whereinxAre real numbers.

Affine transformation is used herein to represent the object geometric deformation. First, thetThe affine transformation parameters of the tracking result of the target region of the frame are writtenU _tThe structure is as follows:U _t =[r1,r2,r3,r4,r5,r6]^T. Corresponding affine transformation matrix

The utility model has the advantages of having a plum group structure,ga(2) is corresponding to affine lie groupGA(2) Lie algebra, matrix ofG _j（

) Is thatGA(2) Generator and matrix ofga(2) The group (2) of (a). For matrixGA(2) The generating element of (1) is:

(5)

for the lie group matrix, the riemann distance is defined as the matrix logarithm:

(6)

where X and Y are elements of the lie group matrix, giving a symmetric positive definite matrix of N

The inner mean of (d) defines:

(7)

wherein

，qIs a constant;

and carrying out non-maximum suppression on the tracking affine frames to obtain a tracking result of the t frame image. A plurality of different target areas can be obtained through regression, and in order to obtain a detection algorithm with the highest accuracy correctly, an affine transformation non-maximum suppression method is adopted to screen out the final tracking result. In addition, the loss function is designed, the affine deformation of the target is taken into consideration, and the accuracy of predicting the position of the target is improved.

Current object detection methods, non-maxima suppression (NMS), are widely used as post-processing detection candidates. The method can estimate the axis alignment boundary box and the inclined boundary box, and can execute normal NMS on the axis alignment boundary box and also can execute inclined NMS on the affine transformation boundary box. In affine transformation non-maximum suppression, the computation of the conventional intersection (IoU) is modified to IoU between the two affine bounding boxes. The effect of the algorithm is shown in fig. 4. In fig. 4, each frame with the number 401 is a candidate frame before the suppression of the non-maximum value, the frame with the number 402 is a frame obtained after the suppression of the normal NMS, and the frame with the number 403 is a frame obtained by the suppression of the affine transformation non-maximum value according to the present application. It can be seen that the tracking frame obtained by the method is more accurate.

And 8: and (4) determining whether the number of the t +1 frames is less than the total frame number of the video, and if so, returning to the step 2 to track the t +1 th frame image. And ending the algorithm until all the video frames are tracked. The partial tracking result frames are shown as black frames indicated by

arrows

501, 502, 503, 504 in fig. 5.

According to the method and the device, the current target image is cut by using the affine transformation parameter information of the previous frame image, the search range is narrowed, and the algorithm efficiency is improved. In addition, the cut image is input into the VGG-16 network to calculate the feature and then input into the RPN network, so that the repeated calculation of feature extraction is avoided, and the algorithm efficiency is improved. In addition, during the pooling operation, convolution kernels with different sizes and shapes are applied to preliminarily simulate the deformation of the target, and the target position can be accurately extracted. In the present application, the features output by the highest layer of the network are used as semantic models, and affine transformation results are used as spatial models, which form complementary advantages, because the features of the highest layer contain more semantic information and less spatial information. Furthermore, the above-described multitask loss function including affine transformation parameter regression optimizes network performance.

In the above pedestrian tracking system, the first neural network is a VGG-16 network, and the second neural network is an RPN network.

In the above-mentioned pedestrian tracking system, the candidate regions obtained from the second neural network are regions of a plurality of shapes and positions where the target object in the current frame exists, and furthermore, the step 5 pools the features of the plurality of target candidate regions by a plurality of convolution kernels of different sizes to obtain a plurality of regions of interest for the target object.

The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is made without departing from the inventive concept as defined above. For example, the above features and (but not limited to) technical features with similar functions disclosed in the embodiments of the present disclosure are mutually replaced to form the technical solution.

Claims

1. A pedestrian tracking system based on affine multitask regression is characterized in that: comprising a memory and a processor;

the memory is used for storing computer executable instructions;

2. An affine multitask regression-based pedestrian tracking method realized by the affine multitask regression-based pedestrian tracking system according to claim 1, characterized by comprising the following steps of:

and step 3: adjusting the determined target frame into a fixed size, inputting the fixed size into a pre-trained first neural network, acquiring a candidate characteristic diagram of the target frame in the current frame, and designing a loss function;

and 7: performing non-maximum suppression on the tracking affine frames to obtain a tracking result of the target object of the current frame;

3. The pedestrian tracking method based on affine multitask regression, characterized in that the first neural network is a VGG-16 network, and the second neural network is an RPN network.

4. The pedestrian tracking method based on affine multitask regression as claimed in claim 2, wherein said loss function in step 3 is expressed as: