CN111428567A - Pedestrian tracking system and method based on affine multi-task regression - Google Patents

Pedestrian tracking system and method based on affine multi-task regression

Info

Publication number
CN111428567A
CN111428567A
Authority
CN
China
Prior art keywords
target
frame
affine
tracking
target object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010118387.6A
Other languages
Chinese (zh)
Other versions
CN111428567B (en)
Inventor
谢英红
韩晓微
刘天惠
涂斌斌
唐璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang University
Original Assignee
Shenyang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang University filed Critical Shenyang University
Priority to CN202010118387.6A priority Critical patent/CN111428567B/en
Publication of CN111428567A publication Critical patent/CN111428567A/en
Application granted granted Critical
Publication of CN111428567B publication Critical patent/CN111428567B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/02 Affine transformations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a pedestrian tracking system and method based on affine multi-task regression, and relates to the technical field of computer vision. The method determines that the previous frame of a plurality of video frames includes a target frame of a target object; determines, according to the determined target frame, a current target frame including the target object in the current frame; inputs the current target frame into a pre-trained first neural network to obtain a candidate feature map of the target frame in the image; inputs the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions; pools the features of the target candidate regions to obtain a plurality of regions of interest for the target object; performs a full-link operation on the features of the regions of interest to distinguish the target from the background, thereby obtaining a plurality of tracking affine frames of the target object; and performs non-maximum suppression on the tracking affine frames to obtain a tracking result of the target object in the current frame.

Description

Pedestrian tracking system and method based on affine multi-task regression
Technical Field
The invention relates to the technical field of computer vision, in particular to a pedestrian tracking system and method based on affine multi-task regression.
Background
Pedestrian tracking technology recognizes and tracks pedestrian targets in videos and images through computer vision. Pedestrian recognition and tracking is regarded as a key research topic by many countries because the technology is advanced and has wide-ranging applications: in national defense it can be used for battlefield reconnaissance, target tracking and precision guidance; in urban traffic it can be used for intelligent transportation, violation detection and autonomous driving; and in public security it can be used for crowd-flow monitoring and the like.
The prior patent application CN108629791A provides a pedestrian tracking method and device and a cross-camera pedestrian tracking method and device. The pedestrian tracking method comprises: acquiring a video; performing pedestrian detection on at least some of the video frames to obtain a pedestrian frame in each of them; for each of the obtained pedestrian frames, processing the image block contained in the frame with a trained convolutional neural network to obtain a feature vector of the pedestrian frame; and matching all pedestrian frames based on their feature vectors to obtain a pedestrian tracking result comprising at least one pedestrian track. The method and device are not limited by position information, have good robustness, achieve accurate and efficient pedestrian tracking, and can easily be extended to tracking across cameras.
Another class of methods is characterized by HOG features, which maintain good invariance to both geometric and photometric deformations of the image; under the existing Gamma normalization, the posture of pedestrians may vary over a wide range while most fine movements do not affect the detection effect, so a pedestrian detection method based on HOG and SVM is selected in such approaches.
CN110414439A discloses an anti-occlusion pedestrian tracking method based on multi-peak detection. Pedestrian detection first yields an initial position, and the tracker parameters and pedestrian template are initialized; in each subsequent frame, the position of the feature-fusion response peak is taken as the center of the predicted pedestrian position; the target response peak Fmax, the average peak-to-correlation energy (APCE) and their thresholds are computed, and multi-peak detection of the filter response is performed through the resulting combined confidence, so that pedestrian occlusion can be judged; updating of the filter parameters and the pedestrian target template is suspended in occluded frames, realizing anti-occlusion pedestrian tracking. The method adaptively fuses the FHOG feature and the Color Naming feature as the feature descriptor, which improves the robustness of the tracker to pedestrian deformation and illumination; suspending the template and filter updates in occluded frames alleviates the problem of tracking-position drift.
CN108509859A discloses a pedestrian tracking method based on a deep neural network in non-overlapping areas, comprising the following steps: (1) detecting the current pedestrian target in the surveillance video image with a YOLO algorithm and segmenting the pedestrian target picture; (2) tracking and predicting the detection result with a Kalman algorithm; (3) extracting deep features of the pictures, including the candidate pedestrian pictures and the target pedestrian picture from step (2), with a convolutional neural network, and storing the candidate pedestrian pictures and their features; and (4) computing the similarity between the target pedestrian features and the candidate pedestrian features, ranking the candidates, and identifying the target pedestrian.
However, the deep learning networks described above or other popular deep learning networks currently have no special solution for accurate positioning of deformed objects.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a pedestrian tracking system and method based on affine multi-task regression. By applying affine transformation to a deep learning network, accurate tracking of a deformed target is obtained.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
in one aspect, the invention provides a pedestrian tracking system based on affine multi-task regression, comprising a memory and a processor;
the memory is used for storing computer executable instructions;
the processor is operable to execute the executable instructions to: determine that a previous frame of the plurality of video frames includes a target frame of a target object; determine, according to the determined target frame, a current target frame including the target object in the current frame; input the current target frame into a pre-trained first neural network to obtain a candidate feature map of the target frame in the image; input the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions; pool the features of the target candidate regions to obtain a plurality of regions of interest for the target object; perform a full-link operation on the features of the regions of interest to distinguish the target from the background, thereby obtaining a plurality of tracking affine frames of the target object; and perform non-maximum suppression on the tracking affine frames to obtain a tracking result of the target object in the current frame.
On the other hand, the invention also provides a pedestrian tracking method based on affine multitask regression, which is realized by adopting the pedestrian tracking system based on affine multitask regression, and the method comprises the following steps:
step 1: determining that a first frame of the plurality of video frames includes a target frame of a target object;
step 2: determining a current target frame including the target object in the current frame according to the determined target frame;
Step 3: adjusting the determined target frame to a fixed size, inputting it into a pre-trained first neural network, acquiring a candidate feature map of the target frame in the current frame, and designing a loss function.
The first neural network is a VGG-16 network;
the loss function of the VGG-16 network is expressed as:
L(p, tc, v_i, v_i*, u_i, u_i*) = L_c(p, tc) + α1 · tc · L_reg(v_i, v_i*) + α2 · tc · L_aff(u_i, u_i*)

wherein α1 and α2 are learning rates; p is the predicted class probability and L_c(p, tc) = -log p_tc is the logarithmic loss of category tc;
i is the index of the regression box whose loss is being calculated;
tc is the category label, for example tc = 1 represents the target and tc = 0 represents the background;
the subscripts x, y, w and h denote the abscissa, ordinate, width and height respectively;
the parameter v_i = (v_x, v_y, v_w, v_h) is the real rectangular bounding box tuple, comprising the center-point abscissa, ordinate, width and height;
v_i* = (v_x*, v_y*, v_w*, v_h*) is the predicted target frame tuple, comprising the center-point abscissa, ordinate, width and height;
u_i = (r1, r2, r3, r4, r5, r6) is the affine parameter tuple of the real target region;
u_i* = (r1*, r2*, r3*, r4*, r5*, r6*) is the affine parameter tuple of the predicted target region;
(r1, r2, r3, r4, r5, r6) are the values of the six components of the fixed affine transformation structure of the real target region, and (r1*, r2*, r3*, r4*, r5*, r6*) are the corresponding values for the predicted target region;
L_aff represents the affine bounding box parameter loss function;
L_reg represents the rectangular bounding box parameter loss function;
letting (w, w*) represent either (v_i, v_i*) or (u_i, u_i*), both regression losses are defined as:

L(w, w*) = Σ_j smooth_L1(w_j - w_j*)

smooth_L1(x) = 0.5 x^2 if |x| < 1, and |x| - 0.5 otherwise,

where x is a real number.
Step 4: inputting the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions;
the second neural network is an RPN network.
Step 5: performing a pooling operation on the features of the target candidate regions to obtain a plurality of regions of interest for the target object;
step 6: performing full-link operation on the features of the multiple regions of interest to distinguish a target from a background, thereby obtaining multiple tracking affine frames of the target object;
Step 7: carrying out non-maximum suppression on the tracking affine frames to obtain a tracking result of the target object of the current frame.
Step 7.1: scoring the features corresponding to the tracking affine frames and comparing the scores, so as to obtain a target/background score for each candidate target area;
Step 7.2: judging an area whose score exceeds a certain threshold to be a target area, and otherwise a background area;
Step 7.3: performing non-maximum suppression on the features of the target areas to obtain a tracking result of the target object of the current frame;
and 8: and (3) judging whether the number of the next frame of the current image is less than the total frame number of the video, if not, directly finishing, if so, returning to the step (2), and tracking the next frame of the image until all the frames of the video are tracked.
The beneficial effects produced by the above technical solution are as follows:
the method and the device utilize the affine transformation parameter information of the previous frame image to cut the current target image, reduce the search range and improve the algorithm efficiency. In addition, the cut image is input into the VGG-16 network to calculate the feature and then input into the RPN network, so that the repeated calculation of feature extraction is avoided, and the algorithm efficiency is improved. In the present application, the features output by the highest layer of the network are used as semantic models, and affine transformation results are used as spatial models, which form complementary advantages, because the features of the highest layer contain more semantic information and less spatial information. Furthermore, the above-described multitask loss function including affine transformation parameter regression optimizes network performance.
Drawings
FIG. 1 is a block diagram of an implementation of an embodiment of the invention using a computer architecture.
Fig. 2 is a flowchart of a pedestrian tracking algorithm according to an embodiment of the present invention.
FIG. 3 is a schematic block diagram of a process flow of an embodiment of the present invention.
Fig. 4 is a comparison graph of the effects of the horizontal NMS and the affine transformation NMS of the embodiment of the present invention.
FIG. 5 is a graph of the tracking results of the embodiment of the present invention.
Fig. 6 shows a network structure of VGG-16 according to an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings.
In one aspect, the invention provides a pedestrian tracking system based on affine multi-task regression, comprising a memory and a processor;
the memory is used for storing computer executable instructions;
the processor is operable to execute the executable instructions to: determine that a previous frame of the plurality of video frames includes a target frame of a target object; determine, according to the determined target frame, a current target frame including the target object in the current frame; input the current target frame into a pre-trained first neural network to obtain a candidate feature map of the target frame in the image; input the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions; pool the features of the target candidate regions to obtain a plurality of regions of interest for the target object; perform a full-link operation on the features of the regions of interest to distinguish the target from the background, thereby obtaining a plurality of tracking affine frames of the target object; and perform non-maximum suppression on the tracking affine frames to obtain a tracking result of the target object in the current frame.
Fig. 1 shows a schematic diagram of an electronic system 600 suitable for implementing embodiments of the present disclosure. The electronic system shown in fig. 1 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present disclosure.
As shown in fig. 1, electronic system 600 may include a processing device (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage device 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
In general, the electronic system 600 may include input devices 606 such as a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer and gyroscope; output devices 607 such as a liquid crystal display (LCD), speaker and vibrator; storage devices 608 such as magnetic tape and hard disk; and communication devices 609, which may allow the electronic system 600 to communicate wirelessly or by wire with other devices to exchange data.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer-readable medium may be embodied in the electronic system (also referred to herein as an "affine multi-task regression-based pedestrian tracking system"), or it may exist separately without being assembled into the electronic system. The computer readable medium carries one or more programs which, when executed by the electronic system, cause the electronic system to: 1) determine that a previous frame of the plurality of video frames includes a target frame of a target object; 2) determine, according to the determined target frame, a current target frame including the target object in the current frame; 3) input the current target frame into a pre-trained first neural network to obtain a candidate feature map of the target frame in the current frame; 4) input the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions; 5) pool the features of the target candidate regions to obtain a plurality of regions of interest for the target object; 6) perform a full-link operation on the features of the regions of interest to distinguish the target from the background, thereby obtaining a plurality of tracking affine frames of the target object; and 7) perform non-maximum suppression on the tracking affine frames to obtain a tracking result of the target object in the current frame.
On the other hand, the invention also provides a pedestrian tracking method based on affine multitask regression, as shown in fig. 2, which is implemented by adopting the pedestrian tracking system based on affine multitask regression, and the method comprises the following steps:
step 1: determining that a first frame of the plurality of video frames includes a target frame of a target object;
the method comprises the steps of initializing the size of an original image, setting the size of the original image to be m × n (unit: pixel), manually marking the position of a target frame of the frame when t =1, marking the central position of the target frame as (cx, cy), wherein t represents the image of the t-th frame, t is a positive integer, cx and cy are the horizontal and vertical coordinates of the central position of the target frame respectively, and the target frame comprises an object to be tracked, such as the object 301 in FIG. 3.
The affine transformation parameters are initialized as U_1 = [r1, r2, r3, r4, r5, r6]^T.
step 2: determining a current target frame including the target object in the current frame according to the determined target frame;
for example, assuming that two side lengths of a circumscribed rectangle of the target frame in the t-1 frame are marked as a, b, on the t-1 frame image, a picture of size (2 a) × (2 b), such as a rectangular frame marked as 302 in fig. 3, is cut out centering on the target center point (cx, cy) of the t-1 frame, in the present application, the purpose of centering on the center point of the target of the previous frame is to make the cut-out picture include target information, because the coordinates of the center point of the target of two adjacent frames do not change much, and as long as the coordinates of the center point of the target of two adjacent frames do not change much, the target to be tracked can be included in the cut-out sub-picture as long as the target of a sufficient size is cut out at a position near the target center point.
Step 3: adjusting the determined target frame to a fixed size, inputting it into a pre-trained first neural network, acquiring a candidate feature map of the target frame in the current frame, and designing a loss function.
The cropped target frame is adjusted to a fixed size and sent into a pre-trained neural network, for example the VGG-16 network, and the feature map of the image after the fifth convolution stage of the network is taken as the candidate feature map of the target frame in the image, as indicated by reference numeral 303 in fig. 3.
The first neural network is a VGG-16 network. Fig. 6 shows an exemplary VGG-16 network structure comprising 13 convolutional layers (201) and 3 fully connected layers (203). As shown in fig. 6, the convolutional layers use 3 × 3 filters with a stride of 1. Assuming the network input size is m × n × 3 (m and n being positive integers), the input matrix is padded with an additional ring of zeros, changing its first two dimensions to (m + 2) × (n + 2), so that after the 3 × 3 convolution the first two dimensions of the feature matrix remain m × n. A max-pooling layer 202 is then constructed with a 2 × 2 filter and a stride of 2, halving the spatial dimensions. Blocks of stacked convolutions, with more filters in the deeper blocks (for example 256), alternate with max pooling in this way, and each convolution is followed by an activation function. Finally, the resulting features are passed through the three fully connected layers and a softmax activation to produce the 1000-class output.
The method includes constructing the VGG-16 network and training it with the ImageNet dataset, which is divided into a training set and a test set. The dataset corresponds to, for example, 1000 classes, and each data item has a corresponding label vector, each label vector corresponding to a different class such as a target object or a background. The present application is not concerned with the specific classification of the input image, but uses the dataset to train the weights of the VGG-16 network. Specifically, the ImageNet training images are adjusted to 224 × 224 × 3 and input into the VGG-16 network to learn the weight parameters of its layers or cells. A predetermined test dataset, whose images may again be of size 224 × 224 × 3, together with the label vectors of the corresponding classes, is then input into the trained VGG-16 network structure; by comparing the output of the network with the labels, the error of the network can be detected and the parameters adjusted until a predetermined accuracy is reached (for example, 98%).
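A minimal PyTorch sketch of this feature-extraction step; using torchvision's stock VGG-16 as a stand-in for the network trained as described above is an assumption of the sketch, not the exact network of fig. 6:

```python
import torch
import torchvision

# Stock VGG-16 (13 conv layers, 3 fully connected layers). In the method above,
# the weights would come from ImageNet pre-training plus the fine-tuning step.
vgg16 = torchvision.models.vgg16(weights=None).eval()  # or pretrained=False on older torchvision
backbone = vgg16.features  # convolutional part only; its output is the candidate feature map

def extract_feature_map(patch):
    """patch: (1, 3, 224, 224) float tensor holding the resized target frame."""
    with torch.no_grad():
        return backbone(patch)  # (1, 512, 7, 7) for a 224 x 224 input

# usage:
# fmap = extract_feature_map(torch.randn(1, 3, 224, 224))
```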
Step 4: inputting the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions;
the feature map obtained from the neural network is input into an rpn (region pro-social network) network, and candidate regions for obtaining a plurality of targets, for example, 2000 candidate regions, are extracted. Such as that indicated by reference numeral 304 in fig. 3. The RPN is a network that generates a plurality of candidate areas of different sizes, unlike the VGG-16 network. The candidate region is a region of a plurality of shapes and positions where the target of the current frame may exist. According to the method, a plurality of regions which may exist in the algorithm are estimated in advance, and then optimization regression is carried out on the regions, so that more accurate tracking regions are screened out.
The second neural network is an RPN network.
Step 5: performing a pooling operation on the features of the target candidate regions to obtain a plurality of regions of interest for the target object;
the features of these candidate regions of different sizes are pooled to obtain a plurality of regions of interest (ROI) for the target object, here, a plurality of convolution kernels of different sizes, for example, three convolution kernels, respectively 7 × 7, 5 × 9 and 9 × 5, are designed in the pooling layer in consideration of the deformation of the target, for example, as shown by reference numeral 305 in fig. 3. a plurality of different pooling kernels may primarily describe the deformation of the target, for example, 7 × 7, 5 × 9 may describe the person standing under different cameras, 9 × 5 may describe the action of bending the person, etc., of course, different size pooling kernels may be designed according to different application scenarios.
Step 6: performing full-link operation on the features of the multiple regions of interest to distinguish a target from a background, thereby obtaining multiple tracking affine frames of the target object;
the result of the pooling, i.e. the features of the multiple regions of interest (ROIs), is subjected to a full join operation. Here, the full linking operation is to concatenate a plurality of ROI features in sequence. Such as that indicated by reference numeral 306 in fig. 3. Then, the series-connected features are subjected to score comparison by using a softmax function, and the score of the target/background result of the compared target area is obtained. For example, a region with a score greater than a certain threshold is determined as a target region, otherwise, the region is a background region.
Step 7: carrying out non-maximum suppression on the tracking affine frames to obtain a tracking result of the target object of the current frame.
Step 7.1: scoring the features corresponding to the tracking affine frames and comparing the scores, so as to obtain a target/background score for each candidate target area;
Step 7.2: judging an area whose score exceeds a certain threshold to be a target area, and otherwise a background area; and
step 7.3: and performing non-maximum suppression on the features of the determined target area to obtain the tracking result of the target object of the current frame.
The affine areas determined to be target areas are subjected to non-maximum suppression (for example, as indicated by reference numeral 308 in fig. 3), and the tracking result of the frame-t image, that is, the corresponding affine parameters and frame, is obtained, as shown at reference numeral 309 in fig. 3. In one embodiment, the multiple tracked affine frames may be compared with a reference target frame (i.e., the target frame tracked in the previous frame), and the affine tracking frame with the largest overlapping area is taken as the final tracking result. The specific algorithm is described below.
In addition, the loss needs to be calculated and regression performed first, in order to optimize the affine transformation parameters. The loss function of the entire VGG-16-based network described above can be expressed, for example, as follows.
the loss function of the VGG-16 network is expressed as:
L(p, tc, v_i, v_i*, u_i, u_i*) = L_c(p, tc) + α1 · tc · L_reg(v_i, v_i*) + α2 · tc · L_aff(u_i, u_i*) (1)

wherein α1 and α2 are learning rates; p is the predicted class probability and L_c is the logarithmic loss of category tc, shown in equation (2):

L_c(p, tc) = -log p_tc (2)

i is the index of the regression box whose loss is being calculated;
tc is the category label, for example tc = 1 represents the target and tc = 0 represents the background;
the subscripts x, y, w and h denote the abscissa, ordinate, width and height respectively;
the parameter v_i = (v_x, v_y, v_w, v_h) is the real rectangular bounding box tuple, comprising the center-point abscissa, ordinate, width and height;
v_i* = (v_x*, v_y*, v_w*, v_h*) is the predicted target frame tuple, comprising the center-point abscissa, ordinate, width and height;
u_i = (r1, r2, r3, r4, r5, r6) is the affine parameter tuple of the real target region;
u_i* = (r1*, r2*, r3*, r4*, r5*, r6*) is the affine parameter tuple of the predicted target region;
(r1, r2, r3, r4, r5, r6) are the values of the six components of the fixed affine transformation structure of the real target region, and (r1*, r2*, r3*, r4*, r5*, r6*) are the corresponding values for the predicted target region;
L_aff represents the affine bounding box parameter loss function;
L_reg represents the rectangular bounding box parameter loss function;
letting (w, w*) represent either (v_i, v_i*) or (u_i, u_i*), both regression losses are defined as:

L(w, w*) = Σ_j smooth_L1(w_j - w_j*) (3)

smooth_L1(x) = 0.5 x^2 if |x| < 1, and |x| - 0.5 otherwise (4)

where x is a real number.
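A minimal sketch of the multi-task loss of equations (1)-(4), with the α1/α2 weighting and the gating on the label tc following the definitions above; the batch-wise tensor shapes and the final averaging over the batch are assumptions of the sketch:

```python
import torch
import torch.nn.functional as F

def smooth_l1(x):
    """Equations (3)-(4): 0.5 * x^2 where |x| < 1, otherwise |x| - 0.5, elementwise."""
    absx = x.abs()
    return torch.where(absx < 1, 0.5 * x ** 2, absx - 0.5)

def multitask_loss(p, tc, v, v_pred, u, u_pred, alpha1=1.0, alpha2=1.0):
    """Equation (1) for a batch of regression boxes.

    p:  (N, 2) class probabilities; tc: (N,) integer labels, 1 = target, 0 = background.
    v, v_pred: (N, 4) real / predicted rectangle tuples (x, y, w, h).
    u, u_pred: (N, 6) real / predicted affine parameter tuples (r1..r6).
    """
    cls_loss = F.nll_loss(torch.log(p + 1e-12), tc)              # L_c(p, tc) = -log p_tc
    gate = tc.float()                                             # regress only for target boxes
    reg_loss = (gate * smooth_l1(v - v_pred).sum(dim=1)).mean()  # L_reg, equation (3)
    aff_loss = (gate * smooth_l1(u - u_pred).sum(dim=1)).mean()  # L_aff, equation (3)
    return cls_loss + alpha1 * reg_loss + alpha2 * aff_loss
```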
Affine transformation is used herein to represent the geometric deformation of the target. The affine transformation parameters of the tracking result of the target region in frame t are written as U_t, with the structure U_t = [r1, r2, r3, r4, r5, r6]^T. The corresponding affine transformation matrix has a Lie group structure: ga(2) is the Lie algebra corresponding to the affine Lie group GA(2), and the matrices G_j (j = 1, ..., 6) are the generators of GA(2) and a basis of ga(2). The generators of GA(2) are given in equation (5).

For Lie group matrices, the Riemannian distance is defined via the matrix logarithm:

ρ(X, Y) = || log(X^-1 Y) || (6)

where X and Y are elements of the Lie group matrix. Given N symmetric positive definite matrices, their intrinsic mean is defined by equation (7), computed from the Riemannian distances of equation (6), in which q is a constant.
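A sketch of the Riemannian distance of equation (6) using SciPy's matrix logarithm; representing the affine transformations as 3 × 3 homogeneous matrices and using the Frobenius norm are assumptions of the sketch, and the cost function shown is only the sum-of-squared-distances objective that an intrinsic mean minimizes, not the iterative solver of equation (7):

```python
import numpy as np
from scipy.linalg import inv, logm

def riemann_distance(X, Y):
    """Equation (6): distance between two Lie group elements via the matrix logarithm."""
    return np.linalg.norm(logm(inv(X) @ Y), ord="fro")

def intrinsic_mean_cost(Y, samples):
    """Sum of squared Riemannian distances from Y to the sample matrices; an
    intrinsic mean is a minimizer of this cost over Y."""
    return sum(riemann_distance(Y, Yq) ** 2 for Yq in samples)

# usage (3x3 homogeneous affine matrices):
# A = np.array([[1.05, 0.02, 3.0], [0.01, 0.98, -1.0], [0.0, 0.0, 1.0]])
# print(riemann_distance(A, np.eye(3)))
```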
and carrying out non-maximum suppression on the tracking affine frames to obtain a tracking result of the t frame image. A plurality of different target areas can be obtained through regression, and in order to obtain a detection algorithm with the highest accuracy correctly, an affine transformation non-maximum suppression method is adopted to screen out the final tracking result. In addition, the loss function is designed, the affine deformation of the target is taken into consideration, and the accuracy of predicting the position of the target is improved.
In current object detection methods, non-maximum suppression (NMS) is widely used to post-process detection candidates. The present method can estimate both axis-aligned bounding boxes and inclined bounding boxes, and can perform normal NMS on the axis-aligned bounding boxes as well as inclined NMS on the affine-transformed bounding boxes. In affine transformation non-maximum suppression, the computation of the conventional intersection over union (IoU) is modified to the IoU between the two affine bounding boxes. The effect of the algorithm is shown in fig. 4: each frame numbered 401 is a candidate frame before non-maximum suppression, the frame numbered 402 is the frame obtained after normal NMS, and the frame numbered 403 is the frame obtained by the affine transformation non-maximum suppression of the present application. It can be seen that the tracking frame obtained by the present method is more accurate.
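A sketch of the affine (inclined) non-maximum suppression, with the IoU computed between affine bounding boxes represented as quadrilaterals; using Shapely for the polygon intersection is an implementation choice of this sketch, not something specified in the patent:

```python
import numpy as np
from shapely.geometry import Polygon

def affine_iou(quad_a, quad_b):
    """IoU between two affine frames given as 4 x 2 arrays of corner points."""
    pa, pb = Polygon(quad_a), Polygon(quad_b)
    inter = pa.intersection(pb).area
    union = pa.area + pb.area - inter
    return inter / union if union > 0 else 0.0

def affine_nms(quads, scores, iou_threshold=0.5):
    """Keep the highest-scoring affine frames and suppress overlapping ones."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if affine_iou(quads[best], quads[i]) <= iou_threshold]
    return keep

# usage:
# quads = [np.array([[0, 0], [40, 5], [35, 90], [-5, 85]]),
#          np.array([[2, 3], [42, 8], [37, 93], [-3, 88]])]
# keep = affine_nms(quads, scores=np.array([0.9, 0.8]))
```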
Step 8: determining whether the frame index t + 1 is less than the total number of frames in the video, and if so, returning to step 2 to track the (t + 1)-th frame image. The algorithm ends once all video frames have been tracked. Some of the resulting tracking frames are shown as the black frames indicated by arrows 501, 502, 503 and 504 in fig. 5.
According to the method and the device, the current target image is cut by using the affine transformation parameter information of the previous frame image, the search range is narrowed, and the algorithm efficiency is improved. In addition, the cut image is input into the VGG-16 network to calculate the feature and then input into the RPN network, so that the repeated calculation of feature extraction is avoided, and the algorithm efficiency is improved. In addition, during the pooling operation, convolution kernels with different sizes and shapes are applied to preliminarily simulate the deformation of the target, and the target position can be accurately extracted. In the present application, the features output by the highest layer of the network are used as semantic models, and affine transformation results are used as spatial models, which form complementary advantages, because the features of the highest layer contain more semantic information and less spatial information. Furthermore, the above-described multitask loss function including affine transformation parameter regression optimizes network performance.
In the above pedestrian tracking system, the first neural network is a VGG-16 network, and the second neural network is an RPN network.
In the above-mentioned pedestrian tracking system, the candidate regions obtained from the second neural network are regions of a plurality of shapes and positions where the target object in the current frame exists, and furthermore, the step 5 pools the features of the plurality of target candidate regions by a plurality of convolution kernels of different sizes to obtain a plurality of regions of interest for the target object.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is made without departing from the inventive concept defined above, for example technical solutions formed by replacing the above features with (but not limited to) technical features of similar function disclosed in the embodiments of the present disclosure.

Claims (5)

1. A pedestrian tracking system based on affine multitask regression is characterized in that: comprising a memory and a processor;
the memory is used for storing computer executable instructions;
the processor is operable to execute the executable instructions to: determine that a previous frame of the plurality of video frames includes a target frame of a target object; determine, according to the determined target frame, a current target frame including the target object in the current frame; input the current target frame into a pre-trained first neural network to obtain a candidate feature map of the target frame in the image; input the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions; pool the features of the target candidate regions to obtain a plurality of regions of interest for the target object; perform a full-link operation on the features of the regions of interest to distinguish the target from the background, thereby obtaining a plurality of tracking affine frames of the target object; and perform non-maximum suppression on the tracking affine frames to obtain a tracking result of the target object in the current frame.
2. An affine multitask regression-based pedestrian tracking method realized by the affine multitask regression-based pedestrian tracking system according to claim 1, characterized by comprising the following steps of:
step 1: determining that a first frame of the plurality of video frames includes a target frame of a target object;
step 2: determining a current target frame including the target object in the current frame according to the determined target frame;
step 3: adjusting the determined target frame to a fixed size, inputting it into a pre-trained first neural network, acquiring a candidate feature map of the target frame in the current frame, and designing a loss function;
step 4: inputting the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions;
step 5: performing a pooling operation on the features of the target candidate regions to obtain a plurality of regions of interest for the target object;
step 6: performing full-link operation on the features of the multiple regions of interest to distinguish a target from a background, thereby obtaining multiple tracking affine frames of the target object;
step 7: performing non-maximum suppression on the tracking affine frames to obtain a tracking result of the target object of the current frame;
step 8: judging whether the index of the next frame of the current image is less than the total number of frames in the video; if not, ending directly; if so, returning to step 2 and tracking the next frame of the image, until all the frames of the video are tracked.
3. The pedestrian tracking method based on affine multitask regression according to claim 2, characterized in that the first neural network is a VGG-16 network, and the second neural network is an RPN network.
4. The pedestrian tracking method based on affine multitask regression as claimed in claim 2, wherein said loss function in step 3 is expressed as:
L(p, tc, v_i, v_i*, u_i, u_i*) = L_c(p, tc) + α1 · tc · L_reg(v_i, v_i*) + α2 · tc · L_aff(u_i, u_i*)

wherein α1 and α2 are learning rates;
p is the predicted class probability and L_c(p, tc) = -log p_tc is the logarithmic loss of category tc;
i is the index of the regression box whose loss is being calculated;
tc is the category label, for example tc = 1 represents the target and tc = 0 represents the background;
the subscripts x, y, w and h denote the abscissa, ordinate, width and height respectively;
the parameter v_i = (v_x, v_y, v_w, v_h) is the real rectangular bounding box tuple, comprising the center-point abscissa, ordinate, width and height;
v_i* = (v_x*, v_y*, v_w*, v_h*) is the predicted target frame tuple, comprising the center-point abscissa, ordinate, width and height;
u_i = (r1, r2, r3, r4, r5, r6) is the affine parameter tuple of the real target region;
u_i* = (r1*, r2*, r3*, r4*, r5*, r6*) is the affine parameter tuple of the predicted target region;
(r1, r2, r3, r4, r5, r6) are the values of the six components of the fixed affine transformation structure of the real target region, and (r1*, r2*, r3*, r4*, r5*, r6*) are the corresponding values for the predicted target region;
L_aff represents the affine bounding box parameter loss function;
L_reg represents the rectangular bounding box parameter loss function;
letting (w, w*) represent either (v_i, v_i*) or (u_i, u_i*), both regression losses are defined as:

L(w, w*) = Σ_j smooth_L1(w_j - w_j*)

smooth_L1(x) = 0.5 x^2 if |x| < 1, and |x| - 0.5 otherwise,

where x is a real number.
5. The pedestrian tracking method based on affine multi-task regression as claimed in claim 3, wherein said step 7 specifically comprises:
step 7.1: scoring the features corresponding to the tracking affine frames and comparing the scores, so as to obtain a target/background score for each candidate target area;
step 7.2: judging an area whose score exceeds a certain threshold to be a target area, and otherwise a background area; and
step 7.3: and performing non-maximum suppression on the features of the determined target area to obtain the tracking result of the target object of the current frame.
CN202010118387.6A 2020-02-26 2020-02-26 Pedestrian tracking system and method based on affine multitask regression Active CN111428567B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010118387.6A CN111428567B (en) 2020-02-26 2020-02-26 Pedestrian tracking system and method based on affine multitask regression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010118387.6A CN111428567B (en) 2020-02-26 2020-02-26 Pedestrian tracking system and method based on affine multitask regression

Publications (2)

Publication Number Publication Date
CN111428567A true CN111428567A (en) 2020-07-17
CN111428567B CN111428567B (en) 2024-02-02

Family

ID=71547182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010118387.6A Active CN111428567B (en) 2020-02-26 2020-02-26 Pedestrian tracking system and method based on affine multitask regression

Country Status (1)

Country Link
CN (1) CN111428567B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022037587A1 (en) * 2020-08-19 2022-02-24 Zhejiang Dahua Technology Co., Ltd. Methods and systems for video processing
WO2022133911A1 (en) * 2020-12-24 2022-06-30 深圳市大疆创新科技有限公司 Target detection method and apparatus, movable platform, and computer-readable storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103093480A (en) * 2013-01-15 2013-05-08 沈阳大学 Particle filtering video image tracking method based on dual model
CN105389832A (en) * 2015-11-20 2016-03-09 沈阳大学 Video object tracking method based on Grassmann manifold and projection group
CN106683091A (en) * 2017-01-06 2017-05-17 北京理工大学 Target classification and attitude detection method based on depth convolution neural network
US9946960B1 (en) * 2017-10-13 2018-04-17 StradVision, Inc. Method for acquiring bounding box corresponding to an object in an image by using convolutional neural network including tracking network and computing device using the same
CN108280855A (en) * 2018-01-13 2018-07-13 福州大学 A kind of insulator breakdown detection method based on Fast R-CNN
CN109255351A (en) * 2018-09-05 2019-01-22 华南理工大学 Bounding box homing method, system, equipment and medium based on Three dimensional convolution neural network
CN109961034A (en) * 2019-03-18 2019-07-02 西安电子科技大学 Video object detection method based on convolution gating cycle neural unit
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN110399808A (en) * 2019-07-05 2019-11-01 桂林安维科技有限公司 A kind of Human bodys' response method and system based on multiple target tracking
CN110458864A (en) * 2019-07-02 2019-11-15 南京邮电大学 Based on the method for tracking target and target tracker for integrating semantic knowledge and example aspects
CN110781350A (en) * 2019-09-26 2020-02-11 武汉大学 Pedestrian retrieval method and system oriented to full-picture monitoring scene

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103093480A (en) * 2013-01-15 2013-05-08 沈阳大学 Particle filtering video image tracking method based on dual model
CN105389832A (en) * 2015-11-20 2016-03-09 沈阳大学 Video object tracking method based on Grassmann manifold and projection group
CN106683091A (en) * 2017-01-06 2017-05-17 北京理工大学 Target classification and attitude detection method based on depth convolution neural network
US9946960B1 (en) * 2017-10-13 2018-04-17 StradVision, Inc. Method for acquiring bounding box corresponding to an object in an image by using convolutional neural network including tracking network and computing device using the same
CN108280855A (en) * 2018-01-13 2018-07-13 福州大学 A kind of insulator breakdown detection method based on Fast R-CNN
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN109255351A (en) * 2018-09-05 2019-01-22 华南理工大学 Bounding box homing method, system, equipment and medium based on Three dimensional convolution neural network
CN109961034A (en) * 2019-03-18 2019-07-02 西安电子科技大学 Video object detection method based on convolution gating cycle neural unit
CN110458864A (en) * 2019-07-02 2019-11-15 南京邮电大学 Based on the method for tracking target and target tracker for integrating semantic knowledge and example aspects
CN110399808A (en) * 2019-07-05 2019-11-01 桂林安维科技有限公司 A kind of Human bodys' response method and system based on multiple target tracking
CN110781350A (en) * 2019-09-26 2020-02-11 武汉大学 Pedestrian retrieval method and system oriented to full-picture monitoring scene

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
谢英红;庞彦伟;韩晓微;田丹: "Target tracking based on Grassmann manifold and projection group", 仪器仪表学报 (Chinese Journal of Scientific Instrument), no. 05
郭强;芦晓红;谢英红;孙鹏: "Efficient visual target tracking algorithm based on deep spectral convolutional neural networks", 红外与激光工程 (Infrared and Laser Engineering), no. 06
高琳;王俊峰;范勇;陈念年: "Robust visual tracking based on convolutional neural networks and consistency predictors", 光学学报 (Acta Optica Sinica), vol. 37, no. 8

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022037587A1 (en) * 2020-08-19 2022-02-24 Zhejiang Dahua Technology Co., Ltd. Methods and systems for video processing
WO2022133911A1 (en) * 2020-12-24 2022-06-30 深圳市大疆创新科技有限公司 Target detection method and apparatus, movable platform, and computer-readable storage medium

Also Published As

Publication number Publication date
CN111428567B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
US11144889B2 (en) Automatic assessment of damage and repair costs in vehicles
CN112926410B (en) Target tracking method, device, storage medium and intelligent video system
CN109035304B (en) Target tracking method, medium, computing device and apparatus
CN107748873B (en) A kind of multimodal method for tracking target merging background information
US8948454B2 (en) Boosting object detection performance in videos
CN107529650B (en) Closed loop detection method and device and computer equipment
CN111797893A (en) Neural network training method, image classification system and related equipment
CN108805016B (en) Head and shoulder area detection method and device
CN110633745A (en) Image classification training method and device based on artificial intelligence and storage medium
CN109800682B (en) Driver attribute identification method and related product
Ye et al. A two-stage real-time YOLOv2-based road marking detector with lightweight spatial transformation-invariant classification
CN111931764A (en) Target detection method, target detection framework and related equipment
CN111626295B (en) Training method and device for license plate detection model
CN111401143A (en) Pedestrian tracking system and method
US20190311216A1 (en) Image processing device, image processing method, and image processing program
CN111428566B (en) Deformation target tracking system and method
CN113673505A (en) Example segmentation model training method, device and system and storage medium
CN111428567A (en) Pedestrian tracking system and method based on affine multi-task regression
CN113160117A (en) Three-dimensional point cloud target detection method under automatic driving scene
Alsanad et al. Real-time fuel truck detection algorithm based on deep convolutional neural network
CN116453109A (en) 3D target detection method, device, equipment and storage medium
Fan et al. Covered vehicle detection in autonomous driving based on faster rcnn
Mehta et al. Identifying most walkable direction for navigation in an outdoor environment
CN112487927B (en) Method and system for realizing indoor scene recognition based on object associated attention
CN111881833A (en) Vehicle detection method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant