CN107993250A - Fast multi-target pedestrian tracking and analysis method and intelligent apparatus therefor - Google Patents

Fast multi-target pedestrian tracking and analysis method and intelligent apparatus therefor Download PDF

Info

Publication number
CN107993250A
CN107993250A (application CN201710822109.7A)
Authority
CN
China
Prior art keywords
picture
region
frame
feature
target data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710822109.7A
Other languages
Chinese (zh)
Inventor
何智群 (He Zhiqun)
董远 (Dong Yuan)
白洪亮 (Bai Hongliang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Faceall Co
Original Assignee
Beijing Faceall Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Faceall Co filed Critical Beijing Faceall Co
Priority to CN201710822109.7A priority Critical patent/CN107993250A/en
Publication of CN107993250A publication Critical patent/CN107993250A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4084 Scaling of whole images or parts thereof, e.g. expanding or contracting in the transform domain, e.g. fast Fourier transform [FFT] domain scaling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/262 Analysis of motion using transform domain methods, e.g. Fourier domain methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a fast multi-target pedestrian tracking and analysis method and an intelligent apparatus therefor. In the first frame, the target box region is cropped to obtain a target data picture; the features of the target data picture are extracted by a convolutional neural network; target data filter parameters are computed from those features by a filter, and a prediction box is obtained. In the next frame, the picture is cropped by the prediction box to obtain a search region picture, whose features are extracted by the convolutional neural network; search region filter parameters are computed from those features by the filter; and the maximum-response region is taken as the final position box. Training with these steps yields an efficient, fast CNN tracking model, which eliminates the repeated computation of the processing pipeline in practical application and solves the problem of accurate, fast tracking and analysis of multiple pedestrian targets.

Description

Fast multi-target pedestrian tracking and analysis method and intelligent apparatus therefor
Technical field
The present invention relates to a fast multi-target pedestrian tracking and analysis method and an intelligent apparatus therefor.
Background technology
At present, deep-learning tracking algorithms suffer from considerable network redundancy: tracking is slow, models are large and hard to put into practice, real-time tracking is impossible, and multi-target tracking even more so. Because the model recomputes and re-extracts features on every invocation, redundant computation accumulates and reaction speed is severely limited. A correlation-filtering-based video object tracking method and intelligent apparatus are therefore urgently needed to solve the problem of accurate, fast tracking and analysis of multiple pedestrian targets.
Summary of the invention
The technical problem to be solved by the present invention is that traditional tracking algorithms have considerable network redundancy, slow tracking speed, and large models; they are difficult to put into practice, cannot track in real time, and are even less able to track multiple targets.
To solve the above technical problem, the present invention provides a correlation-filtering-based video object tracking method, which includes:
Scaling the video frame sequence to the same scale;
In the first frame, cropping the target box region to obtain a target data picture;
Extracting the features of the target data picture via a convolutional neural network;
Computing the target data filter parameters from the features of the target data picture via a filter, and obtaining a prediction box from the target data filter parameters;
In the next frame, cropping the picture by the prediction box to obtain a search region picture;
Extracting the features of the search region picture via the convolutional neural network;
Computing the search region filter parameters from the features of the search region picture via the filter;
Obtaining the maximum-response region as the final position box from the search region filter parameters.
Further, from the features of the search region picture, x ∈ R^{M×N×D} is obtained together with a Gaussian-shaped response label y ∈ R^{M×N} centered on the target, where M denotes the width of the spatial dimensions, N the height of the spatial dimensions, and D the number of feature channels;
The filter parameters are obtained from the formula ŵ_l = (ŷ* ⊙ x̂_l) / (Σ_{k=1}^{D} x̂_k* ⊙ x̂_k + λ), where w_l is the filter parameter of channel l, ★ denotes the circular correlation operation, the parameter λ ≥ 0 is the regularization parameter, ŷ denotes the discrete Fourier transform of y, y* denotes the complex conjugate of y, and ⊙ denotes element-wise multiplication;
The maximum-response region is obtained as the final position box from the formula g = F⁻¹(Σ_{l=1}^{D} ŵ_l* ⊙ ẑ_l), where ẑ_l is the Fourier transform of channel l of the search region features.
Further, the convolutional neural network is a three-layer, parameter-sharing Siamese (twin) convolutional neural network without zero padding.
Further, the frame picture region is enlarged 2.5 times before the target box region is cropped, and the frame picture region is enlarged 2.5 times before the prediction box crops the picture.
The present invention also provides an intelligent apparatus, which includes:
a scaling unit, for scaling the video frame sequence to the same scale;
a cropping unit, for cropping the target box region in the first frame to obtain a target data picture;
an extraction unit, for extracting the features of the target data picture via a convolutional neural network;
a computing unit, for computing the target data filter parameters from the features of the target data picture via a filter, and obtaining a prediction box from the target data filter parameters;
the cropping unit is further used, in the next frame, to crop the picture by the prediction box to obtain a search region picture;
the extraction unit is further used to extract the features of the search region picture via the convolutional neural network;
the computing unit is further used to compute the search region filter parameters from the features of the search region picture via the filter;
an acquiring unit, for obtaining the maximum-response region as the final position box from the search region filter parameters.
Further, the extraction unit is also used to obtain, from the features of the search region picture, x ∈ R^{M×N×D} and a Gaussian-shaped response label y ∈ R^{M×N} centered on the target, where M denotes the width of the spatial dimensions, N the height of the spatial dimensions, and D the number of feature channels;
the computing unit is also used to obtain the filter parameters from the formula ŵ_l = (ŷ* ⊙ x̂_l) / (Σ_{k=1}^{D} x̂_k* ⊙ x̂_k + λ), where w_l is the filter parameter of channel l, ★ denotes the circular correlation operation, the parameter λ ≥ 0 is the regularization parameter, ŷ denotes the discrete Fourier transform of y, y* denotes the complex conjugate of y, and ⊙ denotes element-wise multiplication;
the acquiring unit is also used to obtain the maximum-response region from the formula g = F⁻¹(Σ_{l=1}^{D} ŵ_l* ⊙ ẑ_l).
Further, in the extraction unit, the convolutional neural network used to extract features is a three-layer, parameter-sharing Siamese convolutional neural network without zero padding.
Further, the scaling unit is also used to enlarge the first frame picture 2.5 times before the target box region is cropped, and to enlarge the second frame picture 2.5 times before the prediction box crops the picture.
Beneficial effects of the present invention:
1. High speed: tracking speed exceeds 100 fps for a single target on an i5 CPU, against the 25 fps frame rate of current mainstream video, so the algorithm can track target objects in video in real time.
2. Small model: the model size is 76k; for a system based on convolutional neural networks, such a lightweight model lets the system be used readily in embedded devices.
3. High accuracy: while remaining fast and small, the tracking model reaches very high accuracy on the standard tracking datasets OTB (object tracking benchmark) and VOT (visual object tracking), and is fully suited to real-world pedestrian tracking scenes.
4. Multi-target tracking: current mainstream tracking models struggle to track multiple targets; the tracking system designed by the present invention can track multiple targets in real time.
5. End-to-end training: the training framework of the tracking system combines convolutional neural networks with correlation filtering, reducing tedious manual parameter tuning and achieving better performance.
Brief description of the drawings
Fig. 1 is the flow chart of the correlation-filtering-based video object tracking method of one embodiment of the application;
Fig. 2 is the architecture diagram of the correlation-filtering-based video object tracking intelligent apparatus of another embodiment of the application;
Fig. 3 is the overall flow diagram of the training process of the application;
Fig. 4 is the flow diagram of a concrete application of the first embodiment of the application;
Fig. 5 is the flow diagram of a concrete application of the second embodiment of the application;
Fig. 6 is the first use-state diagram of a concrete image application of the application;
Fig. 7 is the second use-state diagram of a concrete image application of the application;
Fig. 8 is the third use-state diagram of a concrete image application of the application;
Detailed description of the embodiments:
The following embodiments are merely examples given to illustrate the invention clearly and do not limit its embodiments. For those of ordinary skill in the art, changes of other forms may be made on the basis of the following description, and obvious changes or variations derived from the spirit of the invention still fall within the protection scope of the invention.
The steps shown in the flow charts of the accompanying drawings may be performed in a computer system such as a set of computer-executable instructions. Moreover, although a logical order is shown in the flow charts, in some cases the steps shown or described may be performed in an order different from the one herein.
Present deep-learning tracking algorithms suffer from considerable network redundancy, slow tracking, large models that are hard to put into practice, no real-time tracking, and no multi-target tracking. In the present invention, a lightweight network structure and an end-to-end training method for deep-learning-based correlation filtering avoid the drawback of traditional filtering methods that require manual parameter tuning. Feature extraction is performed by a lightweight, three-layer Siamese deep-learning network, which guarantees speed while greatly reducing model capacity. Correlation filtering applies multi-scale analysis to the features and can output the target position accurately.
As shown in Fig. 3, the process is discussed in three parts: data processing, feature extraction, and correlation filtering:
1. Data processing
The video frame sequence is scaled to the same scale. For training pictures, a region 2.5 times the target box, centered on the box position, is cropped and used as the training picture. At test time, in the first frame a region 2.5 times the target box is cropped; the crop is passed through the convolutional neural network and the filter parameters are computed. From the second frame onward, a region 2.5 times the previous frame's prediction box is cropped from the current frame; the crop is likewise passed through the convolutional neural network, and the final box position is computed using the previous frame's filter parameters. During tracking, the filter parameters are updated every 5 frames, so that the filter continually adapts to changes in the target's appearance, illumination, and surrounding background.
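To make the 2.5-times crop concrete, the following Python sketch crops a window 2.5 times the target box, pads at the image border, and resizes to a fixed network input size (the helper name, the OpenCV usage, and the 125-pixel input size are illustrative assumptions, not specified by the patent):

```python
import cv2

def crop_search_region(frame, box, scale=2.5, out_size=125):
    """Crop a window `scale` times the box (cx, cy, w, h), centered on the box."""
    cx, cy, w, h = box
    win_w, win_h = w * scale, h * scale
    x1 = int(round(cx - win_w / 2)); y1 = int(round(cy - win_h / 2))
    x2 = int(round(cx + win_w / 2)); y2 = int(round(cy + win_h / 2))
    H, W = frame.shape[:2]
    # Replicate border pixels when the window extends past the image.
    pad_l, pad_t = max(0, -x1), max(0, -y1)
    pad_r, pad_b = max(0, x2 - W), max(0, y2 - H)
    if pad_l or pad_t or pad_r or pad_b:
        frame = cv2.copyMakeBorder(frame, pad_t, pad_b, pad_l, pad_r,
                                   cv2.BORDER_REPLICATE)
        x1 += pad_l; x2 += pad_l; y1 += pad_t; y2 += pad_t
    return cv2.resize(frame[y1:y2, x1:x2], (out_size, out_size))
```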
2. Feature extraction
During training, a three-layer, parameter-sharing Siamese convolutional neural network without zero padding extracts features from the target data and from the search region respectively. In this way two feature matrices are obtained, and the correlation filtering computation is then performed on them; the training label is a Gaussian-shaped feature map of the same size as the input data, whose response is maximal in the target region. During testing, a crop is taken only around the previous frame's prediction box enlarged 2.5 times, yielding the search region; the search region picture data are fed into the three-layer convolutional neural network to obtain the final feature matrix, and convolving the features with the correlation filter gives the frame's final prediction box position.
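As an illustration of such a network, here is a minimal PyTorch sketch of a three-layer, weight-sharing feature extractor with no zero padding; the kernel sizes and channel widths are assumptions, since the patent fixes only the depth, the weight sharing, and padding = 0:

```python
import torch.nn as nn

class SiameseFeatures(nn.Module):
    """Three conv layers, padding=0, shared between the two branches."""
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, padding=0),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=0),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=0),  # D output channels
        )

    def forward(self, target_patch, search_patch):
        # The same weights embed both inputs: this sharing is what makes
        # the network a Siamese ("twin") network.
        return self.body(target_patch), self.body(search_patch)
```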
3. Correlation filtering
From the feature extraction in 2, x ∈ R^{M×N×D} is obtained, together with a Gaussian-shaped response label y ∈ R^{M×N} centered on the target, where M denotes the width of the spatial dimensions, N the height of the spatial dimensions, and D the number of feature channels. The purpose of correlation filtering is to find filter parameters that minimize the ridge-regression loss function ε:

ε = ‖ Σ_{l=1}^{D} w_l ★ x_l − y ‖² + λ Σ_{l=1}^{D} ‖w_l‖²

where w_l is the filter parameter of channel l, ★ denotes the circular correlation operation (a correlation computed over cyclic shifts of w and x), and the parameter λ ≥ 0 is the regularization parameter. The solution of this equation can be expressed as:

ŵ_l = (ŷ* ⊙ x̂_l) / (Σ_{k=1}^{D} x̂_k* ⊙ x̂_k + λ)

Here ŷ denotes the discrete Fourier transform of y, y* the complex conjugate of y, and ⊙ the element-wise multiplication of matrices. In the detection process, features z ∈ R^{M×N×D} are extracted from the search region, and the maximum-response region is obtained via the following formula:

g = F⁻¹( Σ_{l=1}^{D} ŵ_l* ⊙ ẑ_l )
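The closed-form solution and the detection response can be transcribed directly into numpy. The sketch below assumes channel-first features of shape (D, M, N); it is a plain restatement of the two formulas above, not the patent's implementation:

```python
import numpy as np

def solve_filter(x, y, lam=1e-4):
    """w_hat_l = (y_hat* ⊙ x_hat_l) / (Σ_k x_hat_k* ⊙ x_hat_k + λ) for x of shape (D, M, N)."""
    X = np.fft.fft2(x, axes=(-2, -1))            # x̂_l for every channel l
    Y = np.fft.fft2(y)                           # ŷ
    denom = (np.conj(X) * X).sum(axis=0) + lam   # Σ_k x̂_k* ⊙ x̂_k + λ
    return np.conj(Y)[None] * X / denom[None]

def response_peak(w_hat, z):
    """g = F⁻¹(Σ_l ŵ_l* ⊙ ẑ_l); returns the (row, col) of the maximum response."""
    Z = np.fft.fft2(z, axes=(-2, -1))
    g = np.fft.ifft2((np.conj(w_hat) * Z).sum(axis=0)).real
    return np.unravel_index(g.argmax(), g.shape)
```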
In another embodiment of the invention, as shown in Fig. 1, the present invention provides a correlation-filtering-based video object tracking method, which includes:
S101, scaling the video frame sequence to the same scale;
S102, in the first frame, cropping the target box region to obtain a target data picture;
S103, extracting the features of the target data picture via a convolutional neural network;
S104, computing the target data filter parameters from the features of the target data picture via a filter, and obtaining a prediction box from the target data filter parameters;
S105, in the next frame, cropping the picture by the prediction box to obtain a search region picture;
S106, extracting the features of the search region picture via the convolutional neural network;
S107, computing the search region filter parameters from the features of the search region picture via the filter;
S108, obtaining the maximum-response region as the final position box from the search region filter parameters.
First, every frame of the video is scaled to the same scale, and a crop is taken centered on the target box position as the training picture. Features are extracted from the generated training crop (the target data picture above) by the convolutional neural network; the filter computation is run on the extracted features to obtain the target data filter parameters, and from them a prediction box is inferred. The prediction box is a hypothesis: the region where the target is considered most likely to appear. A further crop is taken through this region in the next frame, yielding the search region picture; features are extracted from it by the convolutional neural network, the search region filter parameters are computed again by the filter, and the maximum-response region is obtained as the final position box from the search region filter parameters. Throughout this process the filter parameters are updated every 5 frames, so the filter continually adapts to changes in the target's appearance, illumination, and surrounding background. A skeleton of this loop is sketched below.
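The following Python skeleton ties the steps together, including the 5-frame filter refresh; extract, gaussian_label, and shift_box are hypothetical helpers, and the linear blending rate is an assumed update rule, since the patent states only that the parameters are refreshed every 5 frames (crop_search_region, solve_filter, and response_peak are the sketches above):

```python
def track(frames, init_box, net, rate=0.01):
    """Yield one predicted box per frame after the first."""
    box = init_box
    feat = extract(net, crop_search_region(frames[0], box))     # hypothetical CNN wrapper
    w_hat = solve_filter(feat, gaussian_label(feat.shape[1:]))  # hypothetical Gaussian label
    for i, frame in enumerate(frames[1:], start=1):
        z = extract(net, crop_search_region(frame, box))
        dy, dx = response_peak(w_hat, z)
        box = shift_box(box, dy, dx)       # hypothetical: map peak offset to image coords
        if i % 5 == 0:                     # refresh the filter every 5 frames
            f = extract(net, crop_search_region(frame, box))
            w_new = solve_filter(f, gaussian_label(f.shape[1:]))
            w_hat = (1 - rate) * w_hat + rate * w_new
        yield box
```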
In another alternative embodiment, the specific computation is explained; the method further includes:
from the features of the search region picture, x ∈ R^{M×N×D} is obtained together with a Gaussian-shaped response label y ∈ R^{M×N} centered on the target, where M denotes the width of the spatial dimensions, N the height of the spatial dimensions, and D the number of feature channels;
the filter parameters are obtained from the formula ŵ_l = (ŷ* ⊙ x̂_l) / (Σ_{k=1}^{D} x̂_k* ⊙ x̂_k + λ), where w_l is the filter parameter of channel l, ★ denotes the circular correlation operation (a correlation computed over cyclic shifts of w and x), the parameter λ ≥ 0 is the regularization parameter, ŷ denotes the discrete Fourier transform of y, y* denotes the complex conjugate of y, and ⊙ denotes element-wise multiplication;
the maximum-response region is obtained from the formula g = F⁻¹(Σ_{l=1}^{D} ŵ_l* ⊙ ẑ_l).
In fact, the two rounds of feature extraction and filter-parameter computation are the same; the difference is that the second time features are extracted, the maximum-response region is also obtained as the final box position.
In another alternative embodiment, as shown in Fig. 3, the method further includes:
the convolutional neural network is a three-layer, parameter-sharing Siamese convolutional neural network without zero padding; this three-layer structure is a lightweight convolutional neural network. "Siamese" refers to two convolutional neural network branches that extract features separately; the filter computation then yields the filter parameters, from which the specific box region is obtained.
In another alternative embodiment, the method further includes:
the frame picture is enlarged 2.5 times before the target box region is cropped, and the frame picture is enlarged 2.5 times before the prediction box crops the picture. Enlarging before cropping lays the groundwork for the subsequent feature extraction. After the above training process, an efficient and fast CNN tracking model (the feature framework) is obtained. Using this tracking model in the subsequent service stage, features are extracted directly from the processed picture, eliminating the intermediate recomputation and greatly improving working efficiency.
Embodiments applying the above model are illustrated as follows:
Example 1: the tracking system shown in Fig. 4. First, the target positions in the first frame are detected with a target detection technique, and the multiple targets to be tracked are added to the tracking queue. The next frame picture is input, the tracking queue is traversed, and the tracking algorithm is invoked for each tracked object to obtain the target's position in the next frame. Once the target's position in the next frame is obtained, a threshold decides whether the target has left the frame; if it has, the target is removed from the tracking queue.
Every 24 frames, target detection is invoked once, and the IOU between the detection results and the tracking results is computed. If the IOU between some detection result and all tracked targets is < 0.1, a new target is considered to have entered the frame, and that target is added to the tracking queue. If the IOU > 0.5, the detection box replaces the tracking box as a position correction. Here IOU (intersection over union) is the intersection of two sets divided by their union; a sketch follows.
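The IOU used in the matching step is the standard box overlap; a minimal sketch for corner-format boxes (x1, y1, x2, y2):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)
```

With this helper, a detection whose IOU with every tracked box is below 0.1 opens a new track, while an IOU above 0.5 overwrites the tracked box with the detected one.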
The conditions for judging whether a target has left the frame (meeting any one suffices) are:
predict_score < threshold
h/w > threshold1
w/h > threshold2
|x1|/W < threshold3
|W - x2|/W < threshold3
|y1|/H < threshold4
|H - y2|/H < threshold4
where h and w are the height and width of the object, H and W are the height and width of the frame, (x1, y1) is the coordinate of the target's top-left corner, and (x2, y2) is the coordinate of the target's bottom-right corner.
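Transcribed into Python, the test reads as below; the threshold values are placeholders, since the patent does not specify them, and the y2 term mirrors the symmetric x-side conditions:

```python
def left_frame(score, box, frame_w, frame_h,
               t=0.3, t1=4.0, t2=4.0, t3=0.02, t4=0.02):
    """True if the tracked box should be dropped (low score, degenerate
    aspect ratio, or touching a frame edge)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return (score < t
            or h / w > t1                          # too tall and thin
            or w / h > t2                          # too short and wide
            or abs(x1) / frame_w < t3              # touching the left edge
            or abs(frame_w - x2) / frame_w < t3    # touching the right edge
            or abs(y1) / frame_h < t4              # touching the top edge
            or abs(frame_h - y2) / frame_h < t4)   # touching the bottom edge
```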
Example 2: the quality evaluation system shown in Fig. 5. Our tracking can follow a target for a long time, and there is substantial redundancy between video frames; therefore, what we store is simply the highest-quality, most representative image of the target within a period of time.
As shown in Figs. 6-8, for a video of a pedestrian crossing the road, the three figures show the effect at the 1st, 20th, and 40th frames respectively. We first run pedestrian detection on the first frame, then track the pedestrian target through the following frames. Over a period of time, our quality evaluation algorithm is invoked to make an assessment and select the pedestrian result for storage. Here, since the target appears in profile in every frame, we store the small picture with the highest quality score.
The present invention also provides an intelligent apparatus, as shown in Fig. 2, which includes:
a scaling unit, for scaling the video frame sequence to the same scale;
a cropping unit, for cropping the target box region in the first frame to obtain a target data picture;
an extraction unit, for extracting the features of the target data picture via a convolutional neural network;
a computing unit, for computing the target data filter parameters from the features of the target data picture via a filter, and obtaining a prediction box from the target data filter parameters;
the cropping unit is further used, in the next frame, to crop the picture by the prediction box to obtain a search region picture;
the extraction unit is further used to extract the features of the search region picture via the convolutional neural network;
the computing unit is further used to compute the search region filter parameters from the features of the search region picture via the filter;
an acquiring unit, for obtaining the maximum-response region as the final position box from the search region filter parameters.
Further, the extraction unit is also used to obtain, from the features of the search region picture, x ∈ R^{M×N×D} and a Gaussian-shaped response label y ∈ R^{M×N} centered on the target, where M denotes the width of the spatial dimensions, N the height of the spatial dimensions, and D the number of feature channels;
the computing unit is also used to obtain the filter parameters from the formula ŵ_l = (ŷ* ⊙ x̂_l) / (Σ_{k=1}^{D} x̂_k* ⊙ x̂_k + λ), where w_l is the filter parameter of channel l, ★ denotes the circular correlation operation, the parameter λ ≥ 0 is the regularization parameter, ŷ denotes the discrete Fourier transform of y, y* denotes the complex conjugate of y, and ⊙ denotes element-wise multiplication;
the acquiring unit is also used to obtain the maximum-response region from the formula g = F⁻¹(Σ_{l=1}^{D} ŵ_l* ⊙ ẑ_l).
Further, in the extraction unit, the convolutional neural network used to extract features is a three-layer, parameter-sharing Siamese convolutional neural network without zero padding.
Further, the scaling unit is also used to enlarge the first frame picture 2.5 times before the target box region is cropped, and to enlarge the second frame picture 2.5 times before the prediction box crops the picture.
Although embodiments are disclosed above, they are merely embodiments adopted to facilitate understanding of the present invention and do not limit it. Any person skilled in the field of the present invention may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed by the present invention, but the patent protection scope of the present invention shall still be subject to the scope defined by the appended claims.

Claims (8)

1. A fast multi-target pedestrian tracking and analysis method, characterized in that the method includes:
scaling the video frame sequence to the same scale;
in the first frame, cropping the target box region to obtain a target data picture;
extracting the features of the target data picture via a convolutional neural network;
computing the target data filter parameters from the features of the target data picture via a filter, and obtaining a prediction box from the target data filter parameters;
in the next frame, cropping the picture by the prediction box to obtain a search region picture;
extracting the features of the search region picture via the convolutional neural network;
computing the search region filter parameters from the features of the search region picture via the filter;
obtaining the maximum-response region as the final position box from the search region filter parameters.
2. The fast multi-target pedestrian tracking and analysis method according to claim 1, in which the search region filter parameters are computed from the features of the search region picture via the filter and the maximum-response region is obtained as the final position box from the search region filter parameters, characterized in that the method further includes:
from the features of the search region picture, x ∈ R^{M×N×D} is obtained together with a Gaussian-shaped response label y ∈ R^{M×N} centered on the target, where M denotes the width of the spatial dimensions, N the height of the spatial dimensions, and D the number of feature channels;
the filter parameters are obtained from the formula ŵ_l = (ŷ* ⊙ x̂_l) / (Σ_{k=1}^{D} x̂_k* ⊙ x̂_k + λ), where w_l is the filter parameter of channel l, ★ denotes the circular correlation operation, the parameter λ ≥ 0 is the regularization parameter, ŷ denotes the discrete Fourier transform of y, y* denotes the complex conjugate of y, and ⊙ denotes element-wise multiplication;
the maximum-response region is obtained as the final position box from the formula g = F⁻¹(Σ_{l=1}^{D} ŵ_l* ⊙ ẑ_l).
3. The fast multi-target pedestrian tracking and analysis method according to claim 1, characterized in that the method further includes:
the convolutional neural network is a three-layer, parameter-sharing Siamese convolutional neural network without zero padding.
4. The fast multi-target pedestrian tracking and analysis method according to claim 1, characterized in that the method further includes:
the frame picture is enlarged 2.5 times before the target box region is cropped;
the frame picture is enlarged 2.5 times before the prediction box crops the picture.
5. An intelligent apparatus, characterized in that it includes:
a scaling unit, for scaling the video frame sequence to the same scale;
a cropping unit, for cropping the target box region in the first frame to obtain a target data picture;
an extraction unit, for extracting the features of the target data picture via a convolutional neural network;
a computing unit, for computing the target data filter parameters from the features of the target data picture via a filter, and obtaining a prediction box from the target data filter parameters;
the cropping unit is further used, in the next frame, to crop the picture by the prediction box to obtain a search region picture;
the extraction unit is further used to extract the features of the search region picture via the convolutional neural network;
the computing unit is further used to compute the search region filter parameters from the features of the search region picture via the filter;
an acquiring unit, for obtaining the maximum-response region as the final position box from the search region filter parameters.
6. The intelligent apparatus according to claim 5, characterized in that it further includes:
the extraction unit is also used to obtain, from the features of the search region picture, x ∈ R^{M×N×D} and a Gaussian-shaped response label y ∈ R^{M×N} centered on the target, where M denotes the width of the spatial dimensions, N the height of the spatial dimensions, and D the number of feature channels;
the computing unit is also used to obtain the filter parameters from the formula ŵ_l = (ŷ* ⊙ x̂_l) / (Σ_{k=1}^{D} x̂_k* ⊙ x̂_k + λ), where w_l is the filter parameter of channel l, ★ denotes the circular correlation operation, the parameter λ ≥ 0 is the regularization parameter, ŷ denotes the discrete Fourier transform of y, y* denotes the complex conjugate of y, and ⊙ denotes element-wise multiplication;
the acquiring unit is also used to obtain the maximum-response region as the final position box from the formula g = F⁻¹(Σ_{l=1}^{D} ŵ_l* ⊙ ẑ_l).
7. The intelligent apparatus according to claim 5, characterized in that it further includes:
in the extraction unit, the convolutional neural network used to extract features is a three-layer, parameter-sharing Siamese convolutional neural network without zero padding.
8. The intelligent apparatus according to claim 5, characterized in that:
the scaling unit is also used to enlarge the first frame picture 2.5 times before the target box region is cropped, and to enlarge the second frame picture 2.5 times before the prediction box crops the picture.
CN201710822109.7A 2017-09-12 2017-09-12 Fast multi-target pedestrian tracking and analysis method and intelligent apparatus therefor Pending CN107993250A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710822109.7A CN107993250A (en) 2017-09-12 2017-09-12 Fast multi-target pedestrian tracking and analysis method and intelligent apparatus therefor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710822109.7A CN107993250A (en) 2017-09-12 2017-09-12 Fast multi-target pedestrian tracking and analysis method and intelligent apparatus therefor

Publications (1)

Publication Number Publication Date
CN107993250A true CN107993250A (en) 2018-05-04

Family

ID=62029754

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710822109.7A Pending CN107993250A (en) Fast multi-target pedestrian tracking and analysis method and intelligent apparatus therefor

Country Status (1)

Country Link
CN (1) CN107993250A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830170A (en) * 2018-05-24 2018-11-16 杭州电子科技大学 End-to-end target tracking method based on layered feature representation
CN108846344A (en) * 2018-06-05 2018-11-20 中南大学 Pedestrian posture multi-feature intelligent identification method fusing deep learning
CN109063759A (en) * 2018-07-20 2018-12-21 浙江大学 Neural network structure search method applied to multi-attribute prediction of pictures
CN109064493A (en) * 2018-08-01 2018-12-21 北京飞搜科技有限公司 Target tracking method and device based on meta-learning
CN109101932A (en) * 2018-08-17 2018-12-28 佛山市顺德区中山大学研究院 Deep learning algorithm fusing multi-task and proximity information based on target detection
CN109446889A (en) * 2018-09-10 2019-03-08 北京飞搜科技有限公司 Object tracking method and device based on Siamese matching network
CN109727272A (en) * 2018-11-20 2019-05-07 南京邮电大学 Target tracking method based on dual-branch spatio-temporal regularized correlation filters
CN109727246A (en) * 2019-01-26 2019-05-07 福州大学 Contrastive-learning image quality evaluation method based on Siamese network
CN109886996A (en) * 2019-01-15 2019-06-14 东华大学 Visual tracking optimization method
CN110633594A (en) * 2018-06-21 2019-12-31 北京京东尚科信息技术有限公司 Target detection method and device
CN111951298A (en) * 2020-06-25 2020-11-17 湖南大学 Target tracking method fusing time-series information


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1924894A (en) * 2006-09-27 2007-03-07 北京中星微电子有限公司 Multi-pose face detection and tracking system and method
CN103679687A (en) * 2012-09-18 2014-03-26 杭州海康威视数字技术股份有限公司 Target tracking method for an intelligent-tracking high-speed dome camera
CN103985136A (en) * 2014-03-21 2014-08-13 南京大学 Target tracking method based on local feature point feature flow pattern
CN106909625A (en) * 2017-01-20 2017-06-30 清华大学 Image retrieval method and system based on Siamese networks
CN106875425A (en) * 2017-01-22 2017-06-20 北京飞搜科技有限公司 Multi-target tracking system and implementation method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Qiang Wang, Jin Gao, Junliang Xing, Mengdan Zhang, Weiming Hu: "DCFNet: Discriminant Correlation Filters Network for Visual Tracking", arXiv *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830170A (en) * 2018-05-24 2018-11-16 杭州电子科技大学 End-to-end target tracking method based on layered feature representation
CN108830170B (en) * 2018-05-24 2022-03-18 杭州电子科技大学 End-to-end target tracking method based on layered feature representation
CN108846344A (en) * 2018-06-05 2018-11-20 中南大学 Pedestrian posture multi-feature intelligent identification method fusing deep learning
CN108846344B (en) * 2018-06-05 2022-05-17 中南大学 Pedestrian posture multi-feature intelligent identification method fusing deep learning
CN110633594A (en) * 2018-06-21 2019-12-31 北京京东尚科信息技术有限公司 Target detection method and device
CN109063759A (en) * 2018-07-20 2018-12-21 浙江大学 Neural network structure search method applied to multi-attribute prediction of pictures
CN109064493A (en) * 2018-08-01 2018-12-21 北京飞搜科技有限公司 Target tracking method and device based on meta-learning
CN109064493B (en) * 2018-08-01 2021-03-09 苏州飞搜科技有限公司 Target tracking method and device based on meta-learning
CN109101932B (en) * 2018-08-17 2020-07-24 佛山市顺德区中山大学研究院 Multi-task and proximity information fusion deep learning method based on target detection
CN109101932A (en) * 2018-08-17 2018-12-28 佛山市顺德区中山大学研究院 Multi-task and proximity information fusion deep learning method based on target detection
CN109446889A (en) * 2018-09-10 2019-03-08 北京飞搜科技有限公司 Object tracking method and device based on Siamese matching network
CN109446889B (en) * 2018-09-10 2021-03-09 苏州飞搜科技有限公司 Object tracking method and device based on Siamese matching network
CN109727272A (en) * 2018-11-20 2019-05-07 南京邮电大学 Target tracking method based on dual-branch spatio-temporal regularized correlation filters
CN109727272B (en) * 2018-11-20 2022-08-12 南京邮电大学 Target tracking method based on dual-branch spatio-temporal regularized correlation filter
CN109886996A (en) * 2019-01-15 2019-06-14 东华大学 Visual tracking optimization method
CN109886996B (en) * 2019-01-15 2023-06-06 东华大学 Visual tracking optimization method
CN109727246A (en) * 2019-01-26 2019-05-07 福州大学 Contrastive-learning image quality evaluation method based on Siamese network
CN109727246B (en) * 2019-01-26 2022-05-13 福州大学 Contrastive-learning image quality evaluation method based on Siamese network
CN111951298A (en) * 2020-06-25 2020-11-17 湖南大学 Target tracking method fusing time-series information
CN111951298B (en) * 2020-06-25 2024-03-08 湖南大学 Target tracking method fusing time-series information

Similar Documents

Publication Publication Date Title
CN107993250A (en) Fast multi-target pedestrian tracking and analysis method and intelligent apparatus therefor
CN111259850B (en) Pedestrian re-identification method integrating random batch mask and multi-scale representation learning
CN106096577A (en) Target tracking system in a kind of photographic head distribution map and method for tracing
CN104573731B (en) Fast target detection method based on convolutional neural networks
CN110427905A (en) Pedestrian tracting method, device and terminal
CN109816689A (en) A kind of motion target tracking method that multilayer convolution feature adaptively merges
CN107644430A (en) Target following based on self-adaptive features fusion
CN109829443A (en) Video behavior recognition methods based on image enhancement Yu 3D convolutional neural networks
CN109271888A (en) Personal identification method, device, electronic equipment based on gait
CN108665481A (en) Multilayer depth characteristic fusion it is adaptive resist block infrared object tracking method
CN107341805A (en) Background segment and network model training, image processing method and device before image
CN107169994A (en) Correlation filtering tracking based on multi-feature fusion
CN108334847A (en) A kind of face identification method based on deep learning under real scene
CN108229282A (en) Critical point detection method, apparatus, storage medium and electronic equipment
CN109241871A (en) A kind of public domain stream of people's tracking based on video data
CN109558815A (en) A kind of detection of real time multi-human face and tracking
CN108121945A (en) A kind of multi-target detection tracking, electronic equipment and storage medium
CN108446694A (en) A kind of object detection method and device
CN108121931A (en) two-dimensional code data processing method, device and mobile terminal
CN107463881A (en) A kind of character image searching method based on depth enhancing study
CN110674886B (en) Video target detection method fusing multi-level features
Liu et al. ACDnet: An action detection network for real-time edge computing based on flow-guided feature approximation and memory aggregation
CN112766123B (en) Crowd counting method and system based on criss-cross attention network
CN108830170A (en) A kind of end-to-end method for tracking target indicated based on layered characteristic
Dong et al. Lightweight and efficient neural network with SPSA attention for wheat ear detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180504