CN110298248A - Multi-object tracking method and system based on semantic segmentation - Google Patents
Multi-object tracking method and system based on semantic segmentation
- Publication number
- CN110298248A CN110298248A CN201910444189.6A CN201910444189A CN110298248A CN 110298248 A CN110298248 A CN 110298248A CN 201910444189 A CN201910444189 A CN 201910444189A CN 110298248 A CN110298248 A CN 110298248A
- Authority
- CN
- China
- Prior art keywords
- goal
- sub
- bounding box
- target
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/194—Segmentation; Edge detection involving foreground-background segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The present invention provides a multi-object tracking method based on semantic segmentation. The method reads a video or image sequence and localizes each target in every frame with a bounding box; applies semantic segmentation to each bounding box to divide it, at pixel level, into background and the different parts of the target, classifying the parts and obtaining the location of each sub-target class; for the location of each sub-target class, rejects the background class and inputs the two adjacent frames into a feature-matching network; and computes the matching degree of sub-target features between frames to perform data association of sub-targets across frames, determine each sub-target's position in the current frame, and output the target's motion trajectory. By using semantic segmentation to distinguish the tracked target from the background at pixel level, the invention feeds background-free target features into the network, effectively reducing the influence of multi-target interaction and improving the precision and performance of multi-object tracking.
Description
Technical field
The present invention relates to the technical field of computer vision, and in particular to a multi-object tracking method and system based on semantic segmentation.
Background technique
In recent years, deep-learning methods have made breakthroughs in computer vision tasks, and computer vision technology has developed rapidly, finding wide application in industries and fields such as intelligent surveillance, human-computer interaction, virtual and augmented reality, and medical image analysis.
Object tracking is a classical computer vision task. The regions of interest obtained by tracking are the basis for further high-level visual analysis and for applications such as intelligent surveillance, human-computer interaction, robot navigation, autonomous driving, virtual and augmented reality, and medical image analysis; the accuracy of tracking therefore directly affects the performance of the overall computer vision system.
The motion-trajectory association problem in multi-object tracking is extremely complex. Because of interaction between background and targets, the task is frequently plagued by false detections and by difficulties such as targets that are too small, low frame rates, and changes in viewing angle and scale. Moreover, because targets appear similar to one another, frequent inter-target interaction and occlusion, together with background interference, severely degrade tracking precision and performance.
Therefore, to address the background-interference problem in the prior art, a multi-object tracking method and system based on semantic segmentation are needed.
Summary of the invention
One aspect of the present invention provides a multi-object tracking method based on semantic segmentation, the method comprising:
reading a video or image sequence and localizing the target in each frame with a bounding box;
performing pixel-level segmentation on the bounding box by means of semantic segmentation, dividing the bounding box into background and the different parts of the target, classifying the parts, and obtaining the location of each sub-target class;
for the location of each sub-target class, rejecting the background class and inputting the two adjacent frames into a feature-matching network;
computing the matching degree of sub-target features between frames, performing data association of sub-targets across frames, determining each sub-target's position in the current frame, and outputting the target's motion trajectory.
Preferably, each frame of the video or image sequence is passed through a detector, and the detection results are output as the bounding boxes that localize the targets.
Preferably, each frame of the video or image sequence is detected as follows:
the video or image sequence, in tensor form, is processed by a convolutional neural network model to obtain a convolutional feature map;
based on region-of-interest pooling, first candidate regions are generated on the convolutional feature map, and bounding-box regression is applied to the first candidate regions to obtain second candidate regions;
based on region-of-interest pooling, the feature maps of the second candidate regions are extracted and third candidate regions are generated;
a fully connected layer classifies the third candidate regions and applies bounding-box regression again, yielding the target positions defined by the bounding boxes.
Preferably, pixel-level segmentation of the bounding box is performed with an Inception V3 model.
Preferably, bipartite graph matching is used to compute the matching degree of sub-target features between frames, perform data association of sub-targets across frames, and determine each sub-target's position in the current frame.
Another aspect of the present invention provides a multi-object tracking system based on semantic segmentation, the system comprising:
a target detection module for reading a video or image sequence and localizing the target in each frame with a bounding box;
a semantic segmentation module for performing pixel-level segmentation on the bounding box by means of semantic segmentation, dividing the bounding box into background and the different parts of the target, classifying the parts, and obtaining the location of each sub-target class;
a feature-matching network into which, for the location of each sub-target class and after the background class is rejected, the two adjacent frames are input;
wherein the matching degree of sub-target features between frames is computed, data association of sub-targets across frames is performed, each sub-target's position in the current frame is determined, and the target's motion trajectory is output.
Preferably, each frame of the video or image sequence is passed through a detector, and the detection results are output as the bounding boxes that localize the targets.
Preferably, each frame of the video or image sequence is detected as follows:
the video or image sequence, in tensor form, is processed by a convolutional neural network model to obtain a convolutional feature map;
based on region-of-interest pooling, first candidate regions are generated on the convolutional feature map, and bounding-box regression is applied to the first candidate regions to obtain second candidate regions;
based on region-of-interest pooling, the feature maps of the second candidate regions are extracted and third candidate regions are generated;
a fully connected layer classifies the third candidate regions and applies bounding-box regression again, yielding the target positions defined by the bounding boxes.
Preferably, pixel-level segmentation of the bounding box is performed with an Inception V3 model.
Preferably, bipartite graph matching is used to compute the matching degree of sub-target features between frames, perform data association of sub-targets across frames, and determine each sub-target's position in the current frame.
The multi-object tracking method and system based on semantic segmentation provided by the invention use semantic segmentation to distinguish the tracked target from the background at pixel level, so that background-free target features are fed into the network, effectively reducing the influence of multi-target interaction and improving the precision and performance of multi-object tracking.
It should be appreciated that both the foregoing general description and the following detailed description are exemplary and explanatory and should not be taken as limiting the claimed invention.
Detailed description of the invention
The further objects, functions, and advantages of the present invention will be clarified by the following description of its embodiments, with reference to the accompanying drawings, in which:
Fig. 1 schematically shows a flow diagram of a multi-object tracking method based on semantic segmentation according to the invention.
Fig. 2 shows a flow diagram of localizing target positions according to the invention.
Fig. 3 shows a schematic diagram of pixel-level segmentation of a bounding box according to the invention.
Fig. 4 shows a schematic diagram of the Inception V3 model in one embodiment of the invention.
Fig. 5 shows a schematic diagram of data association of sub-targets across frames using bipartite graph matching in one embodiment of the invention.
Fig. 6 shows a structural block diagram of a multi-object tracking system based on semantic segmentation according to the invention.
Specific embodiment
The objects and functions of the present invention, and the methods for achieving them, are explained by reference to exemplary embodiments. The invention, however, is not limited to the embodiments disclosed below and may be realized in different forms; the essence of the specification is merely to help those skilled in the relevant arts comprehensively understand its details.
Embodiments of the invention are described below with reference to the accompanying drawings, in which identical reference numerals denote identical or similar components or steps.
The contents of the present invention are described in detail below through specific embodiments. Fig. 1 shows a flow diagram of a multi-object tracking method based on semantic segmentation; according to an embodiment of the invention, the method comprises the following steps:
Step S101: localize target positions.
A video or image sequence is read, and the target in each frame is localized with a bounding box.
According to an embodiment of the invention, each frame of the video or image sequence is passed through a detector, and the detection results are output as the bounding boxes that localize the targets.
In some embodiments, as shown in the flow diagram of Fig. 2, each frame of the video or image sequence is detected as follows:
S201: the video or image sequence, in tensor form, is processed by a convolutional neural network model to obtain a convolutional feature map.
S202: based on region-of-interest pooling, first candidate regions are generated on the convolutional feature map, and bounding-box regression is applied to the first candidate regions to obtain second candidate regions.
S203: based on region-of-interest pooling, the feature maps of the second candidate regions are extracted and third candidate regions are generated.
S204: a fully connected layer classifies the third candidate regions and applies bounding-box regression again, yielding the target positions defined by the bounding boxes.
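Steps S201-S204 describe a two-stage detector built around region-of-interest pooling. As an illustration of the pooling operation these steps rely on, the following is a minimal NumPy sketch; the fixed 2x2 output grid and the max-pooling rule are illustrative assumptions, not specifics taken from the patent:

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool the region roi = (x0, y0, x1, y1) of an (H, W, C)
    feature map into a fixed output_size grid of cells."""
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1, :]
    out_h, out_w = output_size
    # Bin edges that partition the region into out_h x out_w cells.
    h_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    w_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    pooled = np.zeros((out_h, out_w, feature_map.shape[2]))
    for i in range(out_h):
        for j in range(out_w):
            cell = region[h_edges[i]:h_edges[i + 1],
                          w_edges[j]:w_edges[j + 1], :]
            pooled[i, j] = cell.max(axis=(0, 1))  # max over each cell
    return pooled
```

Because the output grid is fixed, candidate regions of any size yield a feature tensor of the same shape, which is what allows a fully connected layer to classify them in S204.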
Step S102: perform pixel-level segmentation on the bounding boxes.
Pixel-level segmentation is applied to each bounding box by means of semantic segmentation, dividing the bounding box into background and the different parts of the target, classifying the parts, and obtaining the location of each sub-target class.
Taking pedestrians as an example, a detected pedestrian bounding box contains, besides the target, a large amount of interfering background. To avoid background interference in the tracking result, pixel-level segmentation is applied to the pedestrian (target) bounding box, dividing the detected pedestrian bounding box into background and the different parts of the human body. Fig. 3 shows a schematic diagram of pixel-level segmentation of a bounding box according to the invention: after segmentation, the bounding box is divided into five classes, namely background, head, upper body, lower body, and shoes; the head, upper body, lower body, and shoes are the sub-targets of the tracked target.
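Given a per-pixel label map for the five classes, rejecting the background and locating each sub-target can be sketched as follows; the integer class ids are hypothetical, since the patent does not specify an encoding:

```python
import numpy as np

# Hypothetical label ids for the five classes the segmentation produces.
BACKGROUND, HEAD, UPPER_BODY, LOWER_BODY, SHOES = 0, 1, 2, 3, 4

def reject_background(box_pixels, label_map):
    """Zero out pixels labelled as background, keeping only the
    sub-target (head / upper body / lower body / shoes) pixels."""
    mask = (label_map != BACKGROUND)
    return box_pixels * mask[..., None]

def sub_target_boxes(label_map):
    """Return the tight bounding box (x0, y0, x1, y1) of each
    non-background class present in the label map."""
    boxes = {}
    for cls in (HEAD, UPPER_BODY, LOWER_BODY, SHOES):
        ys, xs = np.nonzero(label_map == cls)
        if len(ys):
            boxes[cls] = (xs.min(), ys.min(), xs.max() + 1, ys.max() + 1)
    return boxes
```

The masked pixels and per-class boxes correspond to "the location of each sub-target class" that the method passes on to the feature-matching network.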
In the present embodiment, pixel-level segmentation of the bounding box is performed with an Inception V3 model.
Fig. 4 shows a schematic diagram of the Inception V3 model in one embodiment of the invention; semantic segmentation is performed based on Inception V3 modules. Inception V3 is a network with an excellent local topology: it applies several convolution or pooling operations to the input image in parallel and concatenates all the outputs into a very deep feature map. Because 1x1, 3x3, and 5x5 convolutions and the pooling operation extract different information from the input image, processing them in parallel and combining their results yields a better image representation. In the present embodiment, a 3x3 convolution can further be split into a 1x3 and a 3x1 convolution, which saves parameters.
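The parameter saving from splitting a 3x3 convolution into a 1x3 and a 3x1 convolution is easy to verify by counting convolution weights; the channel width of 64 below is an arbitrary illustrative choice:

```python
def conv_params(k_h, k_w, c_in, c_out):
    """Weight count of a k_h x k_w convolution (bias terms ignored)."""
    return k_h * k_w * c_in * c_out

c = 64
full = conv_params(3, 3, c, c)                                 # one 3x3 conv
factored = conv_params(1, 3, c, c) + conv_params(3, 1, c, c)   # 1x3 then 3x1
print(full, factored)  # 36864 24576: the factored pair needs 2/3 of the weights
```

The ratio 6/9 = 2/3 holds for any channel width, which is why the factorisation saves parameters regardless of layer size.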
Step S103: perform data association of sub-targets across frames, determine each sub-target's position in the current frame, and output the target's motion trajectory.
For the location of each sub-target class, the background class is rejected and the two adjacent frames are input into the feature-matching network. Semantic segmentation yields the sub-target classes; once the background class information is removed, the sub-target features are cleaner, and feeding these images into the feature-matching network effectively enhances the robustness of the algorithm.
The matching degree of sub-target features between frames is computed, data association of sub-targets across frames is performed, each sub-target's position in the current frame is determined, and the target's motion trajectory is output.
For example, in the embodiment above, after the background class is rejected, the two adjacent frames of the tracked target are input into the feature-matching network. The matching degrees of the head, upper-body, lower-body, and shoes sub-targets between frames are computed, and the sub-targets are associated across frames: the head sub-target of the first frame is associated with the head sub-target of the second frame; ...; the shoes sub-target of the first frame is associated with the shoes sub-target of the second frame.
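The class-by-class association just described can be sketched as follows. The cosine-similarity score and the nearest-feature choice are illustrative assumptions, since the patent specifies only that a matching degree is computed per sub-target class:

```python
import numpy as np

PART_CLASSES = ("head", "upper_body", "lower_body", "shoes")  # sub-target classes

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def associate_parts(frame1_feats, frame2_feats):
    """Associate each sub-target of frame 1 with the same-class sub-target
    of frame 2 that has the highest feature similarity.
    frameN_feats: {class_name: [feature vectors, one per target]}."""
    links = {}
    for cls in PART_CLASSES:
        for i, f1 in enumerate(frame1_feats.get(cls, [])):
            scores = [cosine(f1, f2) for f2 in frame2_feats.get(cls, [])]
            if scores:
                links[(cls, i)] = int(np.argmax(scores))
    return links
```

Note that heads are only ever compared with heads, shoes with shoes, and so on, mirroring the per-class association in the embodiment.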
According to an embodiment of the invention, bipartite graph matching is used to compute the matching degree of sub-target features between frames, perform data association of sub-targets across frames, and determine each sub-target's position in the current frame. Fig. 5 shows a schematic diagram of data association of sub-targets across frames using bipartite graph matching in one embodiment of the invention.
In some embodiments, bipartite graph matching is carried out as follows:
Input the adjacent frames $I_1$ and $I_2$ and construct graph models $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$, where $|V_1| = n$ and $|V_2| = m$ are the numbers of sub-targets detected in the two frames and $E_1$, $E_2$ are the edge sets of the two graph models.
Define an indicator vector $v \in \{0,1\}^{nm \times 1}$ such that $v_{ia} = 1$ when node $i \in V_1$ (sub-target $i$) matches node $a \in V_2$ (sub-target $a$).
Build a symmetric positive matrix $M \in \mathbb{R}^{nm \times nm}$ whose entries $M_{ij,ab}$ measure, for each pair of sub-targets, the compatibility of the edge sets of the two graph models, i.e. of $(i, j) \in E_1$ with $(a, b) \in E_2$; here $\mathbb{R}^{nm \times nm}$ indicates that every element of the matrix is real, and $M_{ij,ab}$ denotes the corresponding entry of $M$.
For pairs that do not form edges, the corresponding entries of the matrix are set to 0. The optimal assignment $v^*$ can then be expressed as
$v^* = \arg\max_v v^T M v$, subject to $Cv = 1$, $v \in \{0,1\}^{nm \times 1}$ (1)
where $\arg\max$ denotes the maximizing argument.
The matrix $C$ constrains the matching to be one-to-one, that is,
$\sum_a v_{ia} = 1$ for every $i$, and $\sum_i v_{ia} = 1$ for every $a$. (2)
Relaxing the binary constraint, (1) is converted into (3) and solved:
$v^* = \arg\max_v v^T M v$, subject to $\|v\|_2 = 1$ (3)
The optimal $v^*$ can then be obtained from the leading eigenvector of the matrix $M$; $v^*_{ia}$ is the confidence that node $i$ (sub-target $i$) matches node $a$ (sub-target $a$). From the computed matching confidences the final matching result is obtained, data association of the sub-targets across frames is realized, and the current position of each sub-target follows.
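The relaxed problem (3) can be solved in a few lines of NumPy. The sketch below approximates the leading eigenvector of the affinity matrix by power iteration and then discretises it greedily into a one-to-one matching; the greedy discretisation step is an illustrative assumption, since the patent states only that the final matching is obtained from the confidences:

```python
import numpy as np

def spectral_match(M, n, m, iters=50):
    """Approximate the leading eigenvector of the nm x nm affinity matrix
    M by power iteration, then greedily discretise it into a one-to-one
    assignment. Candidate (i, a) is stored at flat index i * m + a."""
    v = np.ones(n * m) / np.sqrt(n * m)
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)          # keep ||v||_2 = 1
    S = v.reshape(n, m)                 # S[i, a]: confidence i matches a
    match, used = {}, set()
    # Accept pairs in decreasing confidence, skipping already-used nodes.
    for i, a in sorted(np.ndindex(n, m), key=lambda ia: -S[ia]):
        if i not in match and a not in used:
            match[i] = a
            used.add(a)
    return match, S
```

For two frames with two sub-targets each, an affinity matrix that rewards the cross assignment produces the matching {0: 1, 1: 0}, i.e. sub-target 0 of the first frame is associated with sub-target 1 of the second.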
In other embodiments, bipartite graph matching can also be carried out as follows:
A. Deep feature extraction (Deep Feature Extractor):
For the adjacent frames $I_1$ and $I_2$, target features $U_1$, $U_2$, $F_1$, $F_2$ are extracted from different layers of a feature-extraction network.
B. Affinity matrix calculation (Affinity Matrix Calculation):
From the known connectivity matrices $A_1$, $A_2$ of the graphs, the matrices $G_1$, $G_2$, $H_1$, $H_2$ are obtained by decomposition, with $A_1 = G_1 H_1^T$ and $A_2 = G_2 H_2^T$.
The features of the corresponding edges are defined as $X$, $Y$, obtained by concatenation through these matrices.
$M_p$ denotes the affinity between the nodes of the two graphs, i.e. the similarity between two points, and is obtained from the inner product of the point features:
$M_p = U_1 U_2^T$
$M_e$ denotes the affinity of two edges appearing simultaneously in the two graphs and is therefore obtained from the inner product of the edge features, where $\Lambda$ is a parameter to be learned:
$M_e = X \Lambda Y^T$
The matrix $M$ can then be assembled as
$M = \mathrm{diag}(\mathrm{vec}(M_p)) + (G_2 \otimes G_1)\,\mathrm{diag}(\mathrm{vec}(M_e))\,(H_2 \otimes H_1)^T$
where $\mathrm{vec}(\cdot)$ stacks a matrix into a vector column by column and $\mathrm{diag}(\cdot)$ turns a vector into a diagonal matrix.
C. Power iteration (Power Iteration):
The leading eigenvector of the matrix can be approximated by power iteration, with the iteration
$v_{k+1} = M v_k / \|M v_k\|_2$
After $v^*$ is obtained, its rows are $\ell_1$-normalized to give the initial matrix $S_0 = (v^*)_{n \times m}$.
D. Loss function (Loss Function):
The gap between the predicted position and the true position is used as the loss, defined as
$L = \sum_i \| d_i - d_i^{gt} \|$
where the predicted displacement $d_i$ is a weighted average of the candidate offsets, with the weights normalized by a softmax, and $d_i^{gt}$ denotes the offset from start point to end point in the ground-truth match.
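The assembled affinity matrix of step B can be sketched as follows. Since the patent text omits the assembled formula itself, the factorisation below follows Zanfir and Sminchisescu's "Deep Learning of Graph Matching" (CVPR 2018), which the surrounding notation matches, and should be read as an assumption rather than the patent's own statement:

```python
import numpy as np

def build_affinity(Mp, Me, G1, G2, H1, H2):
    """Factorised graph-matching affinity
        M = diag(vec(Mp)) + (G2 kron G1) diag(vec(Me)) (H2 kron H1)^T
    with vec(.) taken column-wise. G, H are node-edge incidence matrices
    of the two graphs, so that A = G @ H.T recovers each adjacency."""
    node_part = np.diag(Mp.flatten(order="F"))      # vec() is column-major
    edge_part = (np.kron(G2, G1)
                 @ np.diag(Me.flatten(order="F"))
                 @ np.kron(H2, H1).T)
    return node_part + edge_part

# Two tiny graphs, each with two nodes and one directed edge 0 -> 1.
G1 = np.array([[1.0], [0.0]])   # edge starts at node 0
H1 = np.array([[0.0], [1.0]])   # edge ends at node 1
M = build_affinity(0.5 * np.eye(2), np.array([[1.0]]),
                   G1, G1.copy(), H1, H1.copy())
```

In this toy example the single off-diagonal entry of M rewards exactly the assignment that matches node 0 with node 0 and node 1 with node 1, the only assignment under which the two edges coincide.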
The present invention also provides a multi-object tracking system based on semantic segmentation. Fig. 6 shows a structural block diagram of such a system; according to an embodiment of the invention, a multi-object tracking system based on semantic segmentation comprises:
Target detection module 100: reads a video or image sequence and localizes the target in each frame with a bounding box.
In some embodiments, each frame of the video or image sequence is passed through a detector, and the detection results are output as the bounding boxes that localize the targets. Each frame of the video or image sequence is detected as follows:
the video or image sequence, in tensor form, is processed by a convolutional neural network model to obtain a convolutional feature map;
based on region-of-interest pooling, first candidate regions are generated on the convolutional feature map, and bounding-box regression is applied to the first candidate regions to obtain second candidate regions;
based on region-of-interest pooling, the feature maps of the second candidate regions are extracted and third candidate regions are generated;
a fully connected layer classifies the third candidate regions and applies bounding-box regression again, yielding the target positions defined by the bounding boxes.
Semantic segmentation module 200: performs pixel-level segmentation on each bounding box by means of semantic segmentation, dividing the bounding box into background and the different parts of the target, classifying the parts, and obtaining the location of each sub-target class.
According to an embodiment of the invention, pixel-level segmentation of the bounding box is performed with an Inception V3 model.
Feature-matching network 300: for the location of each sub-target class, the background class is rejected and the two adjacent frames are input into the feature-matching network.
The matching degree of sub-target features between frames is computed, data association of sub-targets across frames is performed, each sub-target's position in the current frame is determined, and finally the sub-target's motion trajectory is output.
According to an embodiment of the invention, bipartite graph matching is used to compute the matching degree of sub-target features between frames, perform data association of sub-targets across frames, and determine each sub-target's position in the current frame.
The multi-object tracking method and system based on semantic segmentation provided by the invention use semantic segmentation to distinguish the tracked target from the background at pixel level, so that background-free target features are fed into the network, effectively reducing the influence of multi-target interaction and improving the precision and performance of multi-object tracking.
Other embodiments of the invention will be readily apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed here. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the invention being defined by the claims.
Claims (10)
1. A multi-object tracking method based on semantic segmentation, characterized in that the method comprises:
reading a video or image sequence and localizing the target in each frame with a bounding box;
performing pixel-level segmentation on the bounding box by means of semantic segmentation, dividing the bounding box into background and the different parts of the target, classifying the parts, and obtaining the location of each sub-target class;
for the location of each sub-target class, rejecting the background class and inputting the two adjacent frames into a feature-matching network;
computing the matching degree of sub-target features between frames, performing data association of sub-targets across frames, determining each sub-target's position in the current frame, and outputting the target's motion trajectory.
2. The method according to claim 1, characterized in that each frame of the video or image sequence is passed through a detector, and the detection results are output as the bounding boxes that localize the targets.
3. The method according to claim 2, characterized in that each frame of the video or image sequence is detected as follows:
the video or image sequence, in tensor form, is processed by a convolutional neural network model to obtain a convolutional feature map;
based on region-of-interest pooling, first candidate regions are generated on the convolutional feature map, and bounding-box regression is applied to the first candidate regions to obtain second candidate regions;
based on region-of-interest pooling, the feature maps of the second candidate regions are extracted and third candidate regions are generated;
a fully connected layer classifies the third candidate regions and applies bounding-box regression again, yielding the target positions defined by the bounding boxes.
4. The method according to claim 1, characterized in that pixel-level segmentation of the bounding box is performed with an Inception V3 model.
5. The method according to claim 1, characterized in that bipartite graph matching is used to compute the matching degree of sub-target features between frames, perform data association of sub-targets across frames, and determine each sub-target's position in the current frame.
6. A multi-object tracking system based on semantic segmentation, characterized in that the system comprises:
a target detection module for reading a video or image sequence and localizing the target in each frame with a bounding box;
a semantic segmentation module for performing pixel-level segmentation on the bounding box by means of semantic segmentation, dividing the bounding box into background and the different parts of the target, classifying the parts, and obtaining the location of each sub-target class;
a feature-matching network into which, for the location of each sub-target class and after the background class is rejected, the two adjacent frames are input;
wherein the matching degree of sub-target features between frames is computed, data association of sub-targets across frames is performed, each sub-target's position in the current frame is determined, and the target's motion trajectory is output.
7. The system according to claim 6, characterized in that each frame of the video or image sequence is passed through a detector, and the detection results are output as the bounding boxes that localize the targets.
8. The system according to claim 6, characterized in that each frame of the video or image sequence is detected as follows:
the video or image sequence, in tensor form, is processed by a convolutional neural network model to obtain a convolutional feature map;
based on region-of-interest pooling, first candidate regions are generated on the convolutional feature map, and bounding-box regression is applied to the first candidate regions to obtain second candidate regions;
based on region-of-interest pooling, the feature maps of the second candidate regions are extracted and third candidate regions are generated;
a fully connected layer classifies the third candidate regions and applies bounding-box regression again, yielding the target positions defined by the bounding boxes.
9. The system according to claim 6, characterized in that pixel-level segmentation of the bounding box is performed with an Inception V3 model.
10. The system according to claim 6, characterized in that bipartite graph matching is used to compute the matching degree of sub-target features between frames, perform data association of sub-targets across frames, and determine each sub-target's position in the current frame.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910444189.6A CN110298248A (en) | 2019-05-27 | 2019-05-27 | A kind of multi-object tracking method and system based on semantic segmentation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110298248A (en) | 2019-10-01 |
Family
ID=68027226
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910444189.6A Pending CN110298248A (en) | 2019-05-27 | 2019-05-27 | A kind of multi-object tracking method and system based on semantic segmentation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110298248A (en) |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101464946A (en) * | 2009-01-08 | 2009-06-24 | 上海交通大学 | Detection method based on head identification and tracking characteristics |
CN105745687A (en) * | 2012-01-06 | 2016-07-06 | 派尔高公司 | Context aware moving object detection |
CN106688011A (en) * | 2014-09-10 | 2017-05-17 | 北京市商汤科技开发有限公司 | Method and system for multi-class object detection |
CN106845373A (en) * | 2017-01-04 | 2017-06-13 | 天津大学 | Towards pedestrian's attribute forecast method of monitor video |
CN107341446A (en) * | 2017-06-07 | 2017-11-10 | 武汉大千信息技术有限公司 | Specific pedestrian's method for tracing and system based on inquiry self-adaptive component combinations of features |
CN107766791A (en) * | 2017-09-06 | 2018-03-06 | 北京大学 | A kind of pedestrian based on global characteristics and coarseness local feature recognition methods and device again |
CN109035293A (en) * | 2018-05-22 | 2018-12-18 | 安徽大学 | Method suitable for segmenting remarkable human body example in video image |
CN109145713A (en) * | 2018-07-02 | 2019-01-04 | 南京师范大学 | A kind of Small object semantic segmentation method of combining target detection |
WO2019007524A1 (en) * | 2017-07-06 | 2019-01-10 | Toyota Motor Europe | Tracking objects in sequences of digital images |
CN109214346A (en) * | 2018-09-18 | 2019-01-15 | 中山大学 | Picture human motion recognition method based on hierarchical information transmitting |
CN109255351A (en) * | 2018-09-05 | 2019-01-22 | 华南理工大学 | Bounding box homing method, system, equipment and medium based on Three dimensional convolution neural network |
CN109460702A (en) * | 2018-09-14 | 2019-03-12 | 华南理工大学 | Passenger's abnormal behaviour recognition methods based on human skeleton sequence |
CN109740537A (en) * | 2019-01-03 | 2019-05-10 | 广州广电银通金融电子科技有限公司 | The accurate mask method and system of pedestrian image attribute in crowd's video image |
Non-Patent Citations (4)
Title |
---|
A. ZANFIR et al.: "Deep Learning of Graph Matching", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition * |
M. M. KALAYEH et al.: "Human Semantic Parsing for Person Re-identification", DOI: 10.1109/CVPR.2018.00117 * |
ZHONG, JINQIN et al.: "Multi-Targets Tracking Based On Bipartite Graph Matching", Cybernetics and Information Technologies * |
ZHOU Liang et al.: "Research on a Multi-Object Tracking Method Based on a Tracking-Association Module", Journal of Southwest University (Natural Science Edition) * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111028154A (en) * | 2019-11-18 | 2020-04-17 | 哈尔滨工程大学 | Rough-terrain seabed side-scan sonar image matching and splicing method |
CN111028154B (en) * | 2019-11-18 | 2023-05-09 | 哈尔滨工程大学 | Side-scan sonar image matching and stitching method for rugged seafloor |
CN111179311A (en) * | 2019-12-23 | 2020-05-19 | 全球能源互联网研究院有限公司 | Multi-target tracking method and device and electronic equipment |
CN111428566A (en) * | 2020-02-26 | 2020-07-17 | 沈阳大学 | Deformation target tracking system and method |
CN111428566B (en) * | 2020-02-26 | 2023-09-01 | 沈阳大学 | Deformation target tracking system and method |
WO2022068522A1 (en) * | 2020-09-30 | 2022-04-07 | 华为技术有限公司 | Target tracking method and electronic device |
CN113643330A (en) * | 2021-10-19 | 2021-11-12 | 青岛根尖智能科技有限公司 | Target tracking method and system based on dynamic semantic features |
CN117876428A (en) * | 2024-03-12 | 2024-04-12 | 金锐同创(北京)科技股份有限公司 | Target tracking method, device, computer equipment and medium based on image processing |
CN117876428B (en) * | 2024-03-12 | 2024-05-17 | 金锐同创(北京)科技股份有限公司 | Target tracking method, device, computer equipment and medium based on image processing |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111259850B (en) | Pedestrian re-identification method integrating random batch mask and multi-scale representation learning | |
CN110298248A (en) | A kind of multi-object tracking method and system based on semantic segmentation | |
Hu et al. | Real-time video fire smoke detection by utilizing spatial-temporal ConvNet features | |
CN109165540B (en) | Pedestrian searching method and device based on prior candidate box selection strategy | |
CN108830188A (en) | Vehicle checking method based on deep learning | |
Xu et al. | Adversarial adaptation from synthesis to reality in fast detector for smoke detection | |
Castro et al. | Evaluation of CNN architectures for gait recognition based on optical flow maps | |
CN104063719A (en) | Method and device for pedestrian detection based on depth convolutional network | |
CN106803265A (en) | Multi-object tracking method based on optical flow method and Kalman filtering | |
CN113420640B (en) | Mangrove hyperspectral image classification method and device, electronic equipment and storage medium | |
CN116342894B (en) | GIS infrared feature recognition system and method based on improved YOLOv5 | |
US20170053172A1 (en) | Image processing apparatus, and image processing method | |
Ibrahem et al. | Real-time weakly supervised object detection using center-of-features localization | |
Abdullah et al. | Vehicle counting using deep learning models: a comparative study | |
Basavaiah et al. | Robust Feature Extraction and Classification Based Automated Human Action Recognition System for Multiple Datasets. | |
CN114283326A (en) | Underwater target re-identification method combining local perception and high-order feature reconstruction | |
Fang et al. | A real-time anti-distractor infrared UAV tracker with channel feature refinement module | |
Wang et al. | A dense-aware cross-splitnet for object detection and recognition | |
CN107886060A (en) | Pedestrian's automatic detection and tracking based on video | |
CN114187546B (en) | Combined action recognition method and system | |
CN113706815B (en) | Vehicle fire identification method combining YOLOv3 and optical flow method | |
Li et al. | Research on hybrid information recognition algorithm and quality of golf swing | |
Lassoued et al. | An efficient approach for video action classification based on 3D Zernike moments | |
Xu et al. | Multiscale edge-guided network for accurate cultivated land parcel boundary extraction from remote sensing images | |
Gao et al. | Research on edge detection and image segmentation of cabinet region based on edge computing joint image detection algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191001 |