CN108564014A - Object shape tracking device and method and image processing system - Google Patents

Object shape tracking device and method and image processing system

Info

Publication number
CN108564014A
Authority
CN
China
Prior art keywords
video frame
shape
object shape
video
occlusion information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810288618.0A
Other languages
Chinese (zh)
Other versions
CN108564014B (en)
Inventor
陈存建
黄耀海
赵东悦
金浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Publication of CN108564014A
Application granted
Publication of CN108564014B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an object shape tracking device and method and an image processing system. The device for tracking an object shape includes: a unit configured to determine the object shape in a current video frame based on the object shape in at least one previous video frame; a unit configured to determine occlusion information of the determined object shape based on the occlusion information of the object shape in the at least one previous video frame; a unit configured to update the determined object shape based on the determined occlusion information; and a unit configured to update the determined occlusion information based on the updated object shape. According to the present invention, when the object whose shape is being tracked in a video is occluded by other objects, the accuracy of the object shape and the accuracy of object tracking are improved.

Description

Object shape tracking device and method and image processing system
Technical field
The present invention relates to image processing, and more particularly, to a device and method for tracking an object shape, and to an image processing system.
Background technology
When tracking an object (in particular, an object shape) in a video, such as tracking a face or the human-body joints in the video, in order to obtain the object shape in one video frame (for example, the current video frame) more accurately, the object shape determined from a previous video frame of that frame is usually used to initialize the object's initial shape in that frame. The final object shape in the frame can then be determined based on the initialized initial shape.
An example technique is disclosed in "Facial Shape Tracking via Spatio-Temporal Cascade Shape Regression" (J. Yang, J. Deng, K. Zhang, and Q. Liu, the IEEE International Conference on Computer Vision (ICCV) Workshops, 2015, pp. 41-49). This technique essentially performs the following procedure: for the current video frame of a video, the object shape determined from the previous video frame is first taken as the object's initial shape in the current frame; then, a shape regression method (for example, the Cascaded Shape Regression (CSR) method) is applied to the initial shape to determine the final object shape in the current frame. These steps are repeated until the end of the video is reached.
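As a rough Python sketch of this prior-art loop (the `detect` and `regress` callables are hypothetical stand-ins, not code from the cited paper):

```python
# Sketch of the prior-art tracking loop: each frame's result becomes the
# next frame's initial shape, refined by a shape regression method.
def track_shapes(frames, detect, regress):
    """detect(frame) -> shape in the first frame;
    regress(frame, initial_shape) -> refined shape, e.g. via CSR."""
    shapes = [detect(frames[0])]                 # shape in the first video frame
    for frame in frames[1:]:
        initial = shapes[-1]                     # previous result as initial shape
        shapes.append(regress(frame, initial))   # cascaded shape regression step
    return shapes
```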
In other words, while tracking the object shape in a video, the shape determined from each previous video frame is propagated to the subsequent frame to determine the corresponding initial shape. That is, the accuracy of the shape determined from a previous frame directly affects the accuracy of the shapes to be determined for subsequent frames, and even for the entire video. However, when the corresponding shape in a video frame is determined according to the above technique, only the shape determined from the previous frame of that frame is taken into account; no other information is considered. Therefore, in the case where the object in the video is occluded by another object (such as a mask, sunglasses, a scarf, a microphone, a hand, or another person), the influence of the occlusion is ignored while determining the corresponding shape in each frame, which makes the resulting final shape inaccurate. In other words, when an object (especially an object shape) in a video is tracked according to the above technique and the object is occluded by other objects, the occlusion degrades the accuracy of the tracking result for a single video frame and even for the entire video.
Summary of the invention
Therefore, in view of the above description of the background art, the present disclosure aims to solve the problems described above.
According to an aspect of the present invention, there is provided a device for tracking an object shape in a video, the device including: a shape determining unit configured to determine the object shape in the current video frame of an input video based on the object shape in at least one previous video frame of the current frame; an information determining unit configured to determine occlusion information of the shape determined by the shape determining unit, based on the occlusion information of the object shape in the at least one previous video frame; a shape updating unit configured to update the shape determined by the shape determining unit, based on the occlusion information determined by the information determining unit; and an information updating unit configured to update the occlusion information determined by the information determining unit, based on the shape updated by the shape updating unit. Herein, for any video frame of the input video, the occlusion information of the object shape in that frame labels each feature point of the shape as either an occluded feature point or a non-occluded feature point.
With the present invention, when tracking an object (in particular, an object shape) in a video and the object is occluded by other objects (such as a mask, sunglasses, a scarf, a microphone, a hand, or another person), both the accuracy of the object shape and the accuracy of object tracking are improved.
Further features and advantages of the present invention will become apparent from the following description with reference to the drawings.
Description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the present invention and, together with the written description, serve to explain the principles of the invention.
Figs. 1A to 1B schematically show example objects occluded by other objects in a video.
Fig. 2 is a block diagram schematically showing a hardware configuration that can implement the technique according to an embodiment of the present invention.
Fig. 3 is a block diagram illustrating the configuration of an object shape tracking device according to an embodiment of the present invention.
Fig. 4 schematically shows a flowchart of object shape tracking according to an embodiment of the present invention.
Figs. 5A to 5B schematically show an occlusion region detected within an object region.
Fig. 6 schematically shows a flowchart of step S420 of Fig. 4 according to the present invention.
Figs. 7A to 7D schematically show an example of the occlusion information of the object shape in the t-th video frame determined by step S420 shown in Fig. 6.
Fig. 8 illustrates the arrangement of an example image processing system according to the present invention.
Figs. 9A to 9B schematically show two illustrative people in a crowded walk-through scene.
Figs. 10A to 10C schematically show two illustrative people in another crowded walk-through scene.
Detailed description of the embodiments
Exemplary embodiments of the present invention are described in detail below with reference to the drawings. It should be noted that the following description is merely illustrative and exemplary in nature and is in no way intended to limit the present invention or its applications or uses. Unless expressly stated otherwise, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in the embodiments do not limit the scope of the invention. In addition, techniques, methods, and devices known to those skilled in the art may not be discussed in detail, but are intended to be part of this specification where appropriate.
Note that similar reference numerals and letters refer to similar items in the drawings; therefore, once an item is defined in one drawing, it need not be discussed again for subsequent drawings.
As described above, when an object (especially an object shape) in a video is tracked according to the prior art and the object is occluded by another object (such as a mask, sunglasses, a scarf, a microphone, a hand, or another person), the influence of the occlusion is not considered when determining the corresponding shape in a video frame. For example, Fig. 1A schematically shows an exemplary face occluded by a mask, i.e., a face shape occluded by a mask. Fig. 1B schematically shows a person occluded by other people in a crowded walk-through scenario, i.e., one person's shape occluded by another person's shape, where the crowded walk-through scenario is, for example, a scene in which many people walk through an entrance together toward the shooting direction of a camera. In general, when tracking objects in a video according to the prior art, occlusions present in the video lead to inaccurate output shapes (that is, the occlusion affects the accuracy of the object shapes), and also cause the tracked object to be lost or the tracking identification number (tracking ID) of the tracked object to be switched (that is, the occlusion affects the accuracy of object tracking). Therefore, when such occlusions exist in a video, those skilled in the art would usually consider how to remove them as much as possible during object tracking in order to reduce their influence.
However, the inventors found that, when determining the object shape in a video frame while tracking shapes in a video, the occlusion information present in the video can also serve as a useful reference. Therefore, when the object being tracked is occluded by other objects, the present invention does not consider how to remove the existing occlusion, but instead considers how to use the existing occlusion to assist the tracking of the object shape.
Accordingly, when determining the corresponding object shape in a video frame while tracking shapes in a video, in addition to propagating the shape determined from the previous frame to the current frame, the present invention also propagates the occlusion information determined from the previous frame. The propagated shape is used to determine the initial object shape in the current frame, and the propagated occlusion information is used to determine the occlusion information of the determined initial shape. Herein, for any video frame of the video, the occlusion information of the object shape in the frame labels each feature point of the shape as either an occluded feature point or a non-occluded feature point. The feature points of an object shape are its landmark points, for example, facial feature points or human-joint feature points.
To determine the final object shape in a video frame, based on the occlusion information corresponding to the initial shape in that frame, the positions of the occluded portion of the initial shape (i.e., the occluded feature points) and the positions of the non-occluded portion (i.e., the non-occluded feature points) can be updated with different methods. Moreover, after the final shape in the frame is determined, the occlusion information corresponding to the initial shape is updated based on that final shape, so that the occlusion information propagated to subsequent frames is more accurate. Therefore, on the one hand, the influence of the existing occlusion is avoided when determining the final positions of the non-occluded portion, which makes those positions more accurate. On the other hand, the final positions of the occluded portion are determined using the more accurate positions of the non-occluded portion, which minimizes the influence of the existing occlusion on the accuracy of the occluded positions. Accordingly, when an object whose shape is being tracked in a video is occluded by other objects, the present invention improves the accuracy of the object shapes and of the tracking results for a single frame and even for the entire video.
(Hardware configuration)
First, a hardware configuration that can implement the techniques described hereinafter will be described with reference to Fig. 2.
The hardware configuration 200 includes, for example, a central processing unit (CPU) 210, a random access memory (RAM) 220, a read-only memory (ROM) 230, a hard disk 240, an input device 250, an output device 260, a network interface 270, and a system bus 280. In addition, the hardware configuration 200 can be implemented by, for example, a camera, a personal digital assistant (PDA), a mobile phone, a tablet computer, a laptop, a desktop computer, or another suitable electronic device.
In a first implementation, the process of tracking object shapes in a video according to the present invention is configured by hardware or firmware and serves as a module or component of the hardware configuration 200. For example, the device 300 described in detail below with reference to Fig. 3 serves as such a module or component. In a second implementation, the process is configured by software that is stored in the ROM 230 or the hard disk 240 and executed by the CPU 210. For example, the process 400 described in detail below with reference to Fig. 4 serves as a program stored in the ROM 230 or the hard disk 240.
The CPU 210 is any suitable programmable control device (such as a processor) and can execute the various functions described hereinafter by executing various application programs stored in the ROM 230 or the hard disk 240 (such as a memory). The RAM 220 temporarily stores programs or data loaded from the ROM 230 or the hard disk 240, and also serves as the workspace in which the CPU 210 executes various processes (such as implementing the techniques described in detail below with reference to Figs. 4 and 6) and other available functions. The hard disk 240 stores various kinds of information, such as an operating system (OS), various applications, control programs, data pre-stored or pre-defined by the manufacturer, and models and/or classifiers pre-stored or pre-generated by the manufacturer.
In one implementation, the input device 250 allows the user to interact with the hardware configuration 200. In one example, the user can input images/videos/data through the input device 250. In another example, the user can trigger the corresponding process of the present invention through the input device 250. The input device 250 can take various forms, such as a button, a keyboard, or a touch screen. In another implementation, the input device 250 receives images/videos output from dedicated electronic devices such as a digital camera, a video camera, and/or a network camera.
In one implementation, the output device 260 displays the tracking result (such as the bounding box of a tracked object, the shape of a tracked object, or the occlusion relationship between two tracked objects) to the user. The output device 260 can take various forms, such as a cathode-ray tube (CRT) or a liquid crystal display. In another implementation, the output device 260 outputs the tracking result to subsequent video/image analysis and recognition processes, such as face analysis, portrait retrieval, expression recognition, face recognition, and facial attribute recognition.
The network interface 270 provides an interface for connecting the hardware configuration 200 to a network. For example, the hardware configuration 200 can perform data communication with other electronic devices connected via the network through the network interface 270. Optionally, a wireless interface can be provided for the hardware configuration 200 to perform wireless data communication. The system bus 280 can provide a data transmission path for transmitting data among the CPU 210, the RAM 220, the ROM 230, the hard disk 240, the input device 250, the output device 260, the network interface 270, and so on. Although called a bus, the system bus 280 is not limited to any specific data transmission technology.
The above hardware configuration 200 is merely illustrative and is in no way intended to limit the invention, its applications, or uses. Moreover, for brevity, only one hardware configuration is shown in Fig. 2; multiple hardware configurations can also be used as needed.
(Object shape tracking device and method)
Next, the process of tracking object shapes in a video according to the present invention will be described with reference to Figs. 3 to 7D.
Fig. 3 is a block diagram illustrating the configuration of the device 300 according to an embodiment of the present invention. Some or all of the modules shown in Fig. 3 can be implemented by dedicated hardware. As shown in Fig. 3, the device 300 includes a shape determining unit 310, an information determining unit 320, a shape updating unit 330, and an information updating unit 340.
First, the input device 250 shown in Fig. 2 receives a video output from a dedicated electronic device (for example, a camera) or input by a user. The input device 250 then transmits the received video to the device 300 via the system bus 280.
Then, as shown in Fig. 3, for the object in the current video frame (such as the t-th video frame) of the received video (i.e., the input video), the shape determining unit 310 determines the object shape in the current frame based on the object shape in at least one previous video frame of the current frame, where t is a natural number, 2 ≤ t < T, and T is the total number of video frames of the input video. In other words, the shape determining unit 310 determines the initial object shape in the current frame based on at least one shape propagated from the previous frames. The object shape to be tracked is, for example, a face shape or a human-joint shape.
The information determining unit 320 determines the occlusion information of the shape determined by the shape determining unit 310, based on the occlusion information of the object shape in the at least one previous video frame. In other words, the information determining unit 320 determines the occlusion information of the initial shape in the current frame based on the occlusion information propagated from the previous frames.
The shape updating unit 330 updates the shape determined by the shape determining unit 310, based on the occlusion information determined by the information determining unit 320. In other words, the shape updating unit 330 determines the final object shape in the current frame by updating the initial shape based on the occlusion information of that initial shape.
The information updating unit 340 updates the occlusion information determined by the information determining unit 320, based on the shape updated by the shape updating unit 330. In other words, the information updating unit 340 updates the occlusion information of the initial shape based on the final object shape in the current frame.
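The cooperation of the four units on one frame can be summarized by the following sketch; the function names and signatures are illustrative assumptions rather than the patent's API:

```python
# One per-frame pass of the device 300: shape and occlusion information are
# first propagated from previous frames, then updated against the new frame.
def process_frame(frame, prev_shapes, prev_occ,
                  shape_init, occ_init, shape_update, occ_update):
    init_shape = shape_init(prev_shapes)                     # shape determining unit
    init_occ = occ_init(prev_occ)                            # information determining unit
    final_shape = shape_update(frame, init_shape, init_occ)  # shape updating unit
    final_occ = occ_update(frame, final_shape, init_occ)     # information updating unit
    return final_shape, final_occ
```

Each helper corresponds to one unit, so the per-frame data flow (shape in, occlusion in, shape out, occlusion out) is explicit.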
That is, for the t-th video frame of the input video (where t ≥ 2), the device 300 uses the shape information and occlusion information determined from the previous frames of the t-th frame to determine the corresponding shape information and occlusion information in the t-th frame. In addition, in order to trigger the shape tracking process and determine the corresponding shape information and occlusion information in the first video frame of the input video (i.e., the 1st frame), the device 300 further includes a detecting unit 350.
As shown in Fig. 3, for the 1st video frame of the input video, the detecting unit 350 detects the corresponding object shape in the 1st frame and the corresponding occlusion information of the detected shape. Then, for the 2nd video frame of the input video, the shape determining unit 310 and the information determining unit 320 perform the corresponding operations based on the shape information and occlusion information detected from the 1st frame.
As described above, for one input video, the detecting unit 350 detects the corresponding shape information and occlusion information only from the 1st video frame of that video. In addition, in order to prevent tracking errors caused by the accumulation of positional offsets of the object shapes over the entire input video, and to improve the accuracy of shape tracking, several video frame sequences can first be obtained from the entire input video. Then, for the 1st frame of each sequence, the detecting unit 350 performs the corresponding operation, and for the t-th frame of each sequence (where t ≥ 2), the shape determining unit 310, the information determining unit 320, the shape updating unit 330, and the information updating unit 340 perform the corresponding operations. In one example, the detecting unit 350 is also used to obtain the corresponding sequences from the entire input video. In another example, another unit (for example, a retrieving unit not shown in Fig. 3) can be used to obtain the corresponding sequences.
The flowchart 400 shown in Fig. 4 is the process corresponding to the device 300 shown in Fig. 3.
As shown in Fig. 4, for an input video, in the detecting step S410, the detecting unit 350 detects the corresponding object shape in the 1st video frame of the input video and the corresponding occlusion information of the detected shape. As described above, as an optional solution, the detecting unit 350 detects the corresponding shape and occlusion information from the 1st frame of a video frame sequence obtained from the input video. In one implementation, the detecting unit 350 detects the shape information and occlusion information in the 1st frame through the following procedure.
On the one hand, the detecting unit 350 applies a shape detection method (for example, a cascaded regression method) to the 1st frame to detect the corresponding object shape, thereby obtaining the corresponding positions of the feature points of the shape in the 1st frame. For example, in the case where the object to be tracked is a face, the feature points are facial feature points; in the case where the object to be tracked is the human joints, the feature points are human-joint feature points.
On the other hand, the detecting unit 350 applies an occlusion detection method to the 1st frame to detect the corresponding occlusion information of the detected shape. In one example, the occlusion detection method is a template-based matching method, where the templates used for the matching operation include, for example, a mask template, a scarf template, and a sunglasses template. In another example, the occlusion detection method is a model-based object detection and classification technique, where the model used for the detection and classification operations is generated from occlusion samples using, for example, a deep learning method, and is used, for example, to detect the position of the occlusion region in a video frame and to recognize the category of the occlusion in the frame.
In one implementation, the detecting unit 350 detects the occlusion information of the detected shape in the 1st frame by detecting the occlusion region (for example, the mask region 520 shown in Fig. 5A) within the object region (for example, the rectangular region 510 shown in Fig. 5A). In one example, the object region can be estimated based on the tracking result (for example, the object shape) obtained from the previous frame. In another example, the object region can be detected in the corresponding frame by using an existing detection method. Further, the feature points of the detected shape located inside the occlusion region are regarded as occluded feature points, and the feature points located outside the occlusion region are regarded as non-occluded feature points. In other words, for any video frame of the input video, the occlusion information of the object shape in the frame labels each feature point of the shape as either occluded or non-occluded; that is, the occlusion information indicates the occlusion state of each feature point of the shape.
In addition, the occlusion information of the object shape in a video frame is expressed using either a binary representation or a probability representation. The binary representation means that the occlusion state of each occluded feature point is expressed as "1" and the occlusion state of each non-occluded feature point is expressed as "0". The probability representation means that the occlusion state of each feature point is described using a probability value; for example, when the probability value of a feature point is greater than or equal to a predetermined threshold (for example, TH1), the point is regarded as an occluded feature point.
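A minimal sketch of deriving such per-point occlusion information, assuming an axis-aligned occlusion box for the binary case and the TH1 rule for the probability case (both representations are simplified here):

```python
import numpy as np

def occlusion_info(points, occ_box=None, probs=None, th1=0.5):
    """Per-point occlusion states: 1 = occluded feature point, 0 = non-occluded.

    Exactly one of occ_box (x, y, w, h) or probs (K,) should be given.
    """
    if probs is not None:                            # probability representation
        return (np.asarray(probs, dtype=float) >= th1).astype(int)
    pts = np.asarray(points, dtype=float)            # binary representation
    x, y, w, h = occ_box
    inside = ((pts[:, 0] >= x) & (pts[:, 0] <= x + w) &
              (pts[:, 1] >= y) & (pts[:, 1] <= y + h))
    return inside.astype(int)                        # inside the region = occluded
```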
In addition, in order to obtain a more accurate occlusion region within the object region, and hence more accurate occlusion information, the detecting unit 350 applies an image segmentation method to the detected occlusion region (for example, the mask region 520 shown in Fig. 5A) to obtain a refined occlusion region (for example, the mask region 530 shown in Fig. 5B). In one implementation, taking the mask region 520 shown in Fig. 5A as an example, the image segmentation is realized by applying a convolutional neural network (CNN) algorithm to each pixel in the mask region 520. Comparing the mask region 520 shown in Fig. 5A with the updated mask region 530 shown in Fig. 5B, it can be seen that, by refining the mask region 520, the occlusion states of the feature points around the nose region are updated from occluded to non-occluded.
Returning to Fig. 4, in step S420, for the t-th video frame of the input video (where t ≥ 2), the device 300 shown in Fig. 3 determines the corresponding object shape and the corresponding occlusion information in the t-th frame. In one implementation, the device 300 determines the corresponding information with reference to Fig. 6.
Then, after the device 300 determines the corresponding shape and occlusion information in the t-th frame, in step S430 the device 300 judges whether t is greater than T. In the case where t is greater than T (meaning that the entire input video has been processed), the process of the device 300 stops. Otherwise, in step S440, the device 300 sets t = t + 1 and repeats the process of step S420.
Fig. 6 schematically shows the flowchart of step S420 shown in Fig. 4 according to the present invention. As shown in Fig. 6, in the shape determining step S421, the shape determining unit 310 shown in Fig. 3 determines the object shape in the t-th frame (i.e., the initial object shape) based on the object shape in at least one previous video frame of the t-th frame.
In one implementation, the shape determining unit 310 directly regards the shape determined in the previous frame nearest to the t-th frame as the initial shape in the t-th frame. In another implementation, the shape determining unit 310 determines the initial shape in the t-th frame by calculating the average or weighted sum of the shapes determined in a plurality of previous frames of the t-th frame.
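Both options can be sketched as follows, assuming shapes are stored as (K, 2) landmark arrays (a hypothetical helper, not the patent's implementation):

```python
import numpy as np

def initial_shape(prev_shapes, weights=None):
    """Initial shape for the t-th frame from the shapes of previous frames."""
    stack = np.stack([np.asarray(s, dtype=float) for s in prev_shapes])  # (n, K, 2)
    if weights is None:
        return stack.mean(axis=0)                    # plain average of the shapes
    w = np.asarray(weights, dtype=float)
    # normalized weighted combination, so the result stays in image coordinates
    return (stack * w[:, None, None]).sum(axis=0) / w.sum()
```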
Returning to Fig. 6, in the information determining step S422, the information determining unit 320 determines the occlusion information of the initial shape in the t-th frame, based on the occlusion information of the object shapes in the at least one previous frame.
In one implementation, the information determining unit 320 directly regards the occlusion information of the shape determined in the previous frame nearest to the t-th frame as the occlusion information of the initial shape in the t-th frame.
In another implementation, in order to obtain accurate occlusion information of the initial shape in the t-th frame, the information determining unit 320 determines the occlusion information of the initial shape using a statistics-based method, based on the occlusion information of the shapes determined in a plurality of previous frames of the t-th frame.
In one example, the information determining unit 320 determines the occlusion information of the initial shape in the t-th frame by calculating the average or weighted sum of the occlusion information of the shapes determined in the plurality of previous frames. In other words, the occlusion information determined from the (t-n)-th frame to the (t-1)-th frame is used to determine the occlusion information of the initial shape in the t-th frame, where n is a natural number and 2 ≤ n < T.
When calculating the weighted sum, since the occlusion information determined in previous frames closer to the t-th frame describes the occlusion information of the initial shape in the t-th frame better, larger weights are assigned to the previous frames closer to the t-th frame and smaller weights to the previous frames farther away, in order to obtain more accurate occlusion information for the t-th frame. For example, assuming n = 6, a weight of 0.8 can be assigned to the (t-1)-th to (t-3)-th frames, and a weight of 0.2 to the (t-4)-th to (t-6)-th frames. Those skilled in the art will appreciate that this example is merely illustrative rather than restrictive. After the corresponding weight is assigned to each previous frame, the corresponding weighted sum is calculated. In the case where the occlusion information of the shapes in the frames is expressed using the binary representation, as described above, the occlusion state of each occluded feature point is expressed as "1" and that of each non-occluded feature point as "0"; accordingly, the occlusion state of a feature point whose weighted sum in the t-th frame is greater than or equal to a predetermined threshold (for example, TH2) is expressed as "1", and the occlusion state of a feature point whose weighted sum is less than the threshold is expressed as "0". In the case where the occlusion information is expressed using the probability representation, the occlusion state of each feature point is described by a probability value as described above, and the corresponding weighted sum is used to express the corresponding probability value of each feature point in the t-th frame.
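A sketch of this propagation; normalizing the weighted sum by the total weight before applying TH2 is an assumption, since the text leaves the scaling open:

```python
import numpy as np

def propagate_occlusion(prev_occ, weights, th2=0.5, binary=True):
    """Occlusion info for the t-th frame from the last n frames.

    prev_occ: list of (K,) occlusion vectors, most recent frame first;
    weights: matching per-frame weights, e.g. [0.8]*3 + [0.2]*3 for n = 6.
    """
    stack = np.stack(prev_occ).astype(float)           # (n, K)
    w = np.asarray(weights, dtype=float)
    score = (stack * w[:, None]).sum(axis=0) / w.sum() # normalized weighted sum
    if binary:
        return (score >= th2).astype(int)              # binary representation, TH2
    return score                                       # probability representation
```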
In another example, in the case where the occlusion information of the shapes in the frames is expressed using the probability representation, in order to obtain more accurate occlusion information of the initial shape in the t-th frame, the information determining unit 320 determines the occlusion information by applying a machine learning method (for example, a Hidden Markov Model (HMM)) to the occlusion information of the shapes determined in the plurality of previous frames. The corresponding value obtained by the machine learning method is then used to express the corresponding probability value of each feature point in the t-th frame.
In yet another implementation, in order to reduce the amount of calculation, the information determining unit 320 determines the occlusion information of the initial shape in the t-th frame based on the stability of the occlusion information of the shapes determined in the plurality of previous frames. More specifically, in the case where the occlusion information of the shapes in the plurality of previous frames is stable, the information determining unit 320 regards the occlusion information of the shape in any one of the previous frames as the occlusion information of the initial shape in the t-th frame. In other words, in the case where the occlusion information in the previous frames is stable (meaning that the occlusion occurring in the input video stays unchanged), the occlusion information of the shape in each previous frame is the same; therefore, instead of executing the statistics-based method described above, the occlusion information of any one previous frame can be used as that of the initial shape in the t-th frame.
In addition, on the one hand, a judging operation can be implemented by the information determining unit 320 or by a dedicated unit (not shown in Fig. 3) to judge whether the occlusion information of the shapes in the plurality of previous frames is stable.
On the other hand, in one example, whether the occlusion information of the shapes in the plurality of previous frames is stable is determined based on an empirical setting. For example, in the case where the object to be tracked in the input video is occluded by a non-moving object (for example, a mask, sunglasses, or a scarf), the occlusion information of the shapes in the previous frames is regarded as stable.
In another example, in order to reduce the amount of calculation and to obtain more accurate occlusion information of the initial shape in the t-th frame, whether the occlusion information of the shapes in the plurality of previous frames is stable is determined based on the change frequency of the occlusion information of each feature point of the shape across those frames. In the case where the change frequency of the occlusion information of a feature point is less than a predetermined threshold (for example, TH3), the occlusion information of that point is regarded as stable across the previous frames; and in the case where the occlusion information of all feature points is stable across the previous frames, the occlusion information of the shapes in the previous frames is regarded as stable.
More specifically, for a feature point of the shape across the plurality of previous frames, the change frequency of its occlusion information is obtained by calculating how frequently the edit distance of the point's occlusion state between every two adjacent previous frames changes. For example, for a feature point of the shapes from the (t-n)-th frame to the (t-1)-th frame, the (n-1) edit distances of the point between every two adjacent frames are first calculated; the change frequency of the point's occlusion information is then calculated, for example, by a formula of the following form:
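A plausible form, assuming the change frequency is defined as the fraction of adjacent-frame pairs whose edit distance indicates a change of the point's occlusion state, is:

$$
f_k = \frac{1}{n-1} \sum_{i=t-n}^{t-2} \mathbb{1}\!\left[\, d_k(i,\, i+1) > 0 \,\right]
$$

where $d_k(i, i+1)$ is the edit distance of the occlusion state of feature point $k$ between the adjacent frames $i$ and $i+1$, and $\mathbb{1}[\cdot]$ is the indicator function; $f_k$ is then compared against the threshold TH3.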
Returning to Fig. 6, after the initial shape in the t-th frame and its occlusion information are determined, in the shape updating step S423, the shape updating unit 330 updates the initial shape in the t-th frame based on the occlusion information of that initial shape. Based on this occlusion information, it can be determined which feature points of the initial shape in the t-th frame are occluded feature points and which are non-occluded feature points.
Therefore, for the non-occluded feature points of the initial shape in the t-th frame, the shape updating unit 330 updates their positions by using a shape detection method (such as the CSR method or a deep-learning-based shape detection method). That is, the shape detection method is used to determine the final positions of the non-occluded feature points.
Considering the stability of the geometric relationships, relative to the object region, between the non-occluded feature points and the occluded feature points, for the occluded feature points of the initial shape in the t-th frame, the shape updating unit 330 updates their positions based on the final positions of the non-occluded feature points and on those geometric relationships. That is, the final positions of the occluded feature points are determined based on the final positions of the non-occluded feature points and the specific geometric relationships.
For example, in the case where the object to be tracked is a face, the corresponding geometric relationships may include the following:
Relationship 1: the distance between the two eye centers is about one third of the width of the face region; and/or
Relationship 2: the distance between the mouth center and the left eye center, the distance between the mouth center and the right eye center, and the distance between the left eye center and the right eye center are roughly the same; and/or
Relationship 3: the distance between the nose center and the mouth center is about a quarter of the height of the face region.
In addition, taking a face occluded by a mask as an example, that is, a case where at least the feature points around the left-eye and right-eye regions are non-occluded feature points and at least the feature points around the mouth region are occluded feature points, after the final positions of the feature points around the eye regions are determined, the shape updating unit 330 can determine the final positions of the feature points around the mouth region based on Relationship 2 above and the final positions of the feature points around the eye regions.
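As a toy illustration of this geometric update, the following sketch places the occluded mouth center from the two recovered eye centers using Relationship 2 (three roughly equal pairwise distances put the mouth center near the apex of an equilateral triangle); the landmark semantics and the image-coordinate convention (y pointing down) are assumptions:

```python
import numpy as np

def infer_mouth_center(left_eye, right_eye):
    """Occluded mouth center from non-occluded eye centers (Relationship 2)."""
    left_eye = np.asarray(left_eye, dtype=float)
    right_eye = np.asarray(right_eye, dtype=float)
    mid = (left_eye + right_eye) / 2.0
    v = right_eye - left_eye
    perp = np.array([-v[1], v[0]])                 # 90-degree rotation; with y
    perp /= np.linalg.norm(perp) + 1e-9            # pointing down, this aims at
    d = np.linalg.norm(v)                          # the lower face
    return mid + perp * d * (np.sqrt(3.0) / 2.0)   # apex of equilateral triangle
```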
Returning to Fig. 6, after the final positions of the non-occluded feature points and the final positions of the occluded feature points are determined, that is, after the final object shape in the t-th frame is determined, in order to propagate more accurate occlusion information of the shape to the subsequent frames, in the information updating step S424 the information updating unit 340 updates the occlusion information of the initial shape in the t-th frame based on the final shape in the t-th frame. The updated occlusion information is then regarded as the occlusion information corresponding to the object shape in the t-th frame.
In one implementation, the information updating unit 340 updates the occlusion information of the initial shape in the t-th frame by judging the occlusion state of each feature point of the final shape in the t-th frame, based on a pre-generated occlusion classifier or another occlusion judgment method. In one implementation, the pre-generated occlusion classifier is a binary classifier generated from positive and negative samples using a learning method such as the Support Vector Machine (SVM) algorithm or the Adaboost algorithm, where the positive samples are generated by sampling the image regions around occluded feature points and the negative samples are generated by sampling the image regions around non-occluded feature points.
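A hedged sketch of this information-update step, assuming a scikit-learn-style binary classifier `clf` over fixed-size grayscale patches; the patch size and the crude fixed-length feature are illustrative assumptions:

```python
import numpy as np

def update_occlusion(frame, final_shape, clf, half=16):
    """Re-judge each landmark's occlusion state on the final shape.

    frame: (H, W) grayscale image; final_shape: (K, 2) landmark positions;
    clf.predict returns 1 for occluded patches and 0 for non-occluded ones.
    """
    h, w = frame.shape[:2]
    states = np.zeros(len(final_shape), dtype=int)
    for k, (x, y) in enumerate(np.asarray(final_shape, dtype=int)):
        x0, x1 = max(x - half, 0), min(x + half, w)    # clip the patch to the image
        y0, y1 = max(y - half, 0), min(y + half, h)
        patch = frame[y0:y1, x0:x1].astype(float).ravel()
        feat = np.resize(patch, 4 * half * half)       # crude fixed-length feature
        states[k] = clf.predict(feat[None, :])[0]
    return states
```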
Figs. 7A to 7D schematically show an example of the occlusion information of the shape in the t-th frame determined by step S420 shown in Fig. 6. Fig. 7A shows the occlusion information of the shapes from the 0th frame to the (t-1)-th frame, where the horizontal direction corresponds to the video frames, the vertical direction corresponds to the feature points of the shape, the symbol "O" indicates that the corresponding feature point is a non-occluded feature point, and the symbol "X" indicates that it is an occluded feature point. Fig. 7B shows the occlusion information of the initial shape in the t-th frame after step S422 shown in Fig. 6 is executed. Fig. 7C shows the occlusion information of the shape in the t-th frame after step S424 is executed. Fig. 7D shows the occlusion information of the shapes from the 0th frame to the t-th frame.
As described above, in the present invention, on the one hand, the occlusion information determined from the previous frames is used to assist in determining the final shape in the current frame. Therefore, while determining the final shape in the current frame, the final positions of the occluded feature points and the final positions of the non-occluded feature points can be determined differently, which avoids the influence of the occlusions present in the input video. On the other hand, after the final shape in the current frame is determined, it is in turn used to assist in updating the occlusion information of the initial shape in the current frame, so that the occlusion information propagated to the subsequent frames is more accurate. Accordingly, when the object whose shape is being tracked in a video is occluded by other objects, the present invention improves the accuracy of the object shapes and of the tracking results for a single frame and even for the entire video.
(Image processing system)
In a crowded walk-through scene (for example, on a street, in a mall, or in a supermarket), one person is often occluded by another person (as shown in Fig. 1B); that is, corresponding occlusions often occur between person shapes. Therefore, when tracking a particular person in a video, the occlusions caused by other people usually affect the accuracy of person tracking. For example, an occlusion caused by another person in the video usually causes the tracked person to be lost, or causes the tracking ID of the tracked person to be switched, where switching the tracking ID includes, for example, assigning a new tracking ID to the tracked person or exchanging the tracking IDs of two tracked people.
The inventors found that, when the shape of a particular person in a video is tracked, in the case where occlusions caused by non-moving objects (for example, a mask, sunglasses, or a scarf) are not considered and the person is not occluded by any other person, the occlusion information of that person remains unchanged across all video frames of the video. In the case where the particular person crosses paths with another person during a certain period, the occlusion information of the occluded person changes across the corresponding frames, while the occlusion information of the non-occluded person remains unchanged. Therefore, the inventors realized that the occlusion information of people across the video frames can be used to assist in tracking their shapes, so that the influence of occlusions caused by other people can be avoided and the accuracy of person tracking can be improved.
As an exemplary application of the above process described with reference to Figs. 3 to 7D, an exemplary image processing system will next be described with reference to Fig. 8. As shown in Fig. 8, the image processing system 800 includes the device 300 (i.e., a first image processing device), a second image processing device 810, and a third image processing device 820. In one implementation, the device 300, the second image processing device 810, and the third image processing device 820 are connected to one another via a system bus; in another implementation, they are connected via a network. In addition, the three devices can be implemented by the same electronic device (for example, a computer, a PDA, a mobile phone, or a camera), or optionally by different electronic devices.
As shown in Fig. 8, first, the device 300 and the second image processing device 810 receive a video output from a dedicated electronic device (for example, a camera) or input by a user.
Then, for any two people in the input video, the device 300 determines, with reference to Figs. 3 to 7D, the shape of each person in each video frame of the input video and the occlusion information of each person's shape in each frame.
Further, for any two people in the input video, the second image processing device 810 determines the tracking information of each person's shape in each video frame of the input video. In one implementation, the second image processing device 810 performs, for example, general tracking on each frame of the input video to determine the corresponding tracking information. The tracking information of a person's shape in a frame includes, for example, the tracking ID of the person and the trajectory of each feature point of the person's shape.
Then, for any two people in any one video frame of the input video, the third image processing device 820 determines the occlusion relationship between the two people, based on the occlusion information of each person's shape in at least one previous frame of that video frame and the tracking information of each person's shape in the at least one previous frame. The occlusion relationship between two people is, in particular, the positional relationship of the occlusion occurring between them. For example, the occlusion relationship between person A and person B indicates that person A is occluded by person B, or that person B is occluded by person A, or that person A and person B do not occlude each other.
In order to reduce the amount of calculation, in one implementation, the third image processing device 820 determines the occlusion relationship between the two people in a video frame of the input video, based on the change in the number of non-occluded feature points of each person's shape across the previous frames of that video frame and on the relative position between the two people in that frame.
More specifically, on the one hand, for two people in a particular frame of the input video, based on the tracking information of the two people determined by the second image processing device 810 in at least one frame before the particular frame, especially based on the corresponding trajectories of the feature points of the two people's shapes in the at least one previous frame, the third image processing device 820 determines the corresponding relative position between the two people in the particular frame. In one example, the relative position between the two people is calculated as the Euclidean distance between them.
On the other hand, for the two people in the particular frame, based on the occlusion information of the two people determined by the device 300 in the at least one frame before the particular frame, the third image processing device 820 first determines the number of non-occluded feature points of each person's shape in each previous frame. Then, based on the determined numbers of non-occluded feature points, the third image processing device 820 determines the change in the number of non-occluded feature points of each person's shape across the previous frames.
Then, based on the determined relative position between the two people and the determined changes in the numbers of non-occluded feature points of their shapes, the third image processing device 820 determines the occlusion relationship between the two people in the particular frame.
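A simplified sketch of this judgment, using an assumed distance threshold and the first-to-last change in the non-occluded point counts (details the text leaves open):

```python
import numpy as np

def occlusion_relation(pos_a, pos_b, occ_a, occ_b, dist_th=50.0):
    """Occlusion relationship between persons A and B in a particular frame.

    pos_a, pos_b: (2,) current positions; occ_a, occ_b: (n, K) occlusion
    vectors (1 = occluded) over the preceding frames, oldest first.
    """
    dist = np.linalg.norm(np.asarray(pos_a, float) - np.asarray(pos_b, float))
    if dist > dist_th:                               # too far apart to occlude
        return "no occlusion"
    vis_a = (np.asarray(occ_a) == 0).sum(axis=1)     # non-occluded point counts
    vis_b = (np.asarray(occ_b) == 0).sum(axis=1)
    da, db = vis_a[-1] - vis_a[0], vis_b[-1] - vis_b[0]
    if db < da:                                      # B is losing visible points
        return "B occluded by A"
    if da < db:
        return "A occluded by B"
    return "no occlusion"
```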
Figs. 9A and 9B schematically show two people (for example, person A and person B) in a crowded walk-through scene. Fig. 9A shows the relative position between person A and person B in a previous frame (such as the (t-m)-th frame), and Fig. 9B shows the relative position between them in the current frame (such as the t-th frame). For person A, it can be seen that, from the (t-m)-th frame to the t-th frame, the number of non-occluded feature points of A's shape remains unchanged. For person B, it can be seen that, across the frames around the t-th frame, the number of non-occluded feature points of B's shape gradually decreases. Therefore, it can be judged that, during the period around the t-th frame, person B is occluded by person A; in other words, during that period, the occlusion relationship between A and B is that B is occluded by A.
In addition, as described above, an occlusion caused by another person in a video usually causes the tracking ID of the tracked person to be switched, which leads to erroneous tracking output during person tracking. In particular, in applications that count the people in a particular space or count the number of people passing through a particular space, switching the tracking ID of a person during tracking leads to erroneous counting results. Therefore, in order to reduce tracking-ID switches before and after a specific occluded position when corresponding occlusions occur between people, and thus improve the accuracy of person tracking, after determining the occlusion relationship between any two people in the input video, the third image processing device 820 further updates the tracking information of the two people determined by the second image processing device 810, based on the occlusion relationship between them in each frame of the input video. For example, in the case where the third image processing device 820 finds that two tracking IDs before and after a specific occluded position actually belong to the same person, the third image processing device 820 corrects the erroneous tracking ID.
In one implementation, the third image processing device 820 judges whether two tracking IDs before and after a specific occluded position belong to the same person through the following operations. Taking person D shown in Fig. 10B as an example, Fig. 10B shows the t-th frame in which the occlusion of person D occurs, Fig. 10A shows the (t-1)-th frame before the t-th frame, and Fig. 10C shows the (t+1)-th frame after the t-th frame. In addition, from the (t-1)-th frame to the (t+1)-th frame, the occlusion relationship between person C and person D is that D is occluded by C.
For person D in the (t-1)-th frame, the third image processing device 820 extracts a corresponding appearance feature vector from the non-occluded feature points of D's shape in the (t-1)-th frame, where the non-occluded feature points of D's shape in the (t-1)-th frame are determined based on the occlusion relationship between C and D in that frame. For person D in the (t+1)-th frame, the third image processing device 820 extracts a corresponding appearance feature vector from the non-occluded feature points of D's shape in the (t+1)-th frame, where the non-occluded feature points of D's shape in the (t+1)-th frame are determined based on the occlusion relationship between C and D in that frame.
Then, in the case where the similarity measure between the two appearance feature vectors is less than or equal to a predetermined threshold (for example, TH4), the third image processing device 820 judges that person D in the (t-1)-th frame and person D in the (t+1)-th frame are actually the same person; that is, the tracking ID of D in the (t-1)-th frame and the tracking ID of D in the (t+1)-th frame should be identical. In the case where the two tracking IDs differ, the third image processing device 820 corrects the erroneous tracking ID, thereby ensuring that person D, occluded by person C, keeps the same tracking ID both before and after the t-th frame. The similarity measure between two appearance feature vectors is obtained, for example, by calculating the distance between the two vectors.
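This judgment reduces to a distance test between the two appearance vectors; the metric and threshold handling below are assumptions for illustration:

```python
import numpy as np

def same_person(feat_before, feat_after, th4=0.5):
    """True if the appearance vectors extracted from the non-occluded feature
    points before and after the occlusion are close enough (threshold TH4)."""
    d = np.linalg.norm(np.asarray(feat_before, dtype=float) -
                       np.asarray(feat_after, dtype=float))
    return d <= th4
```

When this returns True but the two tracking IDs differ, the later ID would be corrected to match the earlier one.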
As described above, in the present invention, the image processing system 800 shown in FIG. 8 can determine the occlusion relation between any two people in any one video frame of the input video. Therefore, in a case where specific occlusion exists between the persons in the input video, that is, in a case where specific occlusion exists between the person shapes in the input video, the image processing system 800 can correct erroneous tracking IDs based on the occlusion relation. Consequently, the present invention can reduce the switching of tracking IDs before and after the occluded position, and thus can improve the accuracy of person tracking.
All of the units described above are exemplary and/or preferred modules for implementing the processes described in the present disclosure. These units can be hardware units (such as field programmable gate arrays (FPGAs), digital signal processors, application-specific integrated circuits, and the like) and/or software modules (such as computer-readable programs). The units for implementing each step have not been exhaustively described above. However, where there is a step of performing a particular process, there can be a corresponding functional module or unit (implemented by hardware and/or software) for implementing the same process. Technical solutions defined by all combinations of the described steps and the units corresponding to these steps are included in the disclosure of the present application, as long as the technical solutions they constitute are complete and applicable.
The methods and apparatuses of the present invention can be implemented in various ways. For example, the methods and apparatuses of the present invention can be implemented by software, hardware, firmware, or any combination thereof. Unless otherwise expressly stated, the above order of the steps of the method is intended to be merely illustrative, and the steps of the method of the present invention are not limited to the order specifically described above. Furthermore, in some embodiments, the present invention can also be embodied as a program recorded in a recording medium, including machine-readable instructions for implementing the method according to the present invention. Accordingly, the present invention also covers a recording medium storing a program for implementing the method according to the present invention.
Although some specific embodiments of the present invention have been illustrated in detail by way of example, those skilled in the art should understand that the above examples are intended to be merely illustrative and do not limit the scope of the present invention. Those skilled in the art should understand that the above embodiments can be modified without departing from the scope and spirit of the present invention. The scope of the present invention is defined by the appended claims.
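Before turning to the claims, the per-frame loop of the claimed device can be summarized in a short illustrative sketch; the four callables mirror the four units of claim 1 and the detection unit of claim 9, and all names are hypothetical:

def track_shapes(frames, detect, units):
    """Per-frame loop of the claimed device: determine the object shape from
    previous frames, determine its occlusion information, update the shape
    based on that information, then update the occlusion information based
    on the updated shape. `detect` initializes the first video frame."""
    shape, occlusion = detect(frames[0])                       # detection unit (claim 9)
    history = [(shape, occlusion)]
    for frame in frames[1:]:
        shape = units.determine_shape(frame, history)          # shape determination unit
        occlusion = units.determine_occlusion(shape, history)  # information determination unit
        shape = units.update_shape(frame, shape, occlusion)    # shape updating unit
        occlusion = units.update_occlusion(frame, shape)       # information updating unit
        history.append((shape, occlusion))
    return history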

Claims (16)

1. A device for tracking an object shape in a video, the device comprising:
a shape determination unit configured to determine an object shape in a current video frame of an input video, based on object shapes in at least one previous video frame of the current video frame;
an information determination unit configured to determine occlusion information of the object shape determined by the shape determination unit, based on occlusion information of the object shapes in the at least one previous video frame;
a shape updating unit configured to update the object shape determined by the shape determination unit, based on the occlusion information determined by the information determination unit; and
an information updating unit configured to update the occlusion information determined by the information determination unit, based on the object shape updated by the shape updating unit.
2. The device according to claim 1, wherein the information determination unit determines, using a statistics-based method, the occlusion information of the object shape determined by the shape determination unit, based on the occlusion information of the object shapes in the previous video frames.
3. The device according to claim 1, wherein, in a case where the occlusion information of the object shapes in the previous video frames is stable, the information determination unit regards the occlusion information of the object shape in any one of the previous video frames as the occlusion information of the object shape determined by the shape determination unit.
4. The device according to claim 3, wherein whether the occlusion information of the object shapes in the previous video frames is stable is determined based on the change frequency, between the previous video frames, of the occlusion information of each feature point of the object shapes.
5. The device according to claim 1, wherein, for any one video frame, the occlusion information of the object shape in the video frame divides the feature points of the object shape into occluded feature points and non-occluded feature points.
6. The device according to claim 5, wherein the shape updating unit updates positions of the non-occluded feature points of the object shape determined by the shape determination unit, using a shape detection method.
7. The device according to claim 6, wherein the shape updating unit updates positions of the occluded feature points of the object shape determined by the shape determination unit, based on the positions of the non-occluded feature points updated by the shape updating unit and on the geometric relations of the non-occluded feature points and the occluded feature points relative to the object region.
8. The device according to claim 1, wherein the information updating unit updates the occlusion information determined by the information determination unit, by judging occlusion information of each feature point of the object shape updated by the shape updating unit using a pre-generated occlusion classifier.
9. The device according to claim 1, further comprising:
a detection unit configured to, for a first video frame of the input video or for a first video frame in a video frame sequence obtained from the input video, detect the object shape in the first video frame and detect the occlusion information of the object shape detected in the first video frame.
10. The device according to claim 9, wherein the detection unit detects the occlusion information of the object shape detected in the first video frame by detecting an occluded region in the object region.
11. The device according to claim 10, wherein the detection unit updates the detected occluded region by applying an image segmentation method to the detected occluded region.
12. A method for tracking an object shape in a video, the method comprising:
a shape determination step of determining an object shape in a current video frame of an input video, based on object shapes in at least one previous video frame of the current video frame;
an information determination step of determining occlusion information of the object shape determined in the shape determination step, based on occlusion information of the object shapes in the at least one previous video frame;
a shape updating step of updating the object shape determined in the shape determination step, based on the occlusion information determined in the information determination step; and
an information updating step of updating the occlusion information determined in the information determination step, based on the object shape updated in the shape updating step.
13. The method according to claim 12, further comprising:
a detection step of, for a first video frame of the input video or for a first video frame in a video frame sequence obtained from the input video, detecting the object shape in the first video frame and detecting the occlusion information of the object shape detected in the first video frame.
14. An image processing system, the system comprising:
a first image processing apparatus configured to, for any two people in an input video, determine the shape of each person in each video frame of the input video and the occlusion information of the shape of each person in each video frame of the input video, according to any one of claims 1 to 11;
a second image processing apparatus configured to, for the any two people in the input video, determine tracking information of the shape of each person in each video frame of the input video; and
a third image processing apparatus configured to, for the any two people in any one video frame of the input video, determine the occlusion relation between the two people, based on the occlusion information of the shape of each person in at least one previous video frame of the video frame and on the tracking information of the shape of each person in the at least one previous video frame of the video frame.
15. The system according to claim 14, wherein the third image processing apparatus updates the tracking information determined by the second image processing apparatus, based on the occlusion relation between the two people determined by the third image processing apparatus in each video frame of the input video.
16. The system according to claim 14, wherein, for the any two people in any one video frame of the input video, the third image processing apparatus determines the occlusion relation between the two people, based on the change amounts of the non-occluded feature points of the shape of each person between previous video frames of the video frame and on the relative position between the two people in the video frame.
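For illustration only, the occlusion-relation determination recited in claim 16 might be approximated as follows; the displacement test, the motion_tol threshold, and the use of the bottom edge of the person region as a proxy for relative position are assumptions, not the claim's wording:

import numpy as np

def occlusion_relation(person_a, person_b, motion_tol=2.0):
    """Each person is a dict with:
      'points_prev', 'points_now' : N x 2 feature-point arrays
      'mask'                      : length-N boolean, True = occluded
      'depth'                     : scalar proxy for relative position
                                    (e.g. bottom edge of the person region)
    """
    def change(p):
        visible = ~p['mask']
        d = np.linalg.norm(p['points_now'][visible] - p['points_prev'][visible], axis=1)
        return d.mean() if d.size else 0.0

    # A large change of the non-occluded feature points suggests the shape is
    # being disturbed; the disturbed person lying behind the other is judged
    # to be the occluded one.
    disturbed_a = change(person_a) > motion_tol
    disturbed_b = change(person_b) > motion_tol
    if disturbed_a and person_a['depth'] < person_b['depth']:
        return 'A occluded by B'
    if disturbed_b and person_b['depth'] < person_a['depth']:
        return 'B occluded by A'
    return 'no occlusion'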
CN201810288618.0A 2017-04-17 2018-04-03 Object shape tracking device and method, and image processing system Active CN108564014B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710248896 2017-04-17
CN2017102488969 2017-04-17

Publications (2)

Publication Number Publication Date
CN108564014A true CN108564014A (en) 2018-09-21
CN108564014B CN108564014B (en) 2022-08-09

Family

ID=63533906

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810288618.0A Active CN108564014B (en) 2017-04-17 2018-04-03 Object shape tracking device and method, and image processing system

Country Status (1)

Country Link
CN (1) CN108564014B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1959701A (en) * 2005-11-03 2007-05-09 中国科学院自动化研究所 Method for tracking multiple human faces from video in real time
CN102385690A (en) * 2010-09-01 2012-03-21 汉王科技股份有限公司 Target tracking method and system based on video image
CN104573614A (en) * 2013-10-22 2015-04-29 北京三星通信技术研究有限公司 Equipment and method for tracking face
JP2016192132A (en) * 2015-03-31 2016-11-10 Kddi株式会社 Image recognition ar device, device for estimating posture thereof, and posture tracking device
CN106462976A (en) * 2014-04-30 2017-02-22 国家科学研究中心 Method of tracking shape in a scene observed by an asynchronous light sensor
CN106447701A (en) * 2015-08-05 2017-02-22 佳能株式会社 Methods and devices for image similarity determining, object detecting and object tracking
CN108734735A (en) * 2017-04-17 2018-11-02 佳能株式会社 Object shapes tracks of device and method and image processing system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CUNJIAN CHEN, JIASHU ZHANG: "Wavelet Energy Entropy as a New Feature Extractor for Face Recognition", Fourth International Conference on Image and Graphics (ICIG 2007) *
M. ASADI: "Tracking by Using Dynamic Shape Model Learning in the Presence of Occlusion", 2007 IEEE Conference on Advanced Video and Signal Based Surveillance *
WU Congzhong et al.: "Face Occlusion Removal Based on Sparse Representation", Journal of Hefei University of Technology (Natural Science) *
QIU Peng et al.: "Robust Local Feature Point Extraction Method for Partially Occluded Targets", Modern Electronics Technique *

Also Published As

Publication number Publication date
CN108564014B (en) 2022-08-09

Similar Documents

Publication Publication Date Title
Ibrahim et al. An automatic Arabic sign language recognition system (ArSLRS)
US10318797B2 (en) Image processing apparatus and image processing method
CN109934065B (en) Method and device for gesture recognition
US8879803B2 (en) Method, apparatus, and computer program product for image clustering
Zhou et al. Semi-supervised salient object detection using a linear feedback control system model
CN112016371B (en) Face key point detection method, device, equipment and storage medium
US20200012887A1 (en) Attribute recognition apparatus and method, and storage medium
Kang et al. Development of head detection and tracking systems for visual surveillance
CN110956061A (en) Action recognition method and device, and driver state analysis method and device
CN109063580A (en) Face identification method, device, electronic equipment and storage medium
CN106204658A (en) Moving image tracking and device
CN109767453A (en) Information processing unit, background image update method and non-transient computer readable storage medium
JP2017523498A (en) Eye tracking based on efficient forest sensing
KR20080069878A (en) Face view determining apparatus and method and face detection apparatus and method employing the same
CN108734735B (en) Object shape tracking device and method, and image processing system
CN109753859B (en) Device and method for detecting human body component in image and image processing system
CN117216313B (en) Attitude evaluation audio output method, attitude evaluation audio output device, electronic equipment and readable medium
Yun et al. Unsupervised moving object detection through background models for ptz camera
CN107665495B (en) Object tracking method and object tracking device
Luna et al. People re-identification using depth and intensity information from an overhead camera
CN108564014A (en) Object shapes tracks of device and method and image processing system
CN107154052A (en) The method and device of Obj State estimation
Rungruangbaiyok et al. Probabilistic static foreground elimination for background subtraction
CN112115740A (en) Method and apparatus for processing image
EP4160552A1 (en) Triggering a head-pose dependent action

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant