CN106920250B - Robot target identification and localization method and system based on RGB-D video - Google Patents

Robot target identification and localization method and system based on RGB-D video

Info

Publication number
CN106920250B
CN106920250B (application CN201710078328.9A)
Authority
CN
China
Prior art keywords
target
frame
video
positioning
candidate area
Prior art date
Legal status
Active
Application number
CN201710078328.9A
Other languages
Chinese (zh)
Other versions
CN106920250A (en)
Inventor
陶文兵 (Wenbing Tao)
李坤乾 (Kunqian Li)
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201710078328.9A priority Critical patent/CN106920250B/en
Publication of CN106920250A publication Critical patent/CN106920250A/en
Application granted granted Critical
Publication of CN106920250B publication Critical patent/CN106920250B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a robot target identification and localization method and system based on RGB-D video. Through target candidate extraction, recognition, reliability estimation and optimization based on temporal consistency, target segmentation, and location estimation, the target category is determined in the scene and an accurate spatial position is obtained. The present invention exploits scene depth information to enhance the spatial-hierarchy perception ability of the recognition and localization algorithms. By adopting key-frame-based long- and short-term spatio-temporal consistency constraints, the method improves video processing efficiency while guaranteeing the identity and association of targets in long-sequence target identification and localization tasks. During localization, the target is accurately segmented in the image plane and the positional consistency of the same target is evaluated in depth space, realizing cooperative target localization across multiple information modalities. The method has low computational cost, good real-time performance, and high recognition and localization accuracy, and can be applied to robot tasks based on online visual information parsing and understanding.

Description

Robot target identification and localization method and system based on RGB-D video
Technical field
The invention belongs to the technical field of computer vision, and more particularly relates to a robot target identification and localization method and system based on RGB-D video.
Background technique
In recent years, with the rapid development of robotics, machine vision techniques for object-manipulation tasks have also attracted extensive attention from researchers. Among them, the identification and accurate localization of targets is an important part of the robot vision problem and a precondition for executing subsequent tasks.
Existing target identification methods generally comprise two steps: extracting information about the target to be identified as a feature representation, and matching it against the scene to be identified. Traditional representations of the target generally rely on geometry, target appearance, or local feature extraction; such methods often suffer from poor generality, insufficient stability, and weak target abstraction capability. These representational defects in turn bring difficulties that are hard to overcome in the subsequent matching process.
After obtaining the representation of the target to be identified, target matching compares the obtained target representation with the features of the scene to be identified so as to recognize the target. Broadly speaking, existing methods fall into two classes: region-based matching and feature-based matching. Region-based matching compares information extracted from local sub-regions of the image, with a computational cost proportional to the number of sub-regions to be matched; feature-based matching compares characteristic features in the image, and its accuracy is closely related to the validity of the feature representation. Both classes place high demands on candidate-region acquisition and feature representation, yet, owing to the limitations of two-dimensional image information and of the designed features, they often perform poorly in the complex-environment recognition tasks of object manipulation.
Target localization is ubiquitous in industrial production and daily life, for example GPS in outdoor activities, military radar surveillance, and naval sonar; such equipment offers accurate localization over a very wide operating range, but at a high price. Vision-based localization systems have become a new research hotspot in recent years. According to the visual sensor used, they are mainly divided into methods based on monocular vision sensors, binocular and depth sensors, and panoramic vision sensors. Monocular vision sensors are cheap, structurally simple, and easy to calibrate, but their localization accuracy is often poor; panoramic vision sensors capture complete scene information with higher localization accuracy, but are computationally heavy, poor in real-time performance, and complex and expensive. Depth estimation based on binocular vision, or dedicated depth-acquisition devices, perceives scene distance well, with relatively simple systems and easily achieved real-time operation, and has attracted increasing attention in recent years. However, research in this field is still at an early stage, and efficient target localization methods that can process RGB-Depth video in real time are still lacking.
Owing to the high demand for depth-perception capability, most existing robot systems acquire RGB-Depth video as their visual information source; depth information provides rich cues for three-dimensional scene perception, hierarchical separation of complex targets, and localization. However, because robot operating scenes are complex, the computational complexity is high and the amount of computation large, and there is as yet no systematic, fast, and convenient method for RGB-Depth video target identification and accurate localization. Therefore, research on indoor-robot target identification and precise localization algorithms based on RGB-Depth video not only has strong research value but also very broad application prospects.
Summary of the invention
In view of the above defects or improvement needs of the prior art, the present invention provides a robot target identification and localization method and system based on RGB-D video. By processing the first-person-view RGB-Depth video acquired by the robot, real-time and accurate target identification and precise localization of the target in the robot working environment are realized, thereby assisting complex robot tasks such as target grasping. This solves the technical problem that efficient target localization methods capable of processing RGB-Depth video in real time are currently lacking.
To achieve the above object, according to one aspect of the present invention, a robot target identification and localization method based on RGB-D video is provided, comprising:
(1) obtaining an RGB-D video frame sequence of the scene in which the target to be identified and localized is located;
(2) extracting key video frames from the RGB-D video frame sequence, extracting target candidate areas from the key video frames, and filtering and screening the target candidate areas according to the depth information corresponding to each key video frame;
(3) identifying the filtered and screened target candidate areas with a deep network, and ranking the target identification results by confidence through long-sequence spatio-temporal correlation constraints and multi-frame identification consistency estimation;
(4) performing local fast segmentation on the filtered and screened target candidate areas, selecting main key video frames from the key video frames according to the confidence of the target identification results and the temporal interval relationship of each key video frame, and extending and cooperatively optimizing the segmented regions over preceding and succeeding adjacent frames;
(5) determining key feature points in the scene as localization reference points, thereupon estimating the camera view angle and camera motion, applying target-feature consistency constraints and target-position consistency constraints to the identification and segmentation results of the main key video frames, estimating the cooperative confidence of the target to be identified and localized, and performing accurate spatial localization.
Preferably, step (2) specifically comprises:
(2.1) determining, by interval sampling or a key-frame extraction method, the key video frames to be used for identifying the target to be identified and localized;
(2.2) obtaining target candidate areas in the key video frames with a confidence ranking method based on objectness priors to form a target candidate area set, obtaining the hierarchical attributes inside each target candidate area and in its neighborhood from the depth information corresponding to each key video frame, and optimizing, screening, and re-ranking the target candidate area set.
Preferably, step (3) specifically comprises:
(3.1) feeding the target candidate areas screened in step (2) into a trained target identification deep network to obtain the target identification prediction result of the key video frame corresponding to each screened target candidate area and the first confidence of each target identification prediction result;
(3.2) according to the long-sequence spatio-temporal correlation constraint, performing feature consistency evaluation on the target identification prediction results of the key video frames, evaluating the second confidence of each target identification prediction result, ranking the accumulated confidences obtained from the first confidence and the second confidence, and further filtering out the target candidate areas whose accumulated confidence is lower than a preset confidence threshold.
Preferably, step (4) specifically comprises:
(4.1) performing a fast target segmentation operation on the target candidate areas obtained in step (3.2) and their extended neighborhoods to obtain an initial segmentation of the target and determine the target boundary;
(4.2) taking short-term spatio-temporal consistency as a constraint and, based on the accumulated-confidence ranking results of step (3.2), screening out the main key video frames from the key video frames;
(4.3) taking long-term spatio-temporal consistency as a constraint, modeling the appearance of the target to be identified and localized based on the initial segmentation of step (4.1), constructing a three-dimensional graph over the main key video frames and their adjacent frames, designing a maximum-a-posteriori Markov random field energy function, optimizing the initial segmentation with a graph-cut algorithm, and extending and optimizing the single-frame target segmentation result over the frames adjacent to the key video frames.
Preferably, step (5) specifically comprises:
(5.1) for the main key video frames obtained in step (4.2), extracting multiple groups of corresponding point pairs as localization reference points according to the adjacency and field-of-view overlap relationships between the main key video frames;
(5.2) estimating the camera view-angle change from the main key video frames with overlapping fields of view, and thereupon, through geometric relationships, estimating the motion information of the camera from the depth information of the localization reference point pairs;
(5.3) evaluating the spatial position consistency of the target to be identified and localized in the main key video frames according to the measured depth information of the target in the main key video frames, the camera view angle, and the camera motion information;
(5.4) evaluating, according to the result of step (4.3), the feature consistency of the two-dimensional segmented regions of the target to be identified and localized;
(5.5) determining the spatial position of the target to be identified and localized by comprehensively evaluating the feature consistency of its two-dimensional segmented regions and its spatial position consistency.
According to another aspect of the present invention, a robot target identification and localization system based on RGB-D video is provided, comprising:
an obtaining module for obtaining an RGB-D video frame sequence of the scene in which the target to be identified and localized is located;
a filtering and screening module for extracting key video frames from the RGB-D video frame sequence, extracting target candidate areas from the key video frames, and filtering and screening the target candidate areas according to the depth information corresponding to each key video frame;
a confidence ranking module for identifying the filtered and screened target candidate areas with a deep network, and ranking the target identification results by confidence through long-sequence spatio-temporal correlation constraints and multi-frame identification consistency estimation;
an optimization module for performing local fast segmentation on the filtered and screened target candidate areas, selecting main key video frames from the key video frames according to the confidence of the target identification results and the temporal interval relationship of each key video frame, and extending and cooperatively optimizing the segmented regions over preceding and succeeding adjacent frames; wherein the optimization module is specifically realized by the following steps:
performing a fast target segmentation operation on the target candidate areas and their extended neighborhoods to obtain an initial segmentation of the target and determine the target boundary;
taking short-term spatio-temporal consistency as a constraint and, based on the accumulated-confidence ranking results, screening out the main key video frames from the key video frames;
taking long-term spatio-temporal consistency as a constraint, modeling the appearance of the target to be identified and localized based on the initial segmentation, constructing a three-dimensional graph over the main key video frames and their adjacent frames, designing a maximum-a-posteriori Markov random field energy function, optimizing the initial segmentation with a graph-cut algorithm, and extending and optimizing the single-frame target segmentation result over the frames adjacent to the key video frames;
a localization module for determining key feature points in the scene as localization reference points, thereupon estimating the camera view angle and camera motion, applying target-feature consistency constraints and target-position consistency constraints to the identification and segmentation results of the main key video frames, estimating the cooperative confidence of the target to be identified and localized, and performing accurate spatial localization.
In general, compared with the prior art, the above technical solutions conceived by the present invention mainly have the following technical advantages: the present invention exploits scene depth information to enhance the spatial-hierarchy perception ability of the recognition and localization algorithms; by adopting key-frame-based long- and short-term spatio-temporal consistency constraints, video processing efficiency is improved while the identity and association of targets in long-sequence identification and localization tasks are guaranteed. During localization, the target is accurately segmented in the image plane and the positional consistency of the same target is evaluated in depth space, realizing cooperative target localization across multiple information modalities. The method has low computational cost, good real-time performance, and high recognition and localization accuracy, and can be applied to robot tasks based on online visual information parsing and understanding.
Brief description of the drawings
Fig. 1 is a schematic diagram of the overall flow of the method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of target identification in an embodiment of the present invention;
Fig. 3 is a schematic flowchart of accurate target localization in an embodiment of the present invention.
Specific embodiment
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be appreciated that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not conflict.
The method disclosed by the invention involves key-frame screening, deep-network-based target identification, segmentation, inter-frame label propagation, consistency-constrained location estimation, cooperative optimization, and other techniques, and can be directly used in robot systems that take RGB-D video as visual input, assisting the robot in completing target identification and accurate target localization tasks.
Fig. 1 is a schematic diagram of the overall flow of the method according to an embodiment of the present invention. As can be seen from Fig. 1, the method comprises two major stages, target identification and accurate target localization, with target identification being the precondition for accurate localization. The specific embodiments are as follows:
(1) obtaining an RGB-D video frame sequence of the scene in which the target to be identified and localized is located;
Preferably, in an embodiment of the invention, the RGB-D video sequence of the scene in which the target to be identified and localized is located can be acquired by a depth vision sensor such as a Kinect; alternatively, RGB picture pairs can be acquired by a binocular imaging device, and the scene depth information calculated by disparity estimation can serve as the depth channel, so as to synthesize RGB-D video as input.
(2) extracting key video frames from the RGB-D video frame sequence, extracting target candidate areas from the key video frames, and filtering and screening the target candidate areas according to the depth information corresponding to each key video frame;
(3) identifying the filtered and screened target candidate areas with a deep network, and ranking the target identification results by confidence through long-sequence spatio-temporal correlation constraints and multi-frame identification consistency estimation;
(4) performing local fast segmentation on the filtered and screened target candidate areas, selecting main key video frames from the key video frames according to the confidence of the target identification results and the temporal interval relationship of each key video frame, and extending and cooperatively optimizing the segmented regions over preceding and succeeding adjacent frames;
(5) determining key feature points in the scene as localization reference points, thereupon estimating the camera view angle and camera motion, applying target-feature consistency constraints and target-position consistency constraints to the identification and segmentation results of the main key video frames, estimating the cooperative confidence of the target to be identified and localized, and performing accurate spatial localization.
Preferably, in one embodiment of the invention, the above step (1) specifically comprises:
(1.1) acquiring the RGB-D video sequence of the scene in which the target to be identified and localized is located with a Kinect, filling holes in the depth image by neighborhood sampling and smoothing, correcting it according to the Kinect parameters and converting it to real depth information, and taking it together with the RGB data as input;
(1.2) when acquiring with a binocular device, sequentially performing camera calibration and stereo matching (e.g., extracting features, extracting corresponding points of the same physical structure, and computing disparity), and finally estimating depth through the projection model as the input of the depth channel of the video.
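Per pixel, the depth estimation of step (1.2) reduces to the pinhole projection model, depth = focal length × baseline / disparity. A minimal sketch, with illustrative parameter names not taken from the patent:

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Convert a stereo disparity (pixels) to metric depth via the
    pinhole projection model Z = f * B / d.  Parameter names are
    illustrative; the patent does not fix a specific stereo rig."""
    if disparity_px <= 0:
        # No valid correspondence: treat the point as infinitely far.
        return float("inf")
    return focal_px * baseline_m / disparity_px
```

For example, a 100-pixel disparity with a 500-pixel focal length and 10 cm baseline corresponds to a depth of about 0.5 m.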
Preferably, in one embodiment of the invention, the above step (2) specifically comprises:
(2.1) determining, by interval sampling or a key-frame extraction method, the key video frames to be used for identifying the target to be identified and localized;
Wherein, step (2.1) specifically comprises: obtaining the scene overlap ratio of consecutive frames by fast scale-invariant feature transform (SIFT) point matching, so as to estimate the current rate of scene change; for video frames in which the photographed scene switches faster, the sampling frequency is increased, and for video frames in which the scene switches more slowly, the sampling frequency is reduced. In addition, when the practical application is more demanding on algorithm efficiency, an interval sampling method can be directly adopted to substitute this step.
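The adaptive sampling policy of step (2.1) can be sketched as follows, assuming the SIFT-based scene overlap ratio of a consecutive frame pair has already been computed; the thresholds and intervals are illustrative choices, not values given by the patent:

```python
def next_sample_interval(overlap_ratio, base_interval=10,
                         min_interval=2, max_interval=30):
    """Pick the frame gap until the next key frame from the scene
    overlap ratio of consecutive frames: a fast-changing scene (low
    overlap) shortens the interval, a nearly static scene lengthens
    it.  All thresholds here are illustrative assumptions."""
    if overlap_ratio < 0.3:      # scene switching quickly -> sample densely
        return min_interval
    if overlap_ratio > 0.8:      # scene nearly static -> sample sparsely
        return max_interval
    return base_interval
```

When efficiency matters more than adaptivity, the function can simply be replaced by a constant interval, matching the fallback the embodiment describes.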
(2.2) obtaining target candidate areas in the key video frames with a confidence ranking method based on objectness priors to form a target candidate area set, obtaining the hierarchical attributes inside each target candidate area and in its neighborhood from the depth information corresponding to each key video frame, and optimizing, screening, and re-ranking the target candidate area set.
Wherein, the confidence ranking method based on objectness priors can be the BING algorithm or the Edge Boxes algorithm. As shown in Fig. 2, the depth information of the corresponding frame is then used to obtain the hierarchical attributes inside the target candidate areas and in their neighborhoods; following the principle that the depth inside a high-confidence candidate box should be smooth while the depth gradient across the box boundary should be large, the target candidate area set is optimized, screened, and re-ranked.
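The depth-based screening principle above — smooth depth inside a good candidate box, a large depth step across its boundary — can be sketched as a scoring function; the particular combination of the two terms is this sketch's own assumption, not the patent's formula:

```python
import numpy as np

def depth_consistency_score(depth, box, margin=2):
    """Score one candidate box on a depth map (higher is better).
    Rewards smooth interior depth and a large depth contrast between
    the box and a padded surround.  Note the surround window still
    contains the box, which only dilutes (never inverts) the contrast.
    Weighting is an illustrative assumption."""
    x0, y0, x1, y1 = box
    inner = depth[y0:y1, x0:x1]
    outer = depth[max(0, y0 - margin):y1 + margin,
                  max(0, x0 - margin):x1 + margin]
    smoothness = 1.0 / (1.0 + inner.std())      # flat interior -> near 1
    step = abs(outer.mean() - inner.mean())     # depth jump at the boundary
    return smoothness * (1.0 + step)
```

A box tightly enclosing a flat foreground object then outscores a box straddling the object boundary, which is exactly the re-ranking behavior the embodiment asks for.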
Preferably, in one embodiment of the invention, the above step (3) specifically comprises:
(3.1) as shown in Fig. 2, feeding the target candidate areas screened in step (2) into a trained target identification deep network to obtain the target identification prediction result of the key video frame corresponding to each screened target candidate area and the first confidence of each target identification prediction result;
Wherein, the trained target identification deep network can be a deep recognition network such as SPP-Net, R-CNN, or Fast R-CNN, and can also be substituted by other deep recognition networks.
(3.2) according to the long-sequence spatio-temporal correlation constraint, performing feature consistency evaluation on the target identification prediction results of the key video frames, evaluating the second confidence of each target identification prediction result, ranking the accumulated confidences obtained from the first confidence and the second confidence, and further filtering out the target candidate areas whose accumulated confidence is lower than a preset confidence threshold.
Optionally, in one embodiment of the invention, the detection and recognition results of the target to be localized can be obtained by issuing a recognition instruction to the algorithm, and algorithm efficiency can be improved by filtering out low-confidence recognition results.
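A minimal sketch of the accumulated-confidence ranking and thresholding of step (3.2): the patent does not fix how the first (per-frame network) and second (temporal-consistency) confidences are accumulated, so the product used here is an illustrative assumption:

```python
def rank_and_filter(detections, threshold=0.5):
    """Combine the per-frame network confidence (c1) with the
    long-sequence consistency confidence (c2) into an accumulated
    score, rank descending, and drop candidates below the threshold.
    The product combination is an assumption of this sketch."""
    scored = [(d["label"], d["c1"] * d["c2"]) for d in detections]
    scored.sort(key=lambda t: t[1], reverse=True)
    return [(label, s) for label, s in scored if s >= threshold]
```

A candidate that the network likes but that flickers across frames (high c1, low c2) is thereby suppressed, which is the point of the multi-frame consistency estimate.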
Optionally, in one embodiment of the invention, the above step (4) specifically comprises:
(4.1) as shown in Fig. 3, performing a fast target segmentation operation on the target candidate areas obtained in step (3.2) and their extended neighborhoods to obtain an initial segmentation of the target and determine the target boundary;
Wherein, as an alternative embodiment, the fast target segmentation operation can be performed with a GrabCut segmentation algorithm based on RGB-D information to obtain the initial segmentation of the target, thereby obtaining the two-dimensional localization result of the target in the current video frame.
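The RGB-D-seeded GrabCut of this embodiment can be sketched as building the initial trimap from the candidate box plus a depth prior; the label values follow OpenCV's GrabCut convention, but the depth test itself is an assumption of this sketch rather than the patent's exact rule:

```python
import numpy as np

# GrabCut-style trimap labels (OpenCV convention).
BG, FG, PR_BG, PR_FG = 0, 1, 2, 3

def init_trimap(depth, box, tol=0.3):
    """Seed a GrabCut-style segmentation: pixels inside the candidate
    box whose depth is close to the box's median depth become probable
    foreground, other in-box pixels probable background, and everything
    outside the box definite background.  The relative depth tolerance
    `tol` is an illustrative assumption."""
    trimap = np.full(depth.shape, BG, dtype=np.uint8)
    x0, y0, x1, y1 = box
    inner = depth[y0:y1, x0:x1]
    med = np.median(inner)
    close = np.abs(inner - med) < tol * med   # depth-consistent pixels
    trimap[y0:y1, x0:x1] = np.where(close, PR_FG, PR_BG)
    return trimap
```

The trimap would then be handed to an iterative GrabCut optimizer over the RGB channels; only the depth-driven initialization is shown here.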
(4.2) in order to further improve the efficiency of video target localization, as shown in Fig. 3, taking short-term spatio-temporal consistency as a constraint and, based on the accumulated-confidence ranking results of step (3.2), screening out the main key video frames from the key video frames with high single-frame recognition confidence and strong spatio-temporal consistency with adjacent frames as the criteria;
(4.3) taking long-term spatio-temporal consistency as a constraint, modeling the appearance of the target to be identified and localized based on the initial segmentation of step (4.1), constructing a three-dimensional graph over the main key video frames and their adjacent frames, designing a maximum-a-posteriori Markov random field energy function, optimizing the initial segmentation with a graph-cut algorithm, and extending the single-frame target segmentation result over the frames adjacent to the key video frames, thereby realizing two-dimensional target segmentation, localization, and optimization based on long- and short-term spatio-temporal consistency.
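A maximum-a-posteriori Markov random field of the kind step (4.3) designs is conventionally written as an energy to be minimized by graph cuts; the concrete potentials below are the standard contrast-sensitive form over the spatio-temporal pixel graph, shown as an illustration rather than the patent's exact design:

```latex
E(L) = \sum_{p \in \mathcal{P}} D_p(l_p)
     + \lambda \sum_{(p,q) \in \mathcal{N}} V_{pq}(l_p, l_q),
\qquad l_p \in \{0, 1\}
```

where $\mathcal{P}$ are the pixels of the main key frames and their adjacent frames, $\mathcal{N}$ the spatial and temporal neighbor pairs of the three-dimensional graph, $D_p(l_p) = -\log P(I_p \mid l_p)$ the negative log-likelihood under the foreground/background appearance model built from the initial segmentation, and $V_{pq}(l_p, l_q) = [l_p \neq l_q]\, \exp(-\beta \lVert I_p - I_q \rVert^2)$ a contrast-sensitive smoothness term; minimizing $E$ is equivalent to MAP inference and is solvable exactly for this binary labeling by a single graph cut.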
Optionally, in one embodiment of the invention, the above step (5) specifically comprises:
(5.1) as shown in Fig. 3, for the main key video frames obtained in step (4.2), extracting multiple groups of corresponding point pairs as localization reference points according to the adjacency and field-of-view overlap relationships between the main key video frames;
(5.2) estimating the camera view-angle change from the main key video frames with overlapping fields of view, and thereupon, through geometric relationships, estimating the motion information of the camera from the depth information of the localization reference point pairs;
Wherein, the motion information of the camera comprises the camera moving distance and motion trajectory.
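Steps (5.1)-(5.2) lift the matched reference points to 3-D with the depth channel and estimate camera motion from the lifted pairs. A minimal translation-only sketch under standard pinhole intrinsics; a full implementation would also recover rotation (e.g., via the Kabsch algorithm over the same point pairs), and the parameter names are illustrative:

```python
import numpy as np

def backproject(u, v, z, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth z to a
    3-D point in the camera frame."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def estimate_translation(pts_prev, pts_curr):
    """Estimate the inter-frame camera translation as the mean 3-D
    displacement of the depth-lifted reference-point pairs.  This is
    the translation-only special case; rotation estimation is omitted
    in this sketch."""
    return np.mean(np.asarray(pts_curr) - np.asarray(pts_prev), axis=0)
```

Accumulating these per-pair displacements over successive main key frames yields the moving distance and trajectory mentioned above.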
(5.3) as shown in Fig. 3, evaluating the spatial position consistency of the target to be identified and localized in the main key video frames according to the measured depth information of the target in the main key video frames, the camera view angle, and the camera motion information;
(5.4) evaluating, according to the result of step (4.3), the feature consistency of the two-dimensional segmented regions of the target to be identified and localized; generally, a region-based deep network is used to extract regional deep features for feature-distance measurement and feature consistency evaluation;
(5.5) determining the spatial position of the target to be identified and localized by comprehensively evaluating the feature consistency of its two-dimensional segmented regions and its spatial position consistency.
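The joint evaluation of step (5.5) can be sketched as a fusion of the two consistency scores into the cooperative confidence; the convex weighting below is an illustrative assumption, since the patent specifies comprehensive evaluation but no formula:

```python
def cooperative_confidence(feat_sim, pos_sim, w_feat=0.5):
    """Fuse the 2-D segmented-region feature consistency with the 3-D
    spatial position consistency into one cooperative localization
    confidence.  Both inputs are assumed normalized to [0, 1]; the
    weight w_feat is an illustrative choice."""
    return w_feat * feat_sim + (1.0 - w_feat) * pos_sim
```

The candidate position maximizing this score across the main key frames would then be reported as the spatial position of the target.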
In one embodiment of the invention, a robot target identification and localization system based on RGB-D video is disclosed, the system comprising:
an obtaining module for obtaining an RGB-D video frame sequence of the scene in which the target to be identified and localized is located;
a filtering and screening module for extracting key video frames from the RGB-D video frame sequence, extracting target candidate areas from the key video frames, and filtering and screening the target candidate areas according to the depth information corresponding to each key video frame;
a confidence ranking module for identifying the filtered and screened target candidate areas with a deep network, and ranking the target identification results by confidence through long-sequence spatio-temporal correlation constraints and multi-frame identification consistency estimation;
an optimization module for performing local fast segmentation on the filtered and screened target candidate areas, selecting main key video frames from the key video frames according to the confidence of the target identification results and the temporal interval relationship of each key video frame, and extending and cooperatively optimizing the segmented regions over preceding and succeeding adjacent frames;
a localization module for determining key feature points in the scene as localization reference points, thereupon estimating the camera view angle and camera motion, applying target-feature consistency constraints and target-position consistency constraints to the identification and segmentation results of the main key video frames, estimating the cooperative confidence of the target to be identified and localized, and performing accurate spatial localization.
Wherein, for the specific implementation of each module, reference may be made to the description of the method embodiments, which will not be repeated in this embodiment of the present invention.
As will be readily appreciated by those skilled in the art, the foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit it; any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention shall all fall within the protection scope of the present invention.

Claims (4)

1. A robot target identification and localization method based on RGB-D video, characterized by comprising:
(1) obtaining an RGB-D video frame sequence of the scene in which the target to be identified and localized is located;
(2) extracting key video frames from the RGB-D video frame sequence, extracting target candidate areas from the key video frames, and filtering and screening the target candidate areas according to the depth information corresponding to each key video frame;
(3.1) feeding the target candidate areas screened in step (2) into a trained target identification deep network to obtain the target identification prediction result of the key video frame corresponding to each screened target candidate area and the first confidence of each target identification prediction result;
(3.2) according to the long-sequence spatio-temporal correlation constraint, performing feature consistency evaluation on the target identification prediction results of the key video frames, evaluating the second confidence of each target identification prediction result, ranking the accumulated confidences obtained from the first confidence and the second confidence, and further filtering out the target candidate areas whose accumulated confidence is lower than a preset confidence threshold;
(4.1) performing a fast target segmentation operation on the target candidate areas obtained in step (3.2) and their extended neighborhoods to obtain an initial segmentation of the target and determine the target boundary;
(4.2) taking short-term spatio-temporal consistency as a constraint and, based on the accumulated-confidence ranking results of step (3.2), screening out the main key video frames from the key video frames;
(4.3) taking long-term spatio-temporal consistency as a constraint, modeling the appearance of the target to be identified and localized based on the initial segmentation of step (4.1), constructing a three-dimensional graph over the main key video frames and their adjacent frames, designing a maximum-a-posteriori Markov random field energy function, optimizing the initial segmentation with a graph-cut algorithm, and extending and optimizing the single-frame target segmentation result over the frames adjacent to the key video frames;
(5) determining key feature points in the scene as localization reference points, thereupon estimating the camera view angle and camera motion, applying target-feature consistency constraints and target-position consistency constraints to the identification and segmentation results of the main key video frames, estimating the cooperative confidence of the target to be identified and localized, and performing accurate spatial localization.
2. The method according to claim 1, characterized in that step (2) specifically comprises:
(2.1) determining, by interval sampling or a key frame extraction method, the key video frames used for identifying the target to be identified and located;
(2.2) forming a target candidate region set from the target candidate regions obtained in the key video frames by a confidence ranking method based on an objectness prior; obtaining the hierarchical attributes inside each target candidate region and in its neighborhood from the depth information corresponding to each key video frame; and on that basis screening, optimizing, and re-ranking the target candidate region set.
3. The method according to claim 1, characterized in that step (5) specifically comprises:
(5.1) for the primary key video frames obtained in step (4.2), extracting multiple groups of corresponding point pairs as positioning reference points, according to the adjacency and field-of-view overlap relations between the primary key video frames;
(5.2) estimating the change of camera perspective from the primary key video frames with overlapping fields of view, and then estimating the camera motion information by geometric relations, using the depth information of the positioning reference point pairs;
(5.3) evaluating the spatial position consistency of the target to be identified and located in the primary key video frames, according to its measured depth information in those frames, the camera perspective, and the camera motion information;
(5.4) evaluating, according to the results of step (4.3), the feature consistency of the two-dimensional segmentation regions of the target to be identified and located;
(5.5) determining the spatial position of the target to be identified and located by jointly evaluating the feature consistency and the spatial position consistency of its two-dimensional segmentation regions.
4. A robot target identification and positioning system based on RGB-D video, characterized by comprising:
an obtaining module, for obtaining an RGB-D video frame sequence of the scene containing the target to be identified and located;
a filtering and screening module, for extracting key video frames from the RGB-D video frame sequence, extracting target candidate regions from the key video frames, and filtering and screening the target candidate regions according to the depth information corresponding to each key video frame;
a confidence ranking module, for recognizing the filtered and screened target candidate regions with a deep network, and ranking the target recognition results by confidence through long-term spatio-temporal correlation constraints and multi-frame recognition consistency estimation;
an optimization module, for performing local fast segmentation on the filtered and screened target candidate regions, selecting primary key video frames from the key video frames according to the confidence of the target recognition results and the temporal interval relations of the key video frames, and extending and cooperatively optimizing the segmented regions over preceding and following adjacent frames; wherein the optimization module performs a fast target segmentation operation on the target candidate regions and their extended neighborhoods to obtain an initial segmentation of the target and determine the target boundary; with short-term spatio-temporal consistency as a constraint, and based on the cumulative confidence ranking results, selects primary key video frames from the key video frames; with long-term spatio-temporal consistency as a constraint, and based on the initial segmentation, performs appearance modeling of the target to be identified and located, constructs a three-dimensional graph over each primary key video frame and its adjacent frames, designs a maximum a posteriori probability-Markov random field energy function, optimizes the initial segmentation by a graph cut algorithm, and extends and refines the single-frame target segmentation results in the frames preceding and following the key video frames;
a locating module, for determining key feature points in the scene as positioning reference points, then estimating the camera perspective and a camera motion estimate, applying a target feature consistency constraint and a target position consistency constraint to the recognition and segmentation results of the primary key video frames, estimating the collaborative confidence of the target to be identified and located, and performing accurate spatial positioning.
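The cumulative confidence ranking of steps (3.1)-(3.2) can be sketched as follows. The claims only state that the cumulative confidence is "obtained by" the first and second confidences; the product combination, the function name, and the sample region labels below are illustrative assumptions.

```python
def rank_by_cumulative_confidence(predictions, threshold):
    """predictions: list of (region_id, first_conf, second_conf), where
    first_conf comes from the recognition deep network and second_conf
    from the long-term feature consistency evaluation.
    Combines the two confidences into a cumulative score (here their
    product, an illustrative choice), sorts descending, and discards
    regions whose cumulative confidence is below the preset threshold."""
    scored = [(region_id, c1 * c2) for region_id, c1, c2 in predictions]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(region_id, s) for region_id, s in scored if s >= threshold]
```

For example, a candidate with a high recognition score but poor cross-frame consistency (such as a shadow misdetected in a single frame) falls below the threshold and is filtered out, while consistently recognized regions survive.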
CN201710078328.9A 2017-02-14 2017-02-14 Robot target identification and localization method and system based on RGB-D video Active CN106920250B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710078328.9A CN106920250B (en) 2017-02-14 2017-02-14 Robot target identification and localization method and system based on RGB-D video


Publications (2)

Publication Number Publication Date
CN106920250A CN106920250A (en) 2017-07-04
CN106920250B true CN106920250B (en) 2019-08-13

Family

ID=59453597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710078328.9A Active CN106920250B (en) 2017-02-14 2017-02-14 Robot target identification and localization method and system based on RGB-D video

Country Status (1)

Country Link
CN (1) CN106920250B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108214487B (en) * 2017-12-16 2021-07-20 广西电网有限责任公司电力科学研究院 Robot target positioning and grabbing method based on binocular vision and laser radar
CN109977981B (en) * 2017-12-27 2020-11-24 深圳市优必选科技有限公司 Scene analysis method based on binocular vision, robot and storage device
CN108304808B (en) * 2018-02-06 2021-08-17 广东顺德西安交通大学研究院 Monitoring video object detection method based on temporal-spatial information and deep network
CN108627816A (en) * 2018-02-28 2018-10-09 沈阳上博智像科技有限公司 Image distance measuring method, device, storage medium and electronic equipment
CN108460790A (en) * 2018-03-29 2018-08-28 西南科技大学 A kind of visual tracking method based on consistency fallout predictor model
CN108981698B (en) * 2018-05-29 2020-07-14 杭州视氪科技有限公司 Visual positioning method based on multi-mode data
CN110675421B (en) * 2019-08-30 2022-03-15 电子科技大学 Depth image collaborative segmentation method based on few labeling frames
CN115091472B (en) * 2022-08-26 2022-11-22 珠海市南特金属科技股份有限公司 Target positioning method based on artificial intelligence and clamping manipulator control system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598890A (en) * 2015-01-30 2015-05-06 南京邮电大学 Human body behavior recognizing method based on RGB-D video
CN104867161A (en) * 2015-05-14 2015-08-26 国家电网公司 Video-processing method and device
CN105589974A (en) * 2016-02-04 2016-05-18 通号通信信息集团有限公司 Surveillance video retrieval method and system based on Hadoop platform
CN105931270A (en) * 2016-04-27 2016-09-07 石家庄铁道大学 Video keyframe extraction method based on movement trajectory analysis

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110007806A (en) * 2009-07-17 2011-01-25 삼성전자주식회사 Apparatus and method for detecting hand motion using a camera
US20160132754A1 (en) * 2012-05-25 2016-05-12 The Johns Hopkins University Integrated real-time tracking system for normal and anomaly tracking and the methods therefor


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Battlefield Video Target Mining; Zhongwei Guo et al.; International Congress on Image & Signal Processing; 30 Nov. 2010; full text
Confidence-driven infrared target detection; Zhang Zhiguo, Liu Liman, Tao Wenbing et al.; Infrared Physics & Technology; May 2014; full text

Also Published As

Publication number Publication date
CN106920250A (en) 2017-07-04

Similar Documents

Publication Publication Date Title
CN106920250B (en) Robot target identification and localization method and system based on RGB-D video
CN107093171B (en) Image processing method, device and system
CN104715471B (en) Target locating method and its device
Čech et al. Scene flow estimation by growing correspondence seeds
CN109544456A (en) The panorama environment perception method merged based on two dimensional image and three dimensional point cloud
CN110570457B (en) Three-dimensional object detection and tracking method based on stream data
CN104517095B (en) A kind of number of people dividing method based on depth image
CN103458261B (en) Video scene variation detection method based on stereoscopic vision
KR101139389B1 (en) Video Analysing Apparatus and Method Using Stereo Cameras
CN104156937A (en) Shadow detection method and device
CN107560592A (en) A kind of precision ranging method for optronic tracker linkage target
TWI686748B (en) People-flow analysis system and people-flow analysis method
CN106295640A (en) The object identification method of a kind of intelligent terminal and device
KR20160109761A (en) Method and System for Recognition/Tracking Construction Equipment and Workers Using Construction-Site-Customized Image Processing
US20210304435A1 (en) Multi-view positioning using reflections
US20170147609A1 (en) Method for analyzing and searching 3d models
CN112017188B (en) Space non-cooperative target semantic recognition and reconstruction method
CN110415297A (en) Localization method, device and unmanned equipment
CN109409250A (en) A kind of across the video camera pedestrian of no overlap ken recognition methods again based on deep learning
CN108399630B (en) Method for quickly measuring distance of target in region of interest in complex scene
CN103679699A (en) Stereo matching method based on translation and combined measurement of salient images
CN112581495A (en) Image processing method, device, equipment and storage medium
CN114022531A (en) Image processing method, electronic device, and storage medium
CN104504162B (en) A kind of video retrieval method based on robot vision platform
Ibisch et al. Arbitrary object localization and tracking via multiple-camera surveillance system embedded in a parking garage

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant