CN107833236A - Vision positioning system and method combining semantics in a dynamic environment - Google Patents


Info

Publication number
CN107833236A
CN107833236A
Authority
CN
China
Prior art keywords
dynamic
module
characteristic
threshold value
detection
Prior art date
Legal status
Granted
Application number
CN201711040037.7A
Other languages
Chinese (zh)
Other versions
CN107833236B (en)
Inventor
王金戈
邹旭东
仇晓松
曹天扬
蔡浩原
李彤
Current Assignee
Institute of Electronics of CAS
Original Assignee
Institute of Electronics of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Electronics of CAS
Priority to CN201711040037.7A
Publication of CN107833236A
Application granted
Publication of CN107833236B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33 Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras


Abstract

The invention discloses a monocular vision positioning system and method for a dynamic environment that uses semantic information to reject the features of dynamic objects. Environment images are captured in real time by a monocular camera; object detection is performed on each image by a convolutional neural network to obtain semantic information about the objects, and dynamic objects are then identified with the help of prior knowledge. Features are extracted from the image with the ORB algorithm, and the feature points of dynamic objects are rejected according to the objects' positions. Local bundle adjustment of the camera pose and the 3D point coordinates is performed by nonlinear optimization, which eliminates the influence of dynamic-object feature points and improves positioning accuracy.

Description

Vision positioning system and method combining semantics in a dynamic environment
Technical field
The present invention relates to the fields of computer vision and mobile robot localization, and in particular to a vision positioning system and method combining semantics in a dynamic environment.
Background technology
Simultaneous Localization And Mapping (SLAM) is a method of building a map of the environment from sensor information while determining the sensor's own pose. When the sensor is a camera, it is called visual SLAM. Building a real-time, accurate and robust SLAM system is of great significance for localizing devices such as robots and unmanned vehicles, and is the basis for navigation and autonomous movement.
Traditional SLAM techniques were developed for static environments and do not consider the motion of objects in the environment. In real environments, however, walking pedestrians and passing vehicles constantly change the scene, so the map built by a SLAM system cannot remain consistent for long; vision-based features become unstable as objects move, and the robustness of SLAM in dynamic environments urgently needs to be improved.
For SLAM to work normally in a dynamic environment, feature points on dynamic objects must be avoided, so the positions of dynamic objects must be computed in advance. The dynamic-object extraction methods in common use are all based on geometric features, and they still fail in more extreme dynamic environments, for example when a person walks close to the camera lens.
A current vision positioning method for dynamic environments is described below, taking the scene flow method as an example; its flow chart is shown in Fig. 1.
The method captures environment images in real time with a binocular camera and extracts feature points with a feature extraction algorithm, performing stereo matching over the four images captured by the binocular camera at two consecutive instants. The three-dimensional information of the feature points is recovered using two-view geometry, and matching accuracy is improved by circular matching. Feature points on dynamic objects are rejected with the scene flow method, and a covariance matrix is computed to account for possible error sources and improve the method's effectiveness. The robot's motion parameters are then obtained from the feature point positions by Gauss-Newton iteration, and the RANSAC algorithm further improves positioning accuracy. The whole process iterates continuously, computing the robot's pose and position in real time.
In the course of realizing the present invention, the applicant found the following deficiencies in the above prior art:
(1) The Mahalanobis distance used to decide which dynamic-object feature points to delete is computed from the scene flow error model with a fixed threshold; a fixed threshold increases the error for objects with different motion patterns and speeds and cannot properly determine an object's dynamic character.
(2) The scene flow error model assumes a static background, so it can only detect small moving regions and cannot handle the case of a large moving object appearing in the field of view.
(3) Only objects that move between consecutive frames are recognized as dynamic; the intrinsic dynamic character of an object is not considered. A person moving in front of the camera, for example, should still be treated as a dynamic object and rejected even while momentarily standing still.
The content of the invention
(1) technical problems to be solved
In view of this, the present invention proposes a vision positioning system and method combining semantics in a dynamic environment to solve the above problems.
(2) technical scheme
A monocular vision positioning system for a dynamic environment, comprising: an object detection module for detecting the class and position of each object in an input image and outputting a detection result; a dynamic object judgment module for receiving the detection result, determining from the object class whether each object is a dynamic object or a static object, and outputting a determination result; and a localization and mapping module for receiving the determination result and rejecting the dynamic objects from the image.
In some exemplary embodiments of the invention, the dynamic object judgment module comprises: a prior knowledge module, containing a dynamic object decision model, for judging the dynamic-characteristic score of each object in the image; and a dynamic determination module for comparing the dynamic-characteristic score with a preset threshold, an object whose score is above the threshold being judged dynamic and an object whose score is below the threshold being judged static.
In some exemplary embodiments of the invention, the dynamic object decision model sets the dynamic-characteristic scores of various object classes according to prior knowledge.
In some exemplary embodiments of the invention, the system further comprises a missed-detection compensation module for detecting, from the position coordinates of each object in adjacent frames, whether any object was missed in the image.
In some exemplary embodiments of the invention, the object detection module detects the class of each object in the image with a classifier formed by a multilayer neural network; the multilayer neural network is an SSD object detection network built on the VGG16 base network, with the first 5 layers kept unchanged, the fc6 and fc7 layers converted into two convolutional layers, and three further convolutional layers and one average pooling layer added.
In some exemplary embodiments of the invention, the localization and mapping module comprises a tracking module, a local mapping module and a loop closing module. The tracking module extracts ORB feature points from the input image, classifies the feature points according to the determination result, rejects those on dynamic objects while keeping only those on static objects, and judges whether the input image should be added to the keyframe list as a keyframe. The local mapping module performs bundle adjustment on the keyframes and the map points they observe. The loop closing module eliminates the accumulated error of the localization and mapping module in large scenes.
A monocular vision positioning method in a dynamic environment, comprising: detecting the class and position coordinates of each object in the current frame image; determining from the object class whether each object is a dynamic object or a static object; and rejecting the dynamic objects from the current frame image.
In some exemplary embodiments of the invention, determining from the object class whether an object is dynamic or static further comprises: judging the dynamic-characteristic score of each object in the image according to prior knowledge; and comparing the dynamic-characteristic score with a preset threshold, an object whose score is above the threshold being judged dynamic and an object whose score is below the threshold being judged static.
In some exemplary embodiments of the invention, the method further comprises detecting whether any object was missed in the current frame image. The detection test is: if there exists an X1i such that |X1i - X0j| < v_threshold / FPS, the object was not missed; otherwise X0j is added to the detection result of the current frame as a missed object. Here X1i is the coordinate of any object in the current frame image, X0j is the coordinate of any object in the previous frame image, v_threshold is the threshold on the movement velocity of a dynamic object, and FPS is the frame rate.
In some exemplary embodiments of the invention, rejecting the dynamic objects from the current frame image further comprises: extracting ORB feature points from the current frame image; classifying the feature points according to the dynamic/static determination results; and rejecting the feature points on dynamic objects while keeping those on static objects.
(3) beneficial effect
(1) Dynamic objects are detected at the semantic level, regardless of whether they are moving at the current instant. Being dynamic is treated as an essential attribute of the object rather than its state at a particular moment, which effectively preserves the long-term consistency of robot localization.
(2) A missed-detection compensation step is added, improving the accuracy of object detection so that dynamic feature points can be rejected stably and effectively.
(3) Dynamic objects are detected at the semantic level with a deep convolutional neural network. Through multiple levels of pooling, image features at different scales are extracted, so the network can detect objects of different sizes, solving the problem that traditional methods fail to detect medium and large dynamic objects.
Brief description of the drawings
Fig. 1 is the flow chart of the existing scene flow method.
Fig. 2 is the flow chart of the vision positioning system combining semantics in a dynamic environment according to an embodiment of the present invention.
Fig. 3 is the detailed flow chart of each module of the vision positioning system combining semantics in a dynamic environment according to an embodiment of the present invention.
Fig. 4 shows the SSD network structure of an embodiment of the present invention.
Fig. 5 is a schematic diagram of the positions of familiar objects on the dynamic-characteristic scale according to an embodiment of the present invention.
Fig. 6 shows the flow chart of the monocular vision positioning method in a dynamic environment according to an embodiment of the present invention.
Embodiment
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
A first embodiment of the invention provides a monocular vision positioning system for a dynamic environment. Fig. 2 is a structural diagram of the monocular vision positioning system of this embodiment, which comprises an object detection module, a dynamic object judgment module, and a SLAM localization and mapping module. The object detection module detects the class and position of each object in the input image and outputs the detection result; the dynamic object judgment module receives the detection result, determines from the object class whether each object is dynamic or static, and outputs the determination result; the localization and mapping module receives the determination result, rejects the dynamic objects from the image, and keeps the static objects.
A monocular camera captures images in real time; each image serves as input to both the SLAM localization and mapping module and the object detection module. The output of the object detection module, after passing through the dynamic object judgment module, is fed back in real time to the SLAM localization and mapping module, which finally produces the localization and mapping results. The detailed processing inside each module is shown in Fig. 3.
The object detection module first receives the current frame as input and, through a classifier formed by a multilayer neural network, outputs the class and position coordinates of each detected object. The multilayer neural network is an SSD (Single Shot MultiBox Detector) object detection network; Fig. 4 shows the SSD network model. As shown in Fig. 4, the network uses VGG16 as its base network, keeps the first 5 layers unchanged, then converts the fc6 and fc7 layers into two convolutional layers using the atrous algorithm, and finally adds three further convolutional layers and one average pooling layer. Information from different network layers models image features at different scales, and the final detection result is obtained by non-maximum suppression. Because the initial candidate-box generation stage is dispensed with, the whole detection process is completed in a single network, achieving high detection efficiency (46 fps on a Titan X) and detection precision (77.2%).
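The non-maximum suppression step mentioned above can be illustrated with a minimal, self-contained sketch (this is a generic greedy NMS, not the patent's SSD implementation; the `(x1, y1, x2, y2)` box format and the IoU threshold are assumptions):

```python
def iou(a, b):
    # a, b: axis-aligned boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box in each group of overlapping detections."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # drop every remaining box that overlaps the kept one too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

Given two heavily overlapping detections and one distant one, the sketch keeps the higher-scoring of the pair plus the distant box.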
After the object detection result is obtained there is still the possibility of missed detections; therefore, in some embodiments of the invention, the system also includes a missed-detection compensation module for detecting, from the position coordinates of each object in adjacent frames, whether any object was missed. In SLAM under a dynamic environment, the success of dynamic-object detection directly determines whether the other modules of the system can run normally. Once a detection is missed, the large difference between two adjacent frames causes the number of feature points to change drastically, destabilizing the system. To reject dynamic feature points stably and effectively, the object detection step must reach a sufficiently high precision. In a conventional object detection task there is no obvious association between images, so contextual information cannot be used to improve detection precision. In SLAM, however, video frames arrive in time order, so the detection results of the preceding frames can be used to predict the next detection result and thus compensate for missed or false detections. The missed-detection compensation module contains a consecutive-frame compensation model based on one reasonable assumption: "the movement velocity of a dynamic object does not exceed some fixed value." Let X denote the coordinate of a dynamic object, v_threshold the threshold on dynamic-object movement velocity, and FPS the frame rate; then ΔX < v_threshold / FPS should hold between adjacent frames. When setting v_threshold, the value must be neither too small, which would make the system oversensitive and cause correct detections to be treated as missed, nor too large, which could make the detection regions of several dynamic objects overlap. If there exists an X1i such that |X1i - X0j| < v_threshold / FPS, no detection is considered missed; otherwise a missed detection is assumed and X0j is added to the detection result of the current frame as a missed object. Finally, the corrected detection list serves as the input data of the dynamic object judgment module.
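The consecutive-frame compensation rule can be sketched in a few lines (the class/coordinate tuple format is an assumption for illustration; the patent specifies only the |ΔX| < v_threshold / FPS criterion):

```python
import math

def compensate_missed(prev_dets, curr_dets, v_threshold, fps):
    """Carry a previous-frame detection forward when no current detection
    lies within the distance a dynamic object could travel in one frame."""
    max_step = v_threshold / fps  # per-frame motion bound: v_threshold / FPS
    out = list(curr_dets)
    for cls0, x0 in prev_dets:
        # the patent's test: exists X1i with |X1i - X0j| < v_threshold / FPS
        if not any(math.dist(x0, x1) < max_step for _, x1 in curr_dets):
            out.append((cls0, x0))  # assume a missed detection; reuse old box
    return out
```

With a person that moved slightly and a car that disappeared from the detections, the car is re-inserted at its previous coordinate.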
The dynamic object judgment module receives the detection result output by the object detection module, obtains each object's dynamic-characteristic score from its class, and judges from the score whether the object is dynamic or static. For the class and position coordinates of each detected object, a dynamic-object judgment is made with the help of prior knowledge, so that the dynamic objects among them can be extracted.
The dynamic object judgment module specifically comprises a prior knowledge module and a dynamic determination module. The prior knowledge module contains a dynamic object decision model for judging the dynamic-characteristic score of each object in the image. The dynamic determination module compares an object's dynamic-characteristic score with a preset threshold: an object whose score is above the threshold is judged dynamic, and an object whose score is below the threshold is judged static. The dynamic object decision model sets the dynamic-characteristic scores of various object classes according to prior knowledge.
The present invention proposes a dynamic-object decision method based on prior knowledge at the semantic level. The semantics of environmental objects are the interpretation people give the environment based on experience. A person in an unfamiliar environment is not actually ignorant of the surroundings: using prior knowledge, the scenery can be divided into buildings, vegetation, vehicles, pedestrians and so on; moving objects such as vehicles and pedestrians are ignored automatically, while static objects such as buildings and vegetation are remembered. This is the human talent for handling dynamic environments. If a SLAM system does not understand its surroundings at the semantic level, it cannot truly distinguish what is dynamic from what is static; it can only find objects that moved within a short time window and cannot guarantee long-term consistency. Therefore the object detection results are combined with prior knowledge into a dynamic object decision model. According to human prior knowledge, the dynamic character of each object class is scored: 0 means a static object and 10 a dynamic object, with familiar objects placed on this scale roughly as shown in Fig. 5. An object's score is compared with a predefined threshold: a score above the threshold is judged dynamic, and a score below the threshold is judged static. The threshold is set empirically and is typically 5.
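The score-and-threshold decision can be sketched as follows (the per-class scores are illustrative assumptions in the spirit of Fig. 5; the patent fixes only the 0-10 scale and the typical threshold of 5):

```python
# Illustrative prior scores on the patent's 0-10 dynamic-characteristic scale.
# The exact values per class are assumptions, not taken from the patent.
DYNAMIC_SCORES = {
    "building": 0,
    "vegetation": 1,
    "chair": 3,
    "bicycle": 7,
    "car": 8,
    "person": 10,
}

def is_dynamic(object_class, threshold=5):
    """Judge an object dynamic when its prior score is above the threshold.
    Unknown classes default to 0 (static), also an assumption."""
    return DYNAMIC_SCORES.get(object_class, 0) > threshold
```

So a detected person or car is rejected as dynamic while a building or vegetation is kept as static, independent of whether the object is currently moving.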
As shown in Fig. 3, the SLAM localization and mapping module is divided into three parts: a tracking module (Tracking), a local mapping module (Local Mapping) and a loop closing module (Loop Closing).
The tracking module can be regarded as a front-end visual odometer based on ORB features. It first extracts ORB feature points from the input image and computes their descriptors, then classifies the feature points according to the determination result of the dynamic object judgment module, rejecting those on dynamic objects and keeping only those on static objects. In the subsequent tracking step, the ORB descriptors are matched against the previous keyframe, the camera pose is estimated with the bundle adjustment method, the map point positions are estimated, and a local map is built. Finally, the size of the overlapping region determines whether the current frame is added to the keyframe list as a keyframe.
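The feature-rejection step can be sketched as a simple point-in-box filter (keypoint and box formats are assumptions; a real system would operate on OpenCV ORB keypoints and the detector's output boxes):

```python
def point_in_box(pt, box):
    # pt: (x, y) pixel coordinate; box: (x1, y1, x2, y2)
    x, y = pt
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def reject_dynamic_features(keypoints, dynamic_boxes):
    """Keep only feature points lying outside every dynamic-object box."""
    return [
        kp for kp in keypoints
        if not any(point_in_box(kp, box) for box in dynamic_boxes)
    ]
```

Feature points falling inside a dynamic object's bounding box are dropped before pose estimation; with no dynamic boxes, all points survive.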
The local mapping module optimizes the poses computed by the tracking module and builds the map of 3D points. Each new keyframe is first inserted into the map, and the feature points on it are triangulated to obtain 3D map points. The poses of several keyframes in the local area and the 3D map points they observe are then refined by local bundle adjustment (local BA), minimizing the reprojection error, under the camera poses, of all 3D points observable in the keyframes. Finally the optimized keyframes are analyzed, and a keyframe is removed if it is too close to another or its parallax is too small.
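The reprojection error that local BA minimizes can be written out for a single observation under a pinhole camera model (a minimal sketch with assumed intrinsics; the optimizer itself is out of scope here):

```python
import math

def project(point3d, R, t, fx, fy, cx, cy):
    """Project a 3D world point into the image under pose (R, t), pinhole model."""
    # camera-frame coordinates: X = R * p + t
    X = [sum(R[i][j] * point3d[j] for j in range(3)) + t[i] for i in range(3)]
    return (fx * X[0] / X[2] + cx, fy * X[1] / X[2] + cy)

def reprojection_error(point3d, observed, R, t, fx, fy, cx, cy):
    """Pixel distance between the observed feature and the reprojected point."""
    u, v = project(point3d, R, t, fx, fy, cx, cy)
    return math.hypot(u - observed[0], v - observed[1])
```

Bundle adjustment searches over the poses (R, t) and 3D points to minimize the sum of these errors over all observations in the local keyframes.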
The loop closing module eliminates the accumulated error of SLAM in large scenes. It includes loop detection and loop correction. Using a visual bag-of-words model, the features of the current frame are compared with those of all keyframes; if the Hamming distance between descriptors is below some threshold, a loop is considered found. The connectivity of the local pose graph is then changed, and the system further reduces its accumulated error through one pose-graph optimization.
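The descriptor comparison used in loop detection is a Hamming distance over binary ORB descriptors; a minimal sketch (the byte-string representation and the distance threshold are assumptions):

```python
def hamming_distance(desc_a: bytes, desc_b: bytes) -> int:
    """Number of differing bits between two equal-length binary descriptors."""
    return sum(bin(a ^ b).count("1") for a, b in zip(desc_a, desc_b))

def is_loop_candidate(desc_a: bytes, desc_b: bytes, max_distance: int = 64) -> bool:
    """Flag a loop-closure candidate when the descriptors are close enough.
    ORB descriptors are 256 bits (32 bytes); the threshold here is assumed."""
    return hamming_distance(desc_a, desc_b) < max_distance
```

Because ORB descriptors are binary, this bitwise comparison is far cheaper than the Euclidean distances used for float descriptors, which is what makes exhaustive comparison against all keyframes feasible.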
A second embodiment of the invention provides a monocular vision positioning method in a dynamic environment. Fig. 6 is the flow chart of the monocular vision positioning method of this embodiment, which includes:
Step S1: Detect the class and position coordinates of each object in the current frame image.
A monocular camera captures images in real time, and each image serves as the input image. In this embodiment, the objects in the current frame are detected by a classifier formed by a multilayer neural network, which outputs the class and position coordinates of each detected object. The multilayer neural network is an SSD (Single Shot MultiBox Detector) object detection network; Fig. 4 shows the SSD network model. As shown in Fig. 4, the network uses VGG16 as its base network, keeps the first 5 layers unchanged, then converts the fc6 and fc7 layers into two convolutional layers using the atrous algorithm, and finally adds three further convolutional layers and one average pooling layer. Information from different network layers models image features at different scales, and the final detection result is obtained by non-maximum suppression. Because the initial candidate-box generation stage is dispensed with, the whole detection process is completed in a single network, achieving high detection efficiency (46 fps on a Titan X) and detection precision (77.2%).
In some embodiments of the invention, the method further includes the step of detecting whether any object was missed in the current frame image. The detection process is as follows:
(1) The current frame K1 enters the SSD network, which outputs a list of detected objects; each entry in the list contains the class and position coordinate X1i of a detected object (0 < i < n1, where n1 is the number of detection results for K1).
(2) For each entry X0j in the detection result of the previous frame K0 (0 < j < n0, where n0 is the number of detection results for K0): if there exists in the current frame's detection result an X1i such that |X1i - X0j| < v_threshold / FPS, no detection is considered missed; if no such X1i exists, a missed detection is assumed and X0j must be added to the detection result list of the current frame.
Step S2: Determine from the object class whether each object is a dynamic object or a static object.
A dynamic-object judgment is made, with the help of prior knowledge, for each object detected in step S1, so that the dynamic objects among them can be extracted. Step S2 further comprises the following sub-steps:
Sub-step S21: judge the dynamic-characteristic score of each object in the image according to prior knowledge;
Sub-step S22: compare the dynamic-characteristic score with a preset threshold; an object whose score is above the threshold is judged dynamic, and an object whose score is below the threshold is judged static.
According to human prior knowledge, the dynamic character of each object class is scored: 0 means a static object and 10 a dynamic object. An object's score is compared with a predefined threshold: a score above the threshold is judged dynamic, and a score below the threshold is judged static. The threshold is set empirically and is typically 5.
Step S3: Reject the dynamic objects from the current frame image.
Step S3 further comprises the following sub-steps:
Sub-step S31: extract ORB feature points from the current frame image;
Sub-step S32: classify the feature points according to the dynamic/static determination results;
Sub-step S33: reject the feature points on dynamic objects and keep those on static objects.
In some embodiments of the invention, the monocular vision positioning method in a dynamic environment further includes:
Step S4: Judge whether the input image should be added to the keyframe list as a keyframe.
The ORB descriptors are matched against the previous keyframe, the camera pose is estimated with the bundle adjustment method, the map point positions are estimated, and a local map is built. Finally, the size of the overlapping region determines whether the current frame is added to the keyframe list as a keyframe.
Step S5: Perform bundle adjustment using the keyframes and the map points they observe.
Bundle adjustment is performed using the keyframes near the current frame and the 3D map points, minimizing the reprojection error, under the camera poses, of all 3D points observable in the keyframes.
Step S6: Eliminate the accumulated error of localization and mapping in large scenes.
The accumulated error of SLAM in large scenes is eliminated as follows: using a visual bag-of-words model, the features of the current frame are compared with those of all keyframes; if the Hamming distance between descriptors is below some threshold, a loop is considered found. The connectivity of the local pose graph is then changed, and the system further reduces its accumulated error through one pose-graph optimization.
The specific embodiments above further explain the objects, technical solutions and beneficial effects of the present invention in detail. It should be understood that they are only specific embodiments of the present invention and are not intended to limit it; any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A monocular vision positioning system for a dynamic environment, comprising:
an object detection module for detecting the class and position of each object in an input image and outputting a detection result;
a dynamic object judgment module for receiving the detection result, determining from the object class whether each object is a dynamic object or a static object, and outputting a determination result; and
a localization and mapping module for receiving the determination result and rejecting the dynamic objects from the image.
2. The monocular vision positioning system in a dynamic environment according to claim 1, wherein the dynamic object judgment module comprises:
a prior knowledge module, containing a dynamic object decision model, for judging the dynamic-characteristic score of each object in the image; and
a dynamic determination module for comparing the dynamic-characteristic score with a preset threshold, an object whose score is above the threshold being judged a dynamic object and an object whose score is below the threshold being judged a static object.
3. The monocular visual positioning system in a dynamic environment according to claim 2, wherein the dynamic object decision model is configured to set the dynamic characteristic scores of various object categories in combination with prior knowledge.
4. The monocular visual positioning system in a dynamic environment according to claim 1, further comprising a missed-detection compensation module, configured to detect, from the position coordinates of each object in two adjacent frames, whether any object has been missed in the image.
5. The monocular visual positioning system in a dynamic environment according to claim 1, wherein the object detection module is configured to detect the category of each object in the image using a classifier formed by a multilayer neural network;
the multilayer neural network is an SSD object detection network built on the VGG16 base network, in which the first five layers are kept unchanged, the fc6 and fc7 layers are converted into two convolutional layers, and three convolutional layers and one average pooling layer are added.
6. The monocular visual positioning system in a dynamic environment according to claim 1, wherein the positioning and mapping module comprises a tracking module, a mapping module, and a loop closure detection module;
the tracking module is configured to extract ORB feature points from the input image, classify the feature points according to the determination result, discard the feature points on dynamic objects, retain only the feature points on static objects, and decide whether the input image should be added to the keyframe list as a keyframe;
the mapping module is configured to perform bundle adjustment optimization on the keyframes and on the map points observed by the keyframes;
the loop closure detection module is configured to eliminate the accumulated error of the positioning and mapping module in large-scale scenes.
7. A monocular visual positioning method in a dynamic environment, comprising:
detecting the category and position coordinates of objects in the current frame image;
classifying each object as a dynamic object or a static object according to its category;
removing the dynamic objects from the current frame image.
8. The monocular visual positioning method in a dynamic environment according to claim 7, wherein classifying each object as a dynamic object or a static object according to its category further comprises:
determining, from prior knowledge, the dynamic characteristic score corresponding to each object in the image;
comparing the dynamic characteristic score with a preset threshold, determining objects whose dynamic characteristic score is above the threshold to be dynamic objects, and determining objects whose dynamic characteristic score is below the threshold to be static objects.
9. The monocular visual positioning method in a dynamic environment according to claim 7, further comprising: detecting whether any object has been missed in the current frame image;
wherein the detection criterion is: if there exists an object X1i in the current frame such that |X1i − X0j| < v_threshold / FPS, then X0j has not been missed; otherwise, X0j is added to the detection result of the current frame as a missed object. Here, X1i is the coordinate of any object in the current frame image, X0j is the coordinate of any object in the previous frame image, v_threshold is the movement speed threshold of a dynamic object, and FPS is the frame rate.
10. The monocular visual positioning method in a dynamic environment according to claim 7, wherein removing the dynamic objects from the current frame image further comprises:
extracting ORB feature points from the current frame image;
classifying the feature points according to the determination result of dynamic and static objects;
discarding the feature points on dynamic objects and retaining the feature points on static objects.
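The missed-detection check described in claim 9 can be illustrated with a short sketch (an illustrative reading only: the 2-D coordinates, Euclidean distance, and the sample values of v_threshold and fps are assumptions chosen for the example, not values fixed by the patent):

```python
import math

def compensate_missed_detections(prev_objs, curr_objs, v_threshold=2.0, fps=30.0):
    """Carry over previous-frame objects that have no nearby match in the
    current frame. An object can move at most v_threshold / fps between
    consecutive frames, so a previous-frame object with no current detection
    inside that radius is treated as a missed detection."""
    max_step = v_threshold / fps          # maximum per-frame displacement
    result = list(curr_objs)
    for x0 in prev_objs:
        if not any(math.dist(x0, x1) < max_step for x1 in curr_objs):
            result.append(x0)             # X0j added as a missed-detection object
    return result

prev = [(0.0, 0.0), (5.0, 5.0)]
curr = [(0.01, 0.0)]                      # (5.0, 5.0) vanished -> missed detection
print(compensate_missed_detections(prev, curr))  # -> [(0.01, 0.0), (5.0, 5.0)]
```

The design choice is simple nearest-neighbor gating: because a dynamic object's speed is bounded, any detection within v_threshold / fps of a previous-frame position is accepted as the same object, and everything unmatched is carried over.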
CN201711040037.7A 2017-10-31 2017-10-31 Visual positioning system and method combining semantics under dynamic environment Active CN107833236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711040037.7A CN107833236B (en) 2017-10-31 2017-10-31 Visual positioning system and method combining semantics under dynamic environment

Publications (2)

Publication Number Publication Date
CN107833236A true CN107833236A (en) 2018-03-23
CN107833236B CN107833236B (en) 2020-06-26

Family

ID=61650162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711040037.7A Active CN107833236B (en) 2017-10-31 2017-10-31 Visual positioning system and method combining semantics under dynamic environment

Country Status (1)

Country Link
CN (1) CN107833236B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103149939A (en) * 2013-02-26 2013-06-12 北京航空航天大学 Dynamic target tracking and positioning method of unmanned plane based on vision
CN104330090A (en) * 2014-10-23 2015-02-04 北京化工大学 Distributed-representation intelligent semantic map building method for robots
US20150235447A1 (en) * 2013-07-12 2015-08-20 Magic Leap, Inc. Method and system for generating map data from an image
CN106210450A (en) * 2016-07-20 2016-12-07 罗轶 Video display artificial intelligence based on SLAM
CN107063258A (en) * 2017-03-07 2017-08-18 重庆邮电大学 Mobile robot indoor navigation method based on semantic information

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LAURA SEVILLA-LARA ET AL.: "Optical Flow with Semantic Segmentation and Localized Layers", 2016 IEEE Conference on Computer Vision and Pattern Recognition *
RAÚL MUR-ARTAL ET AL.: "ORB-SLAM: A Versatile and Accurate Monocular SLAM System", IEEE Transactions on Robotics *
WEI LIU ET AL.: "SSD: Single Shot MultiBox Detector", arXiv *

Cited By (50)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596974B (en) * 2018-04-04 2020-08-04 清华大学 Dynamic scene robot positioning and mapping system and method
CN108596974A (en) * 2018-04-04 2018-09-28 清华大学 Dynamic scene robot positioning and mapping system and method
CN108921893A (en) * 2018-04-24 2018-11-30 华南理工大学 Image cloud computing method and system based on online deep learning SLAM
CN108921893B (en) * 2018-04-24 2022-03-25 华南理工大学 Image cloud computing method and system based on online deep learning SLAM
US10518414B1 (en) 2018-06-08 2019-12-31 Ankobot (Shenzhen) Smart Technologies Co., Ltd. Navigation method, navigation system, movement control system and mobile robot
CN109074083A (en) * 2018-06-08 2018-12-21 珊口(深圳)智能科技有限公司 Movement control method, mobile robot and computer storage medium
US11130238B2 (en) 2018-06-08 2021-09-28 Ankobot (Shenzhen) Smart Technologies Co., Ltd. Mobile control method, mobile robot and computer storage medium
CN109074083B (en) * 2018-06-08 2022-02-18 珊口(深圳)智能科技有限公司 Movement control method, mobile robot, and computer storage medium
CN108885459A (en) * 2018-06-08 2018-11-23 珊口(深圳)智能科技有限公司 Navigation method, navigation system, movement control system and mobile robot
CN109141395A (en) * 2018-07-10 2019-01-04 深圳市沃特沃德股份有限公司 Sweeper positioning method and device based on a visual loop-closure-calibrated gyroscope
CN109141395B (en) * 2018-07-10 2020-06-09 深圳市无限动力发展有限公司 Sweeper positioning method and device based on visual loopback calibration gyroscope
CN109034237B (en) * 2018-07-20 2021-09-17 杭州电子科技大学 Loop detection method based on convolutional neural network signposts and sequence search
CN109034237A (en) * 2018-07-20 2018-12-18 杭州电子科技大学 Loop closure detection method based on convolutional neural network landmarks and sequence search
CN110163914A (en) * 2018-08-01 2019-08-23 京东方科技集团股份有限公司 Vision-based positioning
CN110163914B (en) * 2018-08-01 2021-05-25 京东方科技集团股份有限公司 Vision-based positioning
CN109186586B (en) * 2018-08-23 2022-03-18 北京理工大学 Method for constructing simultaneous positioning and mixed map facing dynamic parking environment
CN109186586A (en) * 2018-08-23 2019-01-11 北京理工大学 Simultaneous localization and hybrid map construction method for dynamic parking environments
WO2020043081A1 (en) * 2018-08-28 2020-03-05 北京三快在线科技有限公司 Positioning technique
CN109583329A (en) * 2018-11-13 2019-04-05 杭州电子科技大学 Loop closure detection method based on road semantic landmark screening
CN111256693A (en) * 2018-12-03 2020-06-09 北京初速度科技有限公司 Pose change calculation method and vehicle-mounted terminal
CN109670423A (en) * 2018-12-05 2019-04-23 依通(北京)科技有限公司 Image recognition system, method and medium based on deep learning
CN109766769A (en) * 2018-12-18 2019-05-17 四川大学 Road target detection and recognition method based on monocular vision and deep learning
CN109711365A (en) * 2018-12-29 2019-05-03 佛山科学技术学院 Visual SLAM loop closure detection method and device fusing semantic information
CN111754388A (en) * 2019-03-28 2020-10-09 北京初速度科技有限公司 Map construction method and vehicle-mounted terminal
CN110084850A (en) * 2019-04-04 2019-08-02 东南大学 Dynamic scene visual positioning method based on image semantic segmentation
CN112001968B (en) * 2019-05-27 2022-07-15 浙江商汤科技开发有限公司 Camera positioning method and device and storage medium
CN112001968A (en) * 2019-05-27 2020-11-27 浙江商汤科技开发有限公司 Camera positioning method and device and storage medium
CN110335319A (en) * 2019-06-26 2019-10-15 华中科技大学 Semantic-driven camera positioning and map reconstruction method and system
CN110335319B (en) * 2019-06-26 2022-03-18 华中科技大学 Semantic-driven camera positioning and map reconstruction method and system
CN110349250A (en) * 2019-06-28 2019-10-18 浙江大学 Three-dimensional reconstruction method for indoor dynamic scenes based on an RGBD camera
CN110349250B (en) * 2019-06-28 2020-12-22 浙江大学 RGBD camera-based three-dimensional reconstruction method for indoor dynamic scene
CN110298320A (en) * 2019-07-01 2019-10-01 北京百度网讯科技有限公司 Visual positioning method, device and storage medium
CN110298320B (en) * 2019-07-01 2021-06-22 北京百度网讯科技有限公司 Visual positioning method, device and storage medium
CN110706248B (en) * 2019-08-20 2024-03-12 广东工业大学 Visual perception mapping method based on SLAM and mobile robot
CN110706248A (en) * 2019-08-20 2020-01-17 广东工业大学 Visual perception mapping algorithm based on SLAM and mobile robot
CN110673607A (en) * 2019-09-25 2020-01-10 优地网络有限公司 Feature point extraction method and device in dynamic scene and terminal equipment
CN110648354A (en) * 2019-09-29 2020-01-03 电子科技大学 SLAM method in dynamic environment
CN110648354B (en) * 2019-09-29 2022-02-01 电子科技大学 SLAM method in dynamic environment
CN110838145A (en) * 2019-10-09 2020-02-25 西安理工大学 Visual positioning and mapping method for indoor dynamic scene
CN110838145B (en) * 2019-10-09 2020-08-18 西安理工大学 Visual positioning and mapping method for indoor dynamic scene
CN111060924A (en) * 2019-12-02 2020-04-24 北京交通大学 SLAM and target tracking method
CN111105695A (en) * 2019-12-31 2020-05-05 智车优行科技(上海)有限公司 Map making method and device, electronic equipment and computer readable storage medium
CN111311708A (en) * 2020-01-20 2020-06-19 北京航空航天大学 Visual SLAM method based on semantic optical flow and inverse depth filtering
CN113326716A (en) * 2020-02-28 2021-08-31 北京创奇视界科技有限公司 Loop detection method for guiding AR application positioning by assembling in-situ environment
CN113326716B (en) * 2020-02-28 2024-03-01 北京创奇视界科技有限公司 Loop detection method for AR application positioning of assembly guidance of assembly site environment
CN111798475A (en) * 2020-05-29 2020-10-20 浙江工业大学 Indoor environment 3D semantic map construction method based on point cloud deep learning
CN111798475B (en) * 2020-05-29 2024-03-22 浙江工业大学 Indoor environment 3D semantic map construction method based on point cloud deep learning
CN111783457B (en) * 2020-07-28 2021-05-11 北京深睿博联科技有限责任公司 Semantic visual positioning method and device based on multi-modal graph convolutional network
CN111783457A (en) * 2020-07-28 2020-10-16 北京深睿博联科技有限责任公司 Semantic visual positioning method and device based on multi-modal graph convolutional network
CN113345020A (en) * 2021-06-22 2021-09-03 西南科技大学 Instant positioning method, device, equipment and storage medium in dynamic scene

Similar Documents

Publication Publication Date Title
CN107833236A Visual positioning system and method combining semantics in a dynamic environment
CN113450408B (en) Irregular object pose estimation method and device based on depth camera
CN112734852B (en) Robot mapping method and device and computing equipment
CN111724439B (en) Visual positioning method and device under dynamic scene
CN105405154B (en) Target object tracking based on color-structure feature
CN111144364B (en) Twin network target tracking method based on channel attention updating mechanism
CN110717927A (en) Indoor robot motion estimation method based on deep learning and visual inertial fusion
CN104517095B (en) A kind of number of people dividing method based on depth image
CN112883820B (en) Road target 3D detection method and system based on laser radar point cloud
CN111998862B (en) BNN-based dense binocular SLAM method
CN111881749B (en) Bidirectional people flow statistics method based on RGB-D multi-mode data
CN104331901A (en) TLD-based multi-view target tracking device and method
CN112446882A (en) Robust visual SLAM method based on deep learning in dynamic scene
CN109947093A Intelligent obstacle avoidance algorithm based on binocular vision
CN112528974B (en) Distance measuring method and device, electronic equipment and readable storage medium
CN113744315B Semi-direct visual odometry based on binocular vision
KR20200056905A (en) Method and apparatus for aligning 3d model
Liu et al. Visual slam based on dynamic object removal
CN115719436A (en) Model training method, target detection method, device, equipment and storage medium
CN113936210A (en) Anti-collision method for tower crane
Yu et al. Drso-slam: A dynamic rgb-d slam algorithm for indoor dynamic scenes
Sun et al. Real-time and fast RGB-D based people detection and tracking for service robots
Liu et al. 360ST-mapping: An online semantics-guided topological mapping module for omnidirectional visual SLAM
CN111998853A (en) AGV visual navigation method and system
CN116563341A (en) Visual positioning and mapping method for processing dynamic object in complex environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant