CN111768449B - Object grabbing method combining binocular vision with deep learning


Info

Publication number
CN111768449B
Authority
CN
China
Prior art keywords: information, image, deep learning, matching, region
Prior art date
Legal status
Active
Application number
CN201910254109.0A
Other languages
Chinese (zh)
Other versions
CN111768449A (en)
Inventor
曾洪庆
钱超超
Current Assignee
Beijing Vizum Intelligent Technology Co ltd
Original Assignee
Beijing Vizum Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Vizum Intelligent Technology Co ltd
Priority claimed from CN201910254109.0A
Publication of CN111768449A
Application granted
Publication of CN111768449B
Legal status: Active

Classifications

    • G06T 7/85: Analysis of captured images to determine intrinsic or extrinsic camera parameters; stereo camera calibration
    • G06T 5/00: Image enhancement or restoration
    • G06T 7/11: Region-based segmentation
    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06V 10/757: Matching configurations of points or features
    • G06V 20/10: Terrestrial scenes


Abstract

The invention discloses an object grabbing method that combines binocular vision with deep learning, comprising the following steps: collecting a pair of binocular images; performing target recognition on the left and right images separately to obtain target region information; computing a characteristic value for each target region and matching targets between the left and right images; computing the target pose from the left and right region information and the matching relation; and grabbing the object with a mechanical actuator. By combining an adaptive deep learning model with binocular vision and using that model for feature matching, more accurate matching features and matching relations are obtained, making the binocular vision computation more accurate and stable and thereby improving the efficiency and reliability of robotic-arm positioning and grabbing.

Description

Object grabbing method combining binocular vision with deep learning
Technical Field
The invention belongs to the technical field of robotic-arm positioning and grabbing, and in particular relates to an object grabbing method combining binocular vision with deep learning.
Background
How a robotic arm locates and grabs objects determines its efficiency and reliability in practice. Binocular stereo vision can identify and locate objects and obtain their position information quickly, allowing the arm to position itself and grab them. Binocular stereo vision is an important branch of computer vision: two cameras photograph the same object from different positions to obtain two images, a matching algorithm finds the corresponding points between them, the disparity is computed, and the real-world distance of the object is then recovered by the triangulation principle. In practice, every matching algorithm extracts imperfect matching features because of its own limitations, and texture-poor objects make feature extraction even harder, so the matching result is unsatisfactory.
Deep learning can use supervised training to automatically learn useful features, yielding more abstract, high-level representations; its ability to exploit distributed and parallel computation is its greatest strength. Applying deep learning to the matching step of binocular vision compensates for the weaknesses of conventional binocular vision and has high practical value.
Disclosure of Invention
To address these problems, the invention provides an object grabbing method combining binocular vision with deep learning. The technical scheme adopted by the invention is as follows:
An object grabbing method combining binocular vision with deep learning, comprising the following steps: collecting binocular images; performing target recognition on the left and right images separately to obtain target region information; computing a characteristic value for each target region and matching left and right targets; computing the target pose from the left and right target region information and the matching relation; and grabbing with a mechanical actuator.
Further, collecting the binocular images includes: stereoscopically calibrating the binocular camera; acquiring a left image and a right image of the target object with the left and right cameras of the binocular camera respectively; and performing epipolar rectification on the left and right images so that the rectified images are row-aligned.
Further, performing target recognition on the left and right images separately and obtaining the target region information includes: cropping each image to a specified size; feeding it into the adaptive deep learning algorithm for processing; and outputting the detection result as the basis for subsequent matching.
Further, the adaptive deep learning algorithm is based on the classical target detection algorithm SSD; at the conv4_3 layer of the original network, multi-scale feature maps are upsampled following the FPN idea to improve small-target detection accuracy.
Further, computing the region characteristic value from the target region information and matching the left and right targets includes: computing a reference anchor point from the region information of the left and right images; computing feature information P for each region relative to the anchor point; and matching the left and right targets.
Further, computing the reference anchor point from the region information of the left and right images includes: the anchor point is computed from the size and the centre point of each region, where Qi is the size of target region i and Ki is its centre point.
Further, computing the feature information P of each region from the anchor point includes: from the anchor point (x0, y0) and the region information (x, y, w, h, t), computing the coordinate offset (x - x0, y - y0) and the region descriptor (w×h, t) to form the feature vector P = (x - x0, y - y0, w×h, t).
Further, matching the left and right targets includes: each feature vector P is treated as a four-dimensional vector; its components are multiplied by corresponding weights, the Euclidean distance between two weighted vectors is taken as their final degree of difference, and a winner-take-all (WTA) algorithm produces the matching combination from these differences.
The beneficial effects of the invention are as follows: combining an adaptive deep learning model with binocular vision and using that model for feature matching yields more accurate matching features and matching relations, so the binocular vision computation becomes more accurate and stable, improving the efficiency and reliability of robotic-arm positioning and grabbing.
Drawings
Fig. 1 is a schematic flow chart of an object grabbing method combining binocular vision with deep learning.
Detailed Description
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that the advantages and features of the invention can be more easily understood by those skilled in the art and the scope of protection of the invention is clearly defined.
Referring to fig. 1, the embodiment of the present invention specifically includes the following steps:
(1) Perform stereo calibration of the binocular camera.
Specifically: calibrate the left and right cameras of the binocular camera separately to obtain the intrinsic matrix A of the binocular camera, the rotation matrix R1 of the left camera, the rotation matrix R2 of the right camera, the translation vector T1 of the left camera and the translation vector T2 of the right camera; then compute the rotation matrix R and translation vector T between the left and right cameras as

R = R2·R1^T, T = T2 - R·T1
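The camera-to-camera relation of step (1) can be sketched in a few lines of NumPy. This is an illustrative sketch, assuming the common extrinsics convention X_cam = Ri·X_world + Ti for each camera, under which the relative pose is R = R2·R1^T and T = T2 - R·T1:

```python
import numpy as np

def relative_pose(R1, T1, R2, T2):
    """Pose of the right camera relative to the left, given each camera's
    extrinsics X_cam = R_i @ X_world + T_i from calibration."""
    R = R2 @ R1.T          # relative rotation
    T = T2 - R @ T1        # relative translation
    return R, T
```

With this convention, a point X_l expressed in the left camera frame maps to the right camera frame as X_r = R·X_l + T.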
(2) Acquire a left image and a right image of the target object with the left and right cameras of the binocular camera.
(3) Perform epipolar rectification on the left and right images so that the rectified left and right images are row-aligned.
Specifically: the rotation matrix R is decomposed into two half-rotations r1 and r2, where r1 and r2 are obtained by assuming that the left and right cameras each rotate halfway so that their optical axes become parallel;
row alignment of the left and right images is then achieved by applying the composite rotations

Rl = Rrect·r1, Rr = Rrect·r2

where Rrect is the rotation matrix that aligns the rows, with rows e1, e2, e3:

Rrect = [e1^T; e2^T; e3^T]

The construction of Rrect starts from the epipole direction e1: with the principal point of the left image as origin, e1 points along the translation vector from the left camera to the right camera and is normalized to a unit vector:

e1 = T / ||T||

e2 is orthogonal to e1; choosing it in the image plane and normalizing gives

e2 = (-Ty, Tx, 0)^T / sqrt(Tx^2 + Ty^2)

where Tx is the component of the translation vector T in the horizontal direction of the binocular camera plane and Ty is its component in the vertical direction;
e3 is orthogonal to both e1 and e2, and is computed as

e3 = e1 × e2

The physical meaning of the rotation matrix is as follows: alpha is the angle through which the left and right cameras must rotate, within their common plane, to align the rows, with 0 ≤ alpha ≤ 180 degrees; the left camera is rotated about e3 by one half of this angle and the right camera by the other half.
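The row-aligning rotation of step (3) can be assembled directly from the baseline direction. A NumPy sketch, assuming T is the left-to-right translation vector and taking e3 = e1 × e2 so that the stacked rows form a proper rotation (determinant +1):

```python
import numpy as np

def rect_rotation(T):
    """Row-aligning rectification rotation R_rect built from the
    left-to-right translation vector T (the stereo baseline)."""
    e1 = T / np.linalg.norm(T)                                 # epipole / baseline direction
    e2 = np.array([-T[1], T[0], 0.0]) / np.hypot(T[0], T[1])   # in-plane, orthogonal to e1
    e3 = np.cross(e1, e2)                                      # completes the right-handed triad
    return np.vstack([e1, e2, e3])                             # rows e1^T, e2^T, e3^T
```

For a purely horizontal baseline T = (Tx, 0, 0), the images are already row-aligned and `rect_rotation` returns the identity.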
(4) Perform target recognition on the left and right images separately to obtain target region information.
Specifically: crop each image to 300×300 pixels (the SSD input size); feed the cropped image into the adaptive deep learning algorithm for processing; and output the detection result as the basis for subsequent matching.
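The FPN-style refinement used by the adaptive detector can be sketched with plain NumPy. This is an illustrative sketch only, not the patented network: the nearest-neighbour upsampling, the 256-channel 38×38 conv4_3-like map, and the 19×19 coarser map are assumptions, and the 1×1 channel-matching convolutions of a real FPN are omitted.

```python
import numpy as np

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return fmap.repeat(2, axis=1).repeat(2, axis=2)

def fpn_merge(coarse, fine):
    """FPN-style lateral fusion: upsample the coarser map and add it to
    the finer one (channel counts assumed matched beforehand)."""
    return fine + upsample2x(coarse)

# hypothetical SSD300-like shapes: conv4_3-like (256, 38, 38), coarser (256, 19, 19)
fine = np.random.rand(256, 38, 38)
coarse = np.random.rand(256, 19, 19)
merged = fpn_merge(coarse, fine)
```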
(5) Compute a reference anchor point from the target region information.
The anchor point is computed from the size and the centre point of each region, where Qi is the size of target region i and Ki is its centre point.
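The patent does not reproduce the anchor formula, so the sketch below makes an assumption: the anchor is taken as the size-weighted mean of the region centres Ki with weights Qi. Treat `reference_anchor` as one hypothetical reading, not the patented formula.

```python
import numpy as np

def reference_anchor(sizes, centers):
    """Hypothetical anchor: size-weighted mean of the region centres.
    sizes   -- Qi, the area of each detected region
    centers -- Ki, the (x, y) centre of each detected region"""
    Q = np.asarray(sizes, dtype=float)
    K = np.asarray(centers, dtype=float)
    return (Q[:, None] * K).sum(axis=0) / Q.sum()
```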
(6) Compute the feature information P of each region from the anchor point.
Specifically: from the anchor point (x0, y0) and the region information (x, y, w, h, t), compute the coordinate offset (x - x0, y - y0) and the region descriptor (w×h, t), forming the feature vector P = (x - x0, y - y0, w×h, t).
(7) Perform left-right matching using the obtained feature information P.
Specifically: each feature vector P is treated as a four-dimensional vector; its components are multiplied by corresponding weights, the Euclidean distance between two weighted vectors is taken as their final degree of difference, and a winner-take-all (WTA) algorithm produces the matching combination from these differences.
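Steps (6) and (7) can be sketched together as follows. The weight values and the greedy tie-breaking are assumptions: the patent fixes neither the weights nor the exact WTA variant.

```python
import numpy as np

def feature(anchor, region):
    """P = (x - x0, y - y0, w*h, t) for a region (x, y, w, h, t)."""
    x, y, w, h, t = region
    x0, y0 = anchor
    return np.array([x - x0, y - y0, w * h, t], dtype=float)

def wta_match(left_regions, right_regions, anchor_l, anchor_r, weights):
    """Greedy winner-take-all matching on the weighted Euclidean
    distance between left and right feature vectors."""
    w = np.asarray(weights, dtype=float)
    PL = np.array([feature(anchor_l, r) for r in left_regions]) * w
    PR = np.array([feature(anchor_r, r) for r in right_regions]) * w
    D = np.linalg.norm(PL[:, None, :] - PR[None, :, :], axis=2)  # pairwise distances
    matches, used_l, used_r = [], set(), set()
    for i, j in sorted(np.ndindex(D.shape), key=lambda ij: D[ij]):
        if i not in used_l and j not in used_r:   # smallest unclaimed distance wins
            matches.append((i, j))
            used_l.add(i)
            used_r.add(j)
    return matches
```

A usage example: two regions per image, the right ones shifted by a small disparity, recover the obvious pairing with unit weights.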
(8) Compute the three-dimensional coordinates of the feature points from the obtained matching relation, according to the principle of binocular stereo vision. Specifically:
Let the left camera coordinate system O-xyz lie at the origin of the world coordinate system with no rotation, with image coordinate system Ol-XlYl and effective focal length fl; let the right camera coordinate system be Or-xryrzr, with image coordinate system Or-XrYr and effective focal length fr. The camera projection model then gives

Xl = fl·x / z, Yl = fl·y / z, Xr = fr·xr / zr, Yr = fr·yr / zr

The positional relationship between the O-xyz and Or-xryrzr coordinate systems can be expressed by the spatial transformation matrix Mlr = [R | T]:

(xr, yr, zr)^T = R·(x, y, z)^T + T, with R = [r1 r2 r3; r4 r5 r6; r7 r8 r9] and T = (tx, ty, tz)^T

Similarly, for a spatial point in the O-xyz coordinate system, the correspondence between the two image points yields the three-dimensional coordinates of the spatial point:

x = z·Xl / fl, y = z·Yl / fl,
z = fl·(fr·tx - Xr·tz) / (Xr·(r7·Xl + r8·Yl + r9·fl) - fr·(r1·Xl + r2·Yl + r3·fl))

Therefore, once the extrinsic parameters and focal lengths fr, fl of the left and right cameras have been obtained by camera calibration, and the image coordinates of the spatial point in both cameras are known, the three-dimensional coordinates of the measured point can be reconstructed.
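The triangulation of step (8) admits a closed form when the left camera sits at the world origin. A NumPy sketch, assuming the standard pinhole relations Xl = fl·x/z etc., with R having rows (r1, r2, r3), (r4, r5, r6), (r7, r8, r9) and T = (tx, ty, tz) in Xr_cam = R·Xl_cam + T:

```python
import numpy as np

def triangulate(Xl, Yl, Xr, fl, fr, R, T):
    """Closed-form 3D point from a matched pair of image coordinates,
    with the left camera at the world origin (no rotation)."""
    (r1, r2, r3), _, (r7, r8, r9) = R
    tx, _, tz = T
    # depth from the left/right projection equations
    z = fl * (fr * tx - Xr * tz) / (
        Xr * (r7 * Xl + r8 * Yl + r9 * fl) - fr * (r1 * Xl + r2 * Yl + r3 * fl))
    return np.array([z * Xl / fl, z * Yl / fl, z])
```

As a sanity check, projecting a known point through two axis-parallel cameras (R = I, baseline 0.1 along x, f = 500 px) and triangulating recovers the original point.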
(9) The mechanical actuator determines the position of the object from the obtained three-dimensional coordinates and grabs it.

Claims (5)

1. An object grabbing method combining binocular vision with deep learning, characterized by comprising the following steps: collecting binocular images; performing target recognition on the left and right images separately to obtain target region information; computing a characteristic value for each target region and matching left and right targets; computing the target pose from the left and right target region information and the matching relation; and grabbing with a mechanical actuator;
wherein computing the region characteristic value from the target region information and matching the left and right targets comprises: computing a reference anchor point from the region information of the left and right images; computing feature information P of each region from the anchor point; and matching the left and right targets;
wherein the reference anchor point is computed from the region information of the left and right images by using the size and the centre point of each region, where Qi is the size of target region i and Ki is its centre point;
and wherein computing the feature information P of each region from the anchor point comprises: from the anchor point (x0, y0) and the region information (x, y, w, h, t), computing the coordinate offset (x - x0, y - y0) and the region descriptor (w×h, t) to form the feature vector P = (x - x0, y - y0, w×h, t).
2. The object grabbing method combining binocular vision with deep learning of claim 1, wherein collecting the binocular images comprises: stereoscopically calibrating the binocular camera; acquiring a left image and a right image of the target object with the left and right cameras of the binocular camera respectively; and performing epipolar rectification on the left and right images so that the rectified images are row-aligned.
3. The object grabbing method combining binocular vision with deep learning of claim 1, wherein performing target recognition on the left and right images separately to obtain target region information comprises: cropping each image to a specified size; feeding it into the adaptive deep learning algorithm for processing; and outputting the detection result as the basis for subsequent matching.
4. The object grabbing method combining binocular vision with deep learning of claim 3, wherein the adaptive deep learning algorithm is based on the classical target detection algorithm SSD, and multi-scale feature maps are upsampled at the conv4_3 layer of the original network following the FPN idea to improve small-target detection accuracy.
5. The object grabbing method combining binocular vision with deep learning of claim 1, wherein each feature vector P is treated as a four-dimensional vector, its components are multiplied by corresponding weights, the Euclidean distance between the two weighted vectors is taken as the final degree of difference, and a winner-take-all (WTA) algorithm produces the matching combination from these differences.
CN201910254109.0A 2019-03-30 2019-03-30 Object grabbing method combining binocular vision with deep learning Active CN111768449B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910254109.0A CN111768449B (en) 2019-03-30 2019-03-30 Object grabbing method combining binocular vision with deep learning


Publications (2)

Publication Number Publication Date
CN111768449A CN111768449A (en) 2020-10-13
CN111768449B 2024-05-14

Family

ID=72718687


Country Status (1)

Country Link
CN (1) CN111768449B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113393524B (en) * 2021-06-18 2023-09-26 常州大学 Target pose estimation method combining deep learning and contour point cloud reconstruction
CN113689326B (en) * 2021-08-06 2023-08-04 西南科技大学 Three-dimensional positioning method based on two-dimensional image segmentation guidance
CN116128960A (en) * 2021-09-17 2023-05-16 山西大学 Automatic workpiece grabbing method, system and device based on machine learning
CN117409340B (en) * 2023-12-14 2024-03-22 上海海事大学 Unmanned aerial vehicle cluster multi-view fusion aerial photography port monitoring method, system and medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015042460A1 (en) * 2013-09-20 2015-03-26 Camplex, Inc. Surgical visualization systems and displays
CN107192331A (en) * 2017-06-20 2017-09-22 佛山市南海区广工大数控装备协同创新研究院 A kind of workpiece grabbing method based on binocular vision
CN107767423A (en) * 2017-10-10 2018-03-06 大连理工大学 A kind of mechanical arm target positioning grasping means based on binocular vision
CN108076338A (en) * 2016-11-14 2018-05-25 北京三星通信技术研究有限公司 Image vision processing method, device and equipment
CN108171748A (en) * 2018-01-23 2018-06-15 哈工大机器人(合肥)国际创新研究院 A kind of visual identity of object manipulator intelligent grabbing application and localization method
CN108229456A (en) * 2017-11-22 2018-06-29 深圳市商汤科技有限公司 Method for tracking target and device, electronic equipment, computer storage media
CN108381549A (en) * 2018-01-26 2018-08-10 广东三三智能科技有限公司 A kind of quick grasping means of binocular vision guided robot, device and storage medium
CN108647573A (en) * 2018-04-04 2018-10-12 杭州电子科技大学 A kind of military target recognition methods based on deep learning
CN108656107A (en) * 2018-04-04 2018-10-16 北京航空航天大学 A kind of mechanical arm grasping system and method based on image procossing
CN108876855A (en) * 2018-05-28 2018-11-23 哈尔滨工程大学 A kind of sea cucumber detection and binocular visual positioning method based on deep learning
CN109034018A (en) * 2018-07-12 2018-12-18 北京航空航天大学 A kind of low latitude small drone method for barrier perception based on binocular vision
CN109102547A (en) * 2018-07-20 2018-12-28 上海节卡机器人科技有限公司 Robot based on object identification deep learning model grabs position and orientation estimation method


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Shuxin Li et al., "Multiscale Rotated Bounding Box-Based Deep Learning Method for Detecting Ship Targets in Remote Sensing Images", Sensors, vol. 18, no. 8, 2018-08-17, pp. 1-14 *
Yuan Binli, "Image Matching and Localization of Common Workpieces Based on Binocular Stereo Vision", China Master's Theses Full-text Database, Information Science and Technology, no. 2019-03, 2019-03-15, I138-805 *
Xu Kai, "Research on Manipulator Positioning and Grabbing Technology Based on Binocular Vision", China Master's Theses Full-text Database, Information Science and Technology, no. 2018-06, 2018-06-15, I138-1247 *
Li Chuanpeng, "Research on Target Recognition and Grabbing Localization Based on Machine Vision and Deep Learning", China Master's Theses Full-text Database, Information Science and Technology, no. 2017-08, I138-356, pp. 65-78 *
Ma Li et al., "An Adaptive-Weight Stereo Matching Algorithm Suitable for Hardware Implementation", 《***仿真学报》, vol. 26, no. 9, pp. 2079-2084 *


Similar Documents

Publication Publication Date Title
CN111768449B (en) Object grabbing method combining binocular vision with deep learning
CN109544636B (en) Rapid monocular vision odometer navigation positioning method integrating feature point method and direct method
CN109166149B (en) Positioning and three-dimensional line frame structure reconstruction method and system integrating binocular camera and IMU
CN106570913B (en) monocular SLAM rapid initialization method based on characteristics
CN111897349B (en) Autonomous obstacle avoidance method for underwater robot based on binocular vision
CN105894499B (en) A kind of space object three-dimensional information rapid detection method based on binocular vision
CN104463108B (en) A kind of monocular real time target recognitio and pose measuring method
CN111062990A (en) Binocular vision positioning method for underwater robot target grabbing
CN111998862B (en) BNN-based dense binocular SLAM method
CN111127524A (en) Method, system and device for tracking trajectory and reconstructing three-dimensional image
CN111105460B (en) RGB-D camera pose estimation method for three-dimensional reconstruction of indoor scene
CN109785373B (en) Speckle-based six-degree-of-freedom pose estimation system and method
CN104240229B (en) A kind of adaptive method for correcting polar line of infrared binocular camera
TWI709062B (en) Virtuality reality overlapping method and system
CN108154536A (en) The camera calibration method of two dimensional surface iteration
CN113160335A (en) Model point cloud and three-dimensional surface reconstruction method based on binocular vision
CN116129037B (en) Visual touch sensor, three-dimensional reconstruction method, system, equipment and storage medium thereof
CN111429571B (en) Rapid stereo matching method based on spatio-temporal image information joint correlation
CN108171753A (en) Stereoscopic vision localization method based on centroid feature point Yu neighborhood gray scale cross correlation
CN111047636B (en) Obstacle avoidance system and obstacle avoidance method based on active infrared binocular vision
CN110363801A (en) The corresponding point matching method of workpiece material object and workpiece three-dimensional CAD model
CN110487254B (en) Rapid underwater target size measuring method for ROV
CN104346614A (en) Watermelon image processing and positioning method under real scene
CN114998532B (en) Three-dimensional image visual transmission optimization method based on digital image reconstruction
CN113240749A (en) Long-distance binocular calibration and distance measurement method for recovery of unmanned aerial vehicle of marine ship platform

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant