CN110517306B

CN110517306B - Binocular depth vision estimation method and system based on deep learning

Info

Publication number: CN110517306B
Application number: CN201910814513.9A
Authority: CN
Inventors: 秦豪
Original assignee: Dilu Technology Co Ltd
Current assignee: Dilu Technology Co Ltd
Priority date: 2019-08-30
Filing date: 2019-08-30
Publication date: 2023-07-28
Anticipated expiration: 2039-08-30
Also published as: CN110517306A

Abstract

The invention discloses a binocular depth vision estimation method and a binocular depth vision estimation system based on deep learning, comprising the following steps of collecting training data; the depth generation module generates a depth distance corresponding to the position of the picture, and stores the depth distance as a depth picture with depth information, the size of which is consistent with that of the original picture, and the pixel value of the picture corresponds to the relative distance; training a neural network model; the depth estimation results in an estimated depth distance map. The invention has the beneficial effects that: the binocular vision depth estimation method based on the deep learning is high in accuracy, high in generalization capability, capable of performing transfer learning, and capable of greatly improving operation speed compared with the traditional algorithm operation time in speed when being applied to different environmental conditions.

Description

Binocular depth vision estimation method and system based on deep learning

Technical Field

The invention relates to the technical field of depth distance measurement by a binocular camera, in particular to a binocular depth vision estimation method and a binocular depth vision estimation system based on deep learning.

Background

In recent years, obtaining the distance between environmental objects through depth estimation is an important field in computer vision. Similar to two eyes of a human, three-dimensional information in the environment is reconstructed through a binocular camera, and an estimation of the distance between objects in the environment is obtained. The binocular vision depth estimation method based on the traditional computer vision, such as SGM algorithm, has the characteristics of low precision and too low speed, and the algorithm has high dependence on environment, has poor robustness in complex scenes, and is difficult to meet the requirements of commercial landing. The binocular vision depth estimation method based on the deep learning has the characteristics of high precision, strong generalization capability, high speed and the like.

Disclosure of Invention

This section is intended to outline some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section as well as in the description summary and in the title of the application, to avoid obscuring the purpose of this section, the description summary and the title of the invention, which should not be used to limit the scope of the invention.

The present invention has been made in view of the above-described problems occurring in the prior art.

Therefore, one technical problem solved by the present invention is: the binocular depth vision estimation method based on the deep learning is provided for obtaining the distance between the environmental object and the object more accurately.

In order to solve the technical problems, the invention provides the following technical scheme: a binocular depth vision estimation method based on deep learning comprises the following steps of collecting training data, and obtaining two initial pictures with different visual angles by a camera module; the depth generation module generates a depth distance corresponding to the position of the picture, and stores the depth distance as a depth picture with depth information, the size of which is consistent with that of the original picture, and the pixel value of the picture corresponds to the relative distance; training a neural network model, namely inputting the depth picture into the neural network model for training, and obtaining and storing trained neural network parameters through iterative training; and estimating the depth, wherein the camera module acquires an actual picture, and inputs the actual picture into the trained neural network model for calculation to obtain an estimated depth distance map.

As a preferred embodiment of the deep learning-based binocular depth vision estimation method of the present invention, the method comprises: the neural network model comprises a convolutional neural network, a feature fusion network layer and a 3D convolutional neural network layer, and comprises the following training steps that pictures acquired by the camera module are used as input; obtaining a characteristic diagram of the two diagrams through a convolutional neural network; the output of the convolutional neural network layer is used as the input of the feature fusion network layer, and fusion features are extracted; and putting the depth map into a 3D convolutional neural network layer to extract the depth map.

As a preferred embodiment of the deep learning-based binocular depth vision estimation method of the present invention, the method comprises: the acquired depth picture is put into the neural network model, and picture characteristics are extracted through the convolutional neural network; and inputting the three-dimensional feature fusion data to the feature fusion network layer to perform feature fusion, and matching and fusing the related features to generate a 3D feature map.

As a preferred embodiment of the deep learning-based binocular depth vision estimation method of the present invention, the method comprises: the model training further comprises the steps of performing a 3D convolution on the 3D feature map, the convolution kernel size being 3 x 3, obtaining a fusion feature of the position and depth on the 3D feature map, the feature map being 1/4 of the original size, therefore, the picture is up-sampled to the original size to obtain a depth picture with the same size as the picture, and for each pixel point in the picture, a group of depth signals with the size of d=48 is corresponding, and the group of signals is normalized, and the function is defined as follows:

v is the corresponding depth signal and S is the normalized depth signal:

and multiplying the obtained normalized signal by the corresponding signal weight to obtain depth parallax information of the corresponding position.

As a preferred embodiment of the deep learning-based binocular depth vision estimation method of the present invention, the method comprises: further comprising the step of a network update phase,

comparing the obtained parallax image with a real parallax image, namely, a depth image acquired by a depth camera, and obtaining a loss value of a network by adopting a smooth L1 loss function, wherein the loss function formula is as follows, and x is a data difference value of a corresponding position:

back-propagating the loss value to update and iterate the parameters of the whole neural network; repeating the above process until the network parameter update is smaller, and repeatedly performing the training for more than one time to obtain no better test result, namely judging that the training tends to be saturated, and finishing the training.

As a preferred embodiment of the deep learning-based binocular depth vision estimation method of the present invention, the method comprises: the convolutional neural network comprises the following training steps that two depth pictures are simultaneously put into a residual network layer of the convolutional neural network to extract a picture feature map; and the feature map is placed into a spatial pyramid pooling layer for feature enhancement, so that richer feature information is obtained.

As a preferred embodiment of the deep learning-based binocular depth vision estimation method of the present invention, the method comprises: the feature fusion network layer comprises the following steps that features extracted by a convolution layer of a convolution neural network are used as input; inputting the rich characteristic information content of the convolutional depth fusion layer; the depth information fusion layer generates an information layer with depth information matching.

Binocular deep as the deep learning based on the inventionA preferred embodiment of the method of visual estimation of the degree, wherein: the 3D convolutional neural network layer comprises the following steps of taking the output of the characteristic fusion network layer as input; inputting a Hourgass module to extract richer deep high-dimensional information; obtaining a depth module with the size of the original image through the up-sampling layer; the size is D W H, where meaning is a graph of D size, assuming that the ith graph W _j ,H _k The pixel value of (a) is A _ijk The output on the corresponding depth map is: d (D) _jk ＝∑A _ijk *i(i＝0,1,2……D)。

The invention solves the other technical problem that: the binocular depth vision estimation system based on the deep learning is provided, and the method can be realized by means of the system.

In order to solve the technical problems, the invention provides the following technical scheme: the device comprises a camera module, a depth generation module and a neural network model; the camera module is a camera fixedly arranged on the binocular camera and is used for acquiring pictures with two different visual angles; the depth generation module generates a depth picture according to the acquired picture, and the relative distance between the depth picture and the camera corresponds to the pixel value of the image; the neural network model utilizes the acquired pictures to carry out deep learning and save the neural network parameters for generating an estimated depth distance map.

As a preferred embodiment of the system for binocular depth vision estimation based on deep learning according to the present invention, wherein: the camera module comprises 2 groups of cameras, namely a color binocular camera and a gray binocular camera, wherein the color binocular camera is used for collecting pictures and is used for training as the input of a neural network, and the gray binocular camera is used for generating a depth map because of better contrast and resolution.

The invention has the beneficial effects that: the binocular vision depth estimation method based on the deep learning is high in accuracy, high in generalization capability, capable of performing transfer learning, and capable of greatly improving operation speed compared with the traditional algorithm operation time in speed when being applied to different environmental conditions.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:

FIG. 1 is a schematic overall flow chart of a method for binocular depth vision estimation based on deep learning according to a first embodiment of the present invention;

FIG. 2 is a schematic diagram of a binocular depth estimation architecture according to a first embodiment of the present invention;

FIG. 3 is a schematic diagram of a convolutional neural network module layer according to a first embodiment of the present invention;

fig. 4 is a schematic structural diagram of a feature fusion network layer according to a first embodiment of the present invention;

FIG. 5 is a schematic diagram of a 3D convolutional neural network layer according to a first embodiment of the present invention;

fig. 6 is a schematic structural diagram of a system for binocular depth vision estimation based on deep learning according to a second embodiment of the present invention.

Detailed Description

So that the manner in which the above recited objects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.

Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.

While the embodiments of the present invention have been illustrated and described in detail in the drawings, the cross-sectional view of the device structure is not to scale in the general sense for ease of illustration, and the drawings are merely exemplary and should not be construed as limiting the scope of the invention. In addition, the three-dimensional dimensions of length, width and depth should be included in actual fabrication.

Also in the description of the present invention, it should be noted that the orientation or positional relationship indicated by the terms "upper, lower, inner and outer", etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first, second, or third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

The terms "mounted, connected, and coupled" should be construed broadly in this disclosure unless otherwise specifically indicated and defined, such as: can be fixed connection, detachable connection or integral connection; it may also be a mechanical connection, an electrical connection, or a direct connection, or may be indirectly connected through an intermediate medium, or may be a communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.

Example 1

Referring to the illustrations of fig. 1 to 5, obtaining the distance between environmental objects by depth estimation is an important field in computer vision. It is similar to two eyes of human beings, and three-dimensional information in the environment is reconstructed through a binocular camera to obtain an estimation of the distance between objects in the environment.

By calculating the parallax of the two images, the distance measurement is directly performed on the front scene (the range where the images are captured) without judging what type of obstacle appears in front. Therefore, for any type of obstacle, the necessary early warning or braking can be performed according to the change of the distance information. The principle of a binocular camera is similar to that of a human eye. The human eye is able to perceive the distance of an object due to the difference in the images presented by both eyes to the same object, also known as "parallax". The farther the object is, the smaller the parallax; whereas the larger the parallax. The magnitude of the parallax corresponds to the distance between the object and the eyes, which is also why 3D movies enable stereoscopic hierarchical perception.

The binocular vision depth estimation method based on the traditional computer vision, such as SGM algorithm, has the characteristics of low precision and too low speed, and the algorithm has high dependence on environment, has poor robustness in complex scenes, and is difficult to meet the requirements of commercial landing. Therefore, the binocular vision depth estimation method based on the deep learning is provided in the embodiment, and has the advantages of high precision, strong generalization capability and high speed.

Specifically, the method comprises the following steps,

s1: acquiring training data, and acquiring two initial pictures 101 with different visual angles by using a camera module 100;

s2: the depth generation module 200 generates a depth distance corresponding to the picture position, and stores the depth distance as a depth picture 201 with depth information, the depth distance is consistent with the size of the original picture, and the picture pixel value corresponds to the relative distance;

s4: training the neural network model 300, inputting the depth picture 201 into the neural network model 300 for training, and obtaining and storing the trained neural network parameters through iterative training; the neural network model 300 comprises a convolutional neural network 301, a feature fusion network layer 302 and a 3D convolutional neural network layer 303, and comprises the following training steps, wherein pictures acquired by the camera module 100 are used as input; the characteristic diagrams of the two diagrams are obtained through a convolutional neural network 301; the output of the convolutional neural network layer 301 serves as the input of the feature fusion network layer 302, and fusion features are extracted; the depth map is extracted by being put into the 3D convolutional neural network layer 303.

More specifically, the method comprises the following steps,

the acquired depth picture 201 is put into a neural network model 300, and picture characteristics are extracted through a convolutional neural network 301; inputting the three-dimensional image to a feature fusion network layer 302 for feature fusion, and matching and fusing the related features to generate a 3D feature map;

then 3D convolution is carried out on the 3D feature map, the convolution kernel size is 3 multiplied by 3, fusion features of the position and the depth on the 3D feature map are obtained, the feature map (the feature map refers to an output result of the 3D convolution at the moment) is 1/4 of the original size (the original size is the original size of a picture acquired by a camera, the purpose of 1/4 is to reduce the calculation amount for picture compression, otherwise the network has difficulty obtaining results in less than 1 s), thus upsampling the picture to the original size to obtain a depth picture that is identical to the picture size (e.g., upsampling by bilinear interpolation), and for each pixel in the picture, normalizing the set of depth signals to a size d=48, the function is defined as follows:

v is the corresponding depth signal and S is the normalized depth signal:

the obtained normalized signal is multiplied by the corresponding signal weight to obtain the depth parallax information of the corresponding position, and the depth information is directly obtained without subsequent processing.

back-propagating the loss value to update and iterate the parameters of the whole neural network;

the above process is repeated until the network parameter update is small (which can be understood as a final depth information graph, and each training is performed, for example, the depth of a certain corresponding point is 100, and then training is performed for many times, which is always about 100, to indicate that the network is not learned any more), and the iterative training is performed for many times, which does not obtain better test results, namely, it is determined that the training tends to be saturated, and the training is completed.

S4: the depth estimation, the camera module 100 collects the actual picture, and inputs the actual picture into the trained neural network model 300 for calculation, so as to obtain the estimated depth distance map.

The training steps of the convolutional neural network 301, the feature fusion network layer 302 and the 3D convolutional neural network layer 303 are sequentially included in this embodiment, specifically.

The convolutional neural network 301 comprises the following training steps, wherein two depth pictures 201 are simultaneously put into a residual network layer of the convolutional neural network 301 to extract picture feature images; and the feature map is placed into a spatial pyramid pooling layer for feature enhancement, so that richer feature information is obtained.

The feature fusion network layer 302 includes the steps of convolving features extracted by a convolutional layer of the neural network 301 as input; inputting the rich characteristic information content of the convolutional depth fusion layer; the depth information fusion layer generates an information layer with depth information matching.

The 3D convolutional neural network layer 303 includes the steps of taking as input the output of the feature fusion network layer 302; inputting a Hourgass module to extract richer deep high-dimensional information; obtaining a depth module with the size of the original image through the up-sampling layer; the size is D W H, where meaning is a graph of D size, assuming that the ith graph W _j ,H _k The pixel value of (a) is A _ijk The output on the corresponding depth map is: d (D) _jk ＝∑A _ijk *i i＝0,1,2……D。

It should be noted that, the present application provides a binocular depth vision estimation method based on deep learning, actually, two pictures with different visual angles are obtained by two cameras at fixed positions on a binocular camera, the two pictures are simultaneously put into a convolutional neural network residual error network to extract a picture feature map, and then the picture feature map is put into a spatial pyramid pooling layer to be enhanced as a feature. Carrying out feature fusion on two picture features with correlation, wherein the purpose of feature fusion is as follows: the feature maps presented by the two pictures at different view angles have relevance because the two pictures differ only in view angle, there are many identical or similar matching features in the feature maps, and the features are fused together for subsequent depth extraction.

The fusion layer neural network is utilized to supervise and learn proper feature matching, and the feature matching is different from the traditional binocular matching algorithm, wherein the feature matching is the matching learned by the neural network, such as relatively poor feature matching, which can lead to poor results, and the reverse supervision neural network is adjusted to better feature matching according to the quality of the results.

Finally, only the effective distance range of the camera is needed to be given, for example, the maximum distance is 200 meters, the neural network can give depth distance information within 0-200 meters on the picture, and the confidence reliability is not high beyond the effective distance range, so that the depth information is set according to 200 meters.

The embodiment provides a binocular vision depth estimation method based on deep learning, which has high precision. On the KITTI data set, an average error pixel percentage of 0.83% can be achieved, whereas the conventional binocular depth estimation method has an error rate of 3.57%. The method for estimating the binocular depth has strong generalization capability, and can transfer and learn the meaning of transfer and learning in the method: the method has the advantages that the method can be applied to different types of binocular cameras by using the same neural network architecture, the whole neural network is not required to be completely retrained, the parameters of the binocular cameras are only required to be set on the original basis, the output parameters of the network layer of the training neural network are only required to be updated, the development time and difficulty can be reduced, the method is applied to pictures with the size of 1242 x 375 under different environment conditions, the average operation time is 0.32 seconds on the speed, compared with the operation time of 3.7 seconds of the traditional algorithm, the operation speed is greatly improved, the requirements of commercial landing are basically met, the practical problems of the general traditional binocular depth estimation scheme are solved, and the method has a great application prospect in the related fields of automatic driving, indoor positioning and the like.

Scene one:

in the embodiment, the test vehicle for deploying the method and the vehicle for deploying the traditional method are subjected to comparison test, python software programming is used for realizing simulation test of the method and the traditional method, the test is performed on the KITTI data set, simulation data are obtained according to experimental results, and the traditional method adopts SGM algorithm and SDM algorithm in the experiment. The test environment is a binocular depth test set of the public data set KITTI, and the traditional method and the method run python software to realize simulation.

And comparing the performance of each algorithm, testing the running speed of each algorithm, calculating the error value of each algorithm, and averaging the estimated errors on the KITTI test set. The experiment was conducted by using the above 2 conventional methods and the present method, and the test results are shown in Table 1 below.

Table 1: and (5) testing results.

Algorithm	Speed of speed	Average error pixel ratio	Average parallax
				This patent	0.32s	1.32％	0.5px
SGM	3.7s	5.76％	1.3px
				SDM	～1min	10.95％	2.0px

Referring to the data in table 1 above, it can be observed that the algorithm of the present implementation is much superior to the conventional algorithm in terms of both speed and accuracy.

Example 2

Referring to the illustration of fig. 6, the present embodiment proposes a binocular depth vision estimation system based on deep learning, the above embodiment can be implemented in dependence on the present system, and the above embodiment method or system can be applied to depth vision estimation of a vehicle. For example, the binocular camera arranged on the vehicle body is used for shooting image information around the vehicle body, meanwhile, the vehicle-mounted host computer is written with a deep learning network algorithm, and the shot image is input into the algorithm module to calculate and estimate the distance between the environmental object and the vehicle-mounted host computer and display the distance on the display screen of the vehicle-mounted host computer, so that the driver can be reminded of safe driving.

Specifically, the system comprises a camera module 100, a depth generation module 200 and a neural network model 300; the camera module 100 is a camera fixedly arranged on the binocular camera and is used for acquiring pictures with two different visual angles; the depth generation module 200 generates a depth picture 201 according to the acquired picture, and the relative distance between the depth picture 201 and the camera corresponds to the pixel value of the image; the neural network model 300 uses the acquired pictures for deep learning to save the neural network parameters for the generation of the estimated depth distance map. The camera module 100 includes 2 sets of cameras, a color binocular camera and a gray binocular camera, wherein the color binocular camera is used for acquiring pictures and training as input of a neural network, and the gray binocular camera is used for generating a depth map because of better contrast and resolution.

It should be noted that, the corresponding two stereo depth gray-scale cameras may generate a depth distance corresponding to the picture position, that is, find a matching point corresponding to the binocular picture, and obtain depth information d=f×b/d according to the pixel difference between the matching points, where f is the focal length of the camera, b is the distance between the binocular two cameras, and d is the pixel difference between the two matching points, that is, the distance between the two matching points and the camera. And the depth information is stored as a picture with the same size as the original picture, for example, only the depth information picture corresponding to the left view is needed to be stored, the pixel value of the picture corresponds to the relative distance, the information of the gray picture is generally composed of any number from 0 to 255, for example, white is 255, black is 1, and if the effective distance of the camera is 1 to 200 meters, the pixel value of the picture can be used for representing the depth information, namely 155 represents that the point of the picture is 155 meters away from the camera.

The network does not know how to perform feature fusion matching in the untrained condition, and through continuous training, the network can give a proper feature matching method to match related features, wherein the related features mean: the left image shows an automobile and the right image shows an automobile, the characteristic images are all provided with relevant information of the automobile, and the characteristic fusion layer is used for matching the relevant automobile information so as to fuse the relevant automobile information into a 3D characteristic image.

The depth generation module 200 in this embodiment may be a calculation module in a depth gray-scale camera, for example, a depth camera adopting RGBD-SLAM, where the detection range includes detection accuracy, detection angle, and frame rate; meanwhile, the module is small in power consumption and low in depth information, the depth information is obtained by relying on a pure software algorithm, and the processing chip has high computing performance, so that the depth information of surrounding objects can be obtained. The embodiment only uses the image as the acquisition of training data, and meanwhile, the method is not difficult to find out, and because of the defects of high calculation performance and slow operation of a processing chip, the method can also utilize common shooting to acquire images and then directly adopt a binocular imaging algorithm on a computer to acquire depth information. After the training of the neural network model 300 is completed, only a general camera is required to be installed on a vehicle body, the neural network model 300 is input to obtain depth information, the required performance is low, the operation speed is high, the neural network model 300 is a written deep learning algorithm chip which is arranged in a vehicle-mounted host, for example, the neural network model 300 can be a deep learning main stream chip of a GPU (graphics processing unit), the whole neural network model is a huge calculation matrix, the GPU has thousands of calculation cores, the application throughput of 10-100 times can be realized, and the neural network model also supports the parallel calculation capability which is vital to the deep learning, so that the neural network model can be faster than the traditional processor, and the training process is greatly accelerated. GPU is one of the most commonly used deep learning arithmetic units at present.

It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered in the scope of the claims of the present invention.

Claims

1. A method for binocular depth vision estimation based on deep learning, characterized by: comprises the steps of,

acquiring training data, and acquiring two initial pictures (101) with different visual angles by a camera module (100);

the depth generation module (200) generates a depth distance corresponding to the picture position, and stores the depth distance as a depth picture (201) with depth information, wherein the depth distance is consistent with the size of an original picture, and the picture pixel value corresponds to the relative distance;

training a neural network model (300), inputting the depth picture (201) into the neural network model (300) for training, and obtaining and storing trained neural network parameters through iterative training;

the depth estimation is carried out, the camera module (100) collects actual pictures, the actual pictures are input into the trained neural network model (300) to be calculated, and an estimated depth distance map is obtained; the neural network model (300) comprises a convolutional neural network (301), a feature fusion network layer (302) and a 3D convolutional neural network layer (303), comprising the training steps,

the image acquired by the camera module (100) is used as input;

obtaining a characteristic diagram of the two diagrams through a convolutional neural network (301);

the output of the convolutional neural network layer (301) is used as the input of the feature fusion network layer (302) to extract fusion features; putting the depth map into a 3D convolutional neural network layer (303) to extract a depth map; the acquired depth picture (201) is put into the neural network model (300), and picture features are extracted through the convolutional neural network (301); inputting the three-dimensional image to the feature fusion network layer (302) for feature fusion, and matching and fusing the related features to generate a 3D feature map;

the convolutional neural network (301) comprises a training step,

simultaneously placing two depth pictures (201) into a residual network layer of the convolutional neural network (301) to extract a picture feature map;

the feature map is placed into a space pyramid pooling layer for feature enhancement, so that richer feature information is obtained;

the feature fusion network layer (302) comprises the steps of,

the features extracted by the convolution layer of the convolutional neural network (301) are used as inputs;

inputting the rich characteristic information content of the convolutional depth fusion layer;

the depth information fusion layer generates an information layer with depth information matching.

2. The method of deep learning based binocular depth vision estimation of claim 1, wherein: the model training may further comprise the step of,

3D convolution is carried out on a 3D feature map, the convolution kernel size is 3 multiplied by 3, fusion features of positions and depths on the 3D feature map are obtained, the feature map is 1/4 of the original size, therefore, the picture is up-sampled to the original size to obtain a depth picture consistent with the picture size, a group of depth signals with the size of D=48 are corresponding to each pixel point in the picture, normalization is carried out on the group of signals, and the function is defined as follows:

v is the corresponding depth signal and S is the normalized depth signal:

3. The method of deep learning based binocular depth vision estimation of claim 2, wherein: further comprising the step of a network update phase,

repeating the above process until the network parameter update is smaller, and repeatedly performing the training for more than one time to obtain no better test result, namely judging that the training tends to be saturated, and finishing the training.

4. A method of deep learning based binocular depth vision estimation according to claim 3, wherein: the 3D convolutional neural network layer (303) comprises the steps of,

-taking as input the output of the feature fusion network layer (302);

inputting a Hourgass module to extract richer deep high-dimensional information;

obtaining a depth module with the size of the original image through the up-sampling layer;

the size is d×w×h, where the meaning is a graph with D sizes, and assuming that the pixel value of the ith graph Wj, hk is Aijk, the output on the corresponding depth graph is:

Djk＝∑Aijk*i(i＝0,1,2......D)。

5. a system employing the deep learning based binocular depth vision estimation method of any one of claims 1 to 4, characterized in that: comprises an image pickup module (100), a depth generation module (200) and a neural network model (300); the camera module (100) is a camera fixedly arranged on the binocular camera and is used for acquiring pictures with two different visual angles; the depth generation module (200) generates a depth picture (201) according to the acquired picture, and the relative distance between the depth picture (201) and the camera corresponds to the pixel value of the image; the neural network model (300) uses the acquired pictures to perform deep learning to save neural network parameters for the generation of an estimated depth distance map.

6. The deep learning based binocular depth vision estimation system of claim 5, wherein: the camera module (100) comprises 2 groups of cameras, namely a color binocular camera and a gray binocular camera, wherein the color binocular camera is used for collecting pictures and is used for training as the input of a neural network, and the gray binocular camera is used for generating a depth map because of better contrast and resolution.