CN109978936A - Disparity map acquisition method and apparatus, storage medium, and device - Google Patents

Disparity map acquisition method and apparatus, storage medium, and device

Info

Publication number
CN109978936A
CN109978936A
Authority
CN
China
Prior art keywords
disparity map
feature map
network
dilated
obtain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910243260.4A
Other languages
Chinese (zh)
Other versions
CN109978936B (en)
Inventor
揭泽群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910243260.4A
Publication of CN109978936A
Application granted
Publication of CN109978936B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/30 - Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33 - Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/50 - Depth or shape recovery
    • G06T7/55 - Depth or shape recovery from multiple images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20228 - Disparity calculation for image-based rendering
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 - Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N2013/0074 - Stereoscopic image analysis
    • H04N2013/0081 - Depth or disparity estimation from stereoscopic image signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)

Abstract

This application discloses a disparity map acquisition method and apparatus, a storage medium, and a device, belonging to the field of machine learning. The method includes: obtaining multiple viewpoint images of a target spatial region; for any image pair among the multiple viewpoint images, performing feature extraction successively on the left image and the right image of the image pair based on a first network, the first network including multiple non-dilated convolutional layers, and obtaining the left feature maps and right feature maps output by the multiple non-dilated convolutional layers; obtaining a first disparity map based on the first left feature map and the first right feature map output by the M-th non-dilated convolutional layer of the first network, M being a positive integer; obtaining the second left feature map and the second right feature map output by the (M-1)-th non-dilated convolutional layer; and obtaining a second disparity map using a second network according to the first disparity map, the second left feature map, and the second right feature map, the second network including multiple dilated convolutional layers with different dilation rates. With this application, an accurate disparity map can be obtained.

Description

Disparity map acquisition method and apparatus, storage medium, and device
Technical field
This application relates to the field of machine learning, and in particular to a disparity map acquisition method and apparatus, a storage medium, and a device.
Background
Stereo vision is an important form of machine vision. Taking BSV (Binocular Stereo Vision) as an example, BSV is a method that, based on the parallax principle, uses cameras to capture the left and right viewpoint images of the same spatial object and computes the positional offset between the corresponding points (correspondences) of the two viewpoint images, thereby obtaining the three-dimensional geometric information of the spatial object.
Continuing with BSV as an example, the positional offset above is the disparity. Suppose the projections of any point M on the spatial object onto the left and right viewpoint images are ML and MR respectively; ML and MR are then called corresponding points, and the process of finding the corresponding points of the left and right viewpoint images is stereo matching. Put another way, a disparity map can be obtained with a stereo matching algorithm. A disparity map is an image that takes either of the left and right viewpoint images as reference, has the same size as the reference image, and whose element values are disparity values.
When obtaining disparity maps, the related art usually relies on neural networks based on forward propagation. Such approaches cannot effectively capture the context information of pixels, so the accuracy of the obtained disparity maps is not good enough, and in turn, for example, depth estimation of spatial objects based on such disparity maps can be insufficiently accurate. How to obtain an accurate disparity map has therefore become an urgent problem for those skilled in the art.
Summary of the invention
The embodiments of this application provide a disparity map acquisition method and apparatus, a storage medium, and a device, which solve the problem that the disparity maps obtained by the related art are inaccurate. The technical solutions are as follows:
In one aspect, a disparity map acquisition method is provided, the method including:
obtaining multiple viewpoint images of a target spatial region;
for any image pair among the multiple viewpoint images, performing feature extraction successively on the left image and the right image of the image pair based on a first network, the first network including multiple non-dilated convolutional layers, and obtaining the left feature maps and right feature maps output by the multiple non-dilated convolutional layers;
obtaining a first disparity map based on the first left feature map and the first right feature map output by the M-th non-dilated convolutional layer of the first network, M being a positive integer;
obtaining the second left feature map and the second right feature map output by the (M-1)-th non-dilated convolutional layer; and
obtaining a second disparity map using a second network according to the first disparity map, the second left feature map, and the second right feature map, the second network including multiple dilated convolutional layers with different dilation rates.
In another aspect, a disparity map acquisition apparatus is provided, the apparatus including:
a first obtaining module, configured to obtain multiple viewpoint images of a target spatial region;
an extraction module, configured to, for any image pair among the multiple viewpoint images, perform feature extraction successively on the left image and the right image of the image pair based on a first network, the first network including multiple non-dilated convolutional layers;
a second obtaining module, configured to obtain the left feature maps and right feature maps output by the multiple non-dilated convolutional layers;
the second obtaining module being further configured to obtain a first disparity map based on the first left feature map and the first right feature map output by the M-th non-dilated convolutional layer of the first network, M being a positive integer;
the second obtaining module being further configured to obtain the second left feature map and the second right feature map output by the (M-1)-th non-dilated convolutional layer;
the second obtaining module being further configured to obtain a second disparity map using a second network according to the first disparity map, the second left feature map, and the second right feature map, the second network including multiple dilated convolutional layers with different dilation rates.
In a possible implementation, the second obtaining module is further configured to obtain the third left feature map and the third right feature map output by the (M-2)-th non-dilated convolutional layer; obtain a third disparity map using the second network according to the second disparity map, the third left feature map, and the third right feature map; repeat the process of obtaining the disparity map corresponding to the previous non-dilated convolutional layer using the second network according to the left and right feature maps output by the previous non-dilated convolutional layer and the disparity map corresponding to the following non-dilated convolutional layer, until the N-th non-dilated convolutional layer of the first network is reached, N being a positive integer less than M; and take the disparity map corresponding to the N-th non-dilated convolutional layer as the final disparity map of the image pair.
In a possible implementation, the second obtaining module is further configured to concatenate the first left feature map and the first right feature map with a per-pixel shift to obtain a first tensor, and perform dimension reduction on the first tensor to obtain the first disparity map.
In a possible implementation, the second obtaining module is further configured to perform upsampling on the first disparity map; warp the second right feature map according to the first disparity map to obtain a warped feature map; and concatenate the upsampling result of the first disparity map, the second left feature map, and the warped feature map along the channel dimension to obtain the intermediate feature map.
In a possible implementation, the second obtaining module is further configured to obtain a second product of the combined feature and the second weight tensor, and perform dimension reduction on the second product to obtain the second disparity map.
In a possible implementation, the second obtaining module is further configured to pass the combined feature successively through a global pooling layer, a first fully connected layer, a first activation layer, a second fully connected layer, and a second activation layer to obtain the second weight tensor.
In a possible implementation, the apparatus further includes:
a third obtaining module, configured to obtain the distance between the two cameras that capture the left image and the right image, and the focal length of the cameras;
a fourth obtaining module, configured to obtain a third product of the distance and the focal length, and take the ratio of the third product to the disparity map corresponding to the N-th non-dilated convolutional layer of the first network as the depth values of the spatial objects contained in the target spatial region.
In another aspect, a storage medium is provided, the storage medium storing at least one instruction, the at least one instruction being loaded and executed by a processor to implement the disparity map acquisition method described above.
In another aspect, a disparity map acquisition device is provided, the device including a processor and a memory, the memory storing at least one instruction, the at least one instruction being loaded and executed by the processor to implement the disparity map acquisition method described above.
On the other hand, a kind of disparity map acquisition system is provided, the system comprises cameras and disparity map to obtain equipment;
The camera obtains multiple visual point images for shooting to object space region;
It includes processor and memory that the disparity map, which obtains equipment, and at least one instruction is stored in the memory, At least one instruction is loaded by the processor and is executed to realize:
Multiple described visual point images are obtained, for any one image pair in multiple described visual point images, are based on first The left image to described image centering and right image successively carry out feature extraction to network respectively, and the first network includes multiple non- Empty convolutional layer obtains the left characteristic pattern and right characteristic pattern of the multiple non-empty convolutional layer output;
The the first left characteristic pattern and the first right characteristic pattern of the non-empty convolutional layer output of m-th based on the first network, The first disparity map is obtained, M is positive integer;
Obtain the second left characteristic pattern and the second right characteristic pattern of M-1 non-empty convolutional layer output;
According to first disparity map, the second left characteristic pattern and the second right characteristic pattern, obtained using the second network The second disparity map is taken, second network includes multiple empty convolutional layers with different empty multiplying powers.
The technical solutions provided by the embodiments of this application bring the following benefits:
After multiple viewpoint images are obtained, for any image pair among the multiple viewpoint images, the embodiments of this application first use the first network, i.e. the feature extraction network, to extract features from the left image and the right image of the image pair successively, and then obtain a disparity map based on the feature information extracted by the feature extraction network and the second network, i.e. the attention network. Because the second network includes multiple dilated convolutional layers with different dilation rates, it can effectively capture the context information of pixels, so the disparity map obtained with the above network structure is more accurate; for example, depth estimation of spatial objects based on the disparity map can also be more accurate. This acquisition approach works well.
Brief description of the drawings
To describe the technical solutions in the embodiments of this application more clearly, the following briefly introduces the drawings required for describing the embodiments. Apparently, the drawings in the following description show only some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of convolution operations with different dilation rates according to an embodiment of this application;
Fig. 2 is a schematic diagram of an implementation environment involved in a disparity map acquisition method according to an embodiment of this application;
Fig. 3 is a schematic diagram of a network structure involved in a disparity map acquisition method according to an embodiment of this application;
Fig. 4 is a schematic diagram of a binocular camera capturing binocular viewpoint images according to an embodiment of this application;
Fig. 5 is a schematic diagram of obtaining disparity according to an embodiment of this application;
Fig. 6 is a flowchart of a disparity map acquisition method according to an embodiment of this application;
Fig. 7 is a schematic diagram of binocular viewpoint images and a disparity map according to an embodiment of this application;
Fig. 8 is a schematic structural diagram of a feature extraction network according to an embodiment of this application;
Fig. 9 is a schematic structural diagram of an attention network according to an embodiment of this application;
Fig. 10 is a flowchart of a disparity map acquisition method according to an embodiment of this application;
Fig. 11 shows a left image and the attention maps at different resolutions according to an embodiment of this application;
Fig. 12 is a flowchart of a disparity map acquisition method according to an embodiment of this application;
Fig. 13 is a flowchart of a disparity map acquisition method according to an embodiment of this application;
Fig. 14 is a flowchart of a disparity map acquisition method according to an embodiment of this application;
Fig. 15 is a schematic structural diagram of a disparity map acquisition apparatus according to an embodiment of this application;
Fig. 16 is a schematic structural diagram of a disparity map acquisition device according to an embodiment of this application;
Fig. 17 is a schematic structural diagram of a disparity map acquisition device according to an embodiment of this application.
Detailed description
To make the objectives, technical solutions, and advantages of this application clearer, the implementations of this application are described in further detail below with reference to the drawings.
Before this application is explained in detail, some terms that may be involved in the embodiments of this application are explained first.
Attention mechanism: derived from research on human vision. In cognitive science, owing to bottlenecks in information processing, humans selectively attend to a part of all available information while ignoring the other visible information. The above mechanism is commonly called the attention mechanism.
The attention mechanism is a brain signal processing mechanism specific to human vision. Human vision obtains the target region that deserves attention, i.e. the focus of attention, by quickly scanning the global image, and then devotes more attention resources to this region to acquire more detailed information about the target of interest while suppressing other useless information.
In summary, the attention mechanism has two main aspects: first, deciding which part of the input needs attention; second, allocating the limited processing resources to the important part.
The attention mechanism in deep learning is essentially similar to the selective visual attention mechanism of humans; its core goal is likewise to select, from a large amount of information, the information that is more critical to the current task.
Stereo matching: a long-standing research focus of multi-view vision. Cameras capture multiple viewpoint images of the same spatial object, and a disparity map is obtained with a stereo matching algorithm.
Here, a viewpoint image refers to an image obtained by photographing the same spatial object from a different viewing angle.
Taking binocular stereo vision as an example, the purpose of stereo matching is to find the corresponding points of the left and right viewpoint images; the disparity is obtained by computing the positional offset between the corresponding points of the two viewpoint images, i.e., the positional offset is the disparity.
In addition, suppose the projections of any point M on the spatial object onto the left and right viewpoint images are ML and MR respectively; then ML and MR are called corresponding points, and the process of finding the corresponding points between the left and right viewpoint images is stereo matching.
Disparity map: an image that takes either of the left and right viewpoint images as reference, has the same size as the reference image, and whose element values are disparity values.
Binocular camera: a camera with two lenses. The two lenses should in theory be identical, and the spacing between them is usually between 10 cm and 20 cm; the embodiments of this application impose no specific limitation on this.
Spatial object: in the embodiments of this application, a spatial object can be any scene or object appearing in the viewfinder range of the binocular camera, such as bushes on both sides of a road, road dividers, roadside billboards, vehicles, or pedestrians; the embodiments of this application impose no specific limitation on this.
Object depth: also called depth perception or distance perception. Object depth reflects how near or far a spatial object is from the binocular camera. For example, if the depth value of spatial object A is greater than that of another spatial object B, then object A is farther from the binocular camera than object B.
Depth maps, which reflect object depth, have a very wide range of applications: because they record the distances of spatial objects from the camera, they can be used in scenarios such as measurement, three-dimensional reconstruction, and virtual view synthesis; the embodiments of this application impose no specific limitation on this.
Receptive field: in a convolutional neural network, the size of the region of the input layer that determines one element of a given layer's output is called the receptive field.
Expressed mathematically, the receptive field is a mapping from one element of a layer's output in the convolutional neural network back to a region of the input layer.
In more popular terms, it is the region on the input map corresponding to one point on a feature map. Note that this refers to the input map of the layer in question, not the original image.
Dilated convolution: in the field of image segmentation, after an image is input into a convolutional neural network (typically a fully convolutional network), the network, like a traditional convolutional neural network, first performs convolution on the image and then pooling, reducing the image size while increasing the receptive field. However, because image segmentation predicts a pixel-wise output, the smaller image obtained after pooling must be upsampled back to the original image size for prediction, and the earlier pooling operations are what let each pixel's prediction see information from a larger receptive field.
Image segmentation therefore has two key operations: pooling, which reduces the image size and increases the receptive field, and upsampling, which enlarges the image size. In the process of first reducing and then enlarging the image size, some information is lost. Dilated convolution was proposed so that a large receptive field can be obtained without the pooling operation.
Referring to Fig. 1, it shows the dilated convolution operations with dilation rates (dilated rates) equal to 1, 2, and 3. The left diagram in Fig. 1 corresponds to the 1-dilated convolution operation with a 3x3 kernel, which is the same as an ordinary convolution operation.
The middle diagram in Fig. 1 corresponds to the 2-dilated convolution operation with a 3x3 kernel. The actual kernel size is still 3x3, but with a hole of 1; that is, for an image patch of size 7x7, the 3x3 kernel convolves only with the features at the 9 black squares, and the rest are skipped. Equivalently, the kernel size is 7x7, but only the weights at the 9 black squares in the figure are nonzero and the rest are 0. As the middle diagram shows, although the kernel size is only 3x3, the receptive field of this convolution has increased to 7x7.
The right diagram in Fig. 1 corresponds to the 3-dilated convolution operation with a 3x3 kernel.
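For illustration only, and not part of the original disclosure, the three operations of Fig. 1 can be sketched in PyTorch, where the dilation argument corresponds to the dilation rate; the single-channel sizes are assumptions:
```python
import torch
import torch.nn as nn

# Three 3x3 convolutions with dilation rates 1, 2, and 3, as in Fig. 1.
# Padding equal to the dilation rate keeps the spatial size unchanged.
x = torch.randn(1, 1, 7, 7)  # a 7x7 single-channel image patch
for r in (1, 2, 3):
    conv = nn.Conv2d(1, 1, kernel_size=3, dilation=r, padding=r)
    y = conv(x)
    # The effective kernel extent is 3 + 2 * (r - 1): 3, 5, 7 for r = 1, 2, 3.
    print(f"dilation={r}: output shape {tuple(y.shape)}")
```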
The implementation environment involved in the disparity map acquisition method provided by the embodiments of this application is introduced below.
Referring to Fig. 2, the implementation environment includes a camera 201 and a disparity map acquisition device 202.
As an example, the camera 201 may be a camera mounted on a robot or a self-driving vehicle in the AI (Artificial Intelligence) field; the embodiments of this application impose no specific limitation on this. That is, the disparity map acquisition method provided by the embodiments of this application can be applied in the AI field, for example to intelligent robots or self-driving vehicles in the AI field. Put another way, the application scenarios of the embodiments of this application include, but are not limited to, scenarios that simulate human intelligence, such as autonomous driving and robotics. AI is an emerging science and technology currently researched and developed to simulate, extend, and expand human intelligence, and has been widely applied in fields such as image processing, face recognition, games, and medical care.
As shown in Fig. 2, the disparity map acquisition device 202 maintains a communication connection with the camera 201; after the camera 201 collects multiple viewpoint images, it transmits the collected viewpoint images to the disparity map acquisition device 202 over this communication connection. The communication connection may be either wireless or wired; the embodiments of this application impose no specific limitation on this. The disparity map acquisition device 202 is configured to output a disparity map based on the multiple viewpoint images collected by the camera 201.
The disparity map acquisition device 202 is a computer device with machine learning capability. For example, the computer device may be a stationary computer device such as a personal computer or a server, or a mobile computer device such as a tablet computer, a smartphone, or an e-book reader; the embodiments of this application impose no specific limitation on this.
In a possible implementation, the disparity map acquisition device 202 is provided with the network structure shown in Fig. 3. This network structure is a stereo matching network structure based on the attention mechanism. Referring to Fig. 3, the input of the network structure is the viewpoint images captured by the camera 201 (for example, the left and right viewpoint images included in an image pair), and its output is a pixel-level disparity map of the same size as the viewpoint images. Referring to Fig. 3, the network structure includes a feature extraction network and an attention network.
It should be noted that the two viewpoint images of any image pair are referred to herein as the left image and the right image; that is, because the two viewpoint images of an image pair have different shooting angles, the two viewpoint images included in any image pair are named the left image and the right image in this text. The left image refers to the image whose shooting angle is farther to the left, and the right image refers to the image whose shooting angle is farther to the right.
For example, referring to Fig. 4, the left and right cameras of the camera 201 first capture multiple viewpoint images of a spatial object; then, stereo matching is performed with the disparity map acquisition method provided by the embodiments of this application, matching each point on one viewpoint image of an image pair to its position on the other viewpoint image, and the offset (shift) between the two positions is obtained as the disparity, as shown in Fig. 5.
That is, this application mainly uses the above network structure to obtain the horizontal disparity of a spatial object across multiple viewpoint images. As an example, on this basis, the depth value of the spatial object can be obtained by dividing the product of the spacing between the cameras that capture the viewpoint images and the focal length by the obtained disparity. Put another way, after the pixel-level disparity map is obtained, taking this scaled reciprocal yields a pixel-level object depth map, in which the value of each point represents the distance from the photographed spatial object to the camera 201.
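As a minimal sketch of the disparity-to-depth conversion just described (not part of the original disclosure; the baseline and focal length in the usage comment are assumed values):
```python
import numpy as np

def depth_from_disparity(disparity: np.ndarray, baseline: float, focal: float) -> np.ndarray:
    """Depth Z = baseline * focal / disparity, element-wise over a pixel-level disparity map."""
    d = np.where(disparity > 0, disparity, np.nan)  # guard against zero or invalid disparity
    return baseline * focal / d

# Usage, e.g. with a 0.15 m camera spacing and a focal length of 700 pixels (assumed values):
# depth = depth_from_disparity(disparity_map, baseline=0.15, focal=700.0)
```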
In other words, this application can provide object depth estimation for facilities such as robots or self-driving vehicles. After such a camera-equipped facility captures multiple viewpoint images in real time, the distance from a spatial object to the robot or self-driving vehicle can be estimated with this solution, enabling ranging, which is decisive for the autonomous driving of self-driving vehicles, the normal operation of robots, and the like.
Based on the above description, this application uses a stereo matching method based on the attention mechanism, performing feature learning on the input images through parallel multi-rate dilated convolutional neural networks.
In addition, this application also computes an attention weight map, each channel of which represents the weight coefficients of the output features of the dilated convolutional layer with one specific dilation rate. The output features of the different dilated convolutional layers are multiplied by the attention weight map and then summed in a stacked combination operation, yielding an adaptive combined feature. This lets the network model adaptively apply dilated convolutions with different dilation rates to different pixels and adaptively capture context information of different sizes, so that each pixel can automatically determine the receptive field best suited to its semantics.
On the other hand, this application uses channel attention (channel-wise attention) to adaptively weight and combine different channels, deliberately emphasizing or suppressing the features described by certain channels, so that feature learning is performed more accurately and an accurate pixel-level disparity map is finally predicted.
The related art, by contrast, does not perform feature learning on the input images with parallel multi-rate dilated convolutional neural networks; its receptive field is relatively small, so it cannot effectively capture the context information of pixels.
Because this application uses parallel multi-rate dilated convolutional neural networks, it can capture context information of different sizes through different dilation rates: a larger dilation rate gives a larger receptive field and can thus capture larger context. In addition, this application adaptively learns weights for the different dilation rates, so that the weighted combinations over the dilation rates differ across pixels; learning the optimal combination of dilation rates individually for each pixel avoids the drawback of ignoring the differences between pixels that arises from simply averaging all dilation rates. On the other hand, this application can also adaptively combine the features of different channels based on channel attention, avoiding simple averaging over all channel features; this approach can select more intelligently, emphasizing useful channel features and suppressing useless ones.
The disparity map acquisition method provided by the embodiments of this application is explained in detail below.
It should be noted that descriptions such as first, second, third, and fourth appearing in the following embodiments are only used to distinguish different objects and do not constitute any other specific limitation on those objects.
Fig. 6 is a flowchart of a disparity map acquisition method according to an embodiment of this application. The method is executed by the disparity map acquisition device 202 shown in Fig. 2. Taking two viewpoint images captured of a target spatial region, namely a left image and a right image, as an example, and referring to Fig. 6, the method flow provided by the embodiments of this application includes:
601. Obtain a left image and a right image captured of the target spatial region.
The target spatial region refers to the region covered by the viewfinder range of the camera.
In the embodiments of this application, the target spatial region may be photographed by a binocular camera, by the same camera from different viewing angles, or by two separate cameras; the embodiments of this application impose no specific limitation on this.
Taking a self-driving vehicle provided with a camera as an example, the target spatial region may be the spatial region in front of the vehicle, and the spatial objects in this region include, but are not limited to, moving vehicles, pedestrians, trees on both sides of the road, traffic signs, and billboards. Taking a robot provided with a binocular camera as an example, the target spatial region may be the spatial region in front of the robot, and the spatial objects in this region include, but are not limited to, people and stationary or moving objects.
As shown in Fig. 7, for the same spatial region, the left image and the right image are not identical because the shooting positions, i.e. the shooting angles, of the cameras differ.
602. Perform feature extraction successively on the left image and the right image based on the first network, the first network including multiple non-dilated convolutional layers, and obtain the left feature maps and right feature maps output by the multiple non-dilated convolutional layers.
In this step, the captured left image and right image are input into the first network for feature extraction; the first network is referred to herein as the feature extraction network. That is, the feature extraction network takes the original left and right images as input and extracts their features layer by layer with a deep convolutional neural network.
The feature extraction network may be a network structure such as ResNet or GoogleNet; the embodiments of this application impose no specific limitation on this.
As an example, Fig. 8 shows the specific structure of the feature extraction network, in which every convolutional layer is an ordinary non-dilated convolutional layer. Convolutional layer conv0 has the structure [3*3, 32] * 3 and resolution H/2 * W/2; the structures of convolutional layers conv1, conv2, conv3, and conv4 are as follows:
The resolutions of conv1, conv2, conv3, and conv4 are H/2 * W/2, H/4 * W/4, H/8 * W/8, and H/16 * W/16, respectively.
Here H and W are the height and width of the original left and right images, 3*3 is the convolution kernel size, the last number inside the brackets is the number of output channels, and the number outside the brackets is the repetition count of the bracketed reference structure; conv1, for example, includes 3 instances of the reference structure shown in Fig. 8.
In the embodiments of this application, the features obtained by feature extraction on the input left image are called left feature maps, and the features obtained by feature extraction on the input right image are called right feature maps. Referring to Fig. 8, each of the convolutional layers conv1, conv2, conv3, and conv4 of the feature extraction network outputs a left feature map and a right feature map.
The left and right feature maps output by conv1 are fL1 and fR1, those output by conv2 are fL2 and fR2, those output by conv3 are fL3 and fR3, and those output by conv4 are fL4 and fR4.
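A minimal sketch of a feature extractor of this shape, producing feature maps at 1/2, 1/4, 1/8, and 1/16 resolution (plain convolution stages stand in for the reference blocks of Fig. 8; the channel widths are assumptions):
```python
import torch.nn as nn

def stage(in_ch, out_ch, stride):
    # One stage of 3x3 non-dilated convolutions; a stride of 2 halves the resolution.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv0 = stage(3, 32, stride=2)     # H/2 x W/2
        self.conv1 = stage(32, 32, stride=1)    # H/2 x W/2
        self.conv2 = stage(32, 64, stride=2)    # H/4 x W/4
        self.conv3 = stage(64, 128, stride=2)   # H/8 x W/8
        self.conv4 = stage(128, 128, stride=2)  # H/16 x W/16

    def forward(self, img):
        f0 = self.conv0(img)
        f1 = self.conv1(f0)
        f2 = self.conv2(f1)
        f3 = self.conv3(f2)
        f4 = self.conv4(f3)
        # fL1..fL4 when img is the left image, fR1..fR4 when it is the right image
        return f1, f2, f3, f4
```
Per the text, the same first network is applied to the left image and the right image in turn.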
603. Obtain a first disparity map based on the first left feature map and the first right feature map output by the M-th non-dilated convolutional layer of the first network.
In the embodiments of this application, as an example, the M-th non-dilated convolutional layer is the last non-dilated convolutional layer of the first network.
Continuing with Fig. 8 as an example, the last non-dilated convolutional layer is conv4.
In the embodiments of this application, obtaining the first disparity map based on the first left feature map and the first right feature map output by the M-th non-dilated convolutional layer includes, but is not limited to: concatenating the first left feature map and the first right feature map with a per-pixel shift to obtain a first tensor, and performing dimension reduction on the first tensor to obtain the first disparity map, i.e. the initial disparity map.
The per-pixel shifted concatenation may be performed along either the X axis or the Y axis; the embodiments of this application impose no specific limitation on this. Taking the X axis as an example, a 4D tensor is obtained by per-pixel shifted concatenation along the X axis of the features output by each layer of the feature extraction network.
Per-pixel shifted concatenation of two feature maps along the X axis connects the elements of corresponding rows with a shift: for example, the 1st element of the first row of the left feature map is concatenated with the 0th element of the first row of the right feature map, the 2nd element of the first row of the left feature map with the 1st element of the first row of the right feature map, the 3rd element of the first row of the left feature map with the 2nd element of the first row of the right feature map, and so on.
Taking the per-pixel shifted concatenation of the left feature map fL4 and the right feature map fR4 along the X axis as an example, a 4D (four-dimensional) tensor of dimensions 2c * (dmax/16) * (H/16) * (W/16) is obtained, where c is the channel dimension, i.e. the number of channels, of fL4 and fR4, and dmax is the preset upper limit of the disparity value.
As an example, for dimension reduction, the 4D tensor is input into a 3D (three-dimensional) convolutional layer to obtain a 3D tensor, and the 3D tensor is then passed through a 2D (two-dimensional) convolutional layer to obtain a disparity map d4 of dimensions (H/16) * (W/16). The disparity map may be as shown in Fig. 7.
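A sketch of this shifted concatenation and dimension reduction (step 603); the kernel sizes of the 3D and 2D layers are assumptions, while the tensor shapes follow the text:
```python
import torch
import torch.nn as nn

def shifted_concat(fL: torch.Tensor, fR: torch.Tensor, max_disp: int) -> torch.Tensor:
    """Per-pixel shifted concatenation along X: a 2c x max_disp x H x W tensor per sample."""
    B, c, H, W = fL.shape
    vol = fL.new_zeros(B, 2 * c, max_disp, H, W)
    for d in range(max_disp):
        vol[:, :c, d, :, :] = fL                    # left features, unshifted
        vol[:, c:, d, :, d:] = fR[:, :, :, :W - d]  # right features shifted by d pixels
    return vol

class InitialDisparity(nn.Module):
    def __init__(self, channels: int, max_disp: int):
        super().__init__()
        self.max_disp = max_disp
        self.conv3d = nn.Conv3d(2 * channels, 1, kernel_size=3, padding=1)  # 4D -> 3D tensor
        self.conv2d = nn.Conv2d(max_disp, 1, kernel_size=3, padding=1)      # 3D tensor -> d4

    def forward(self, fL4, fR4):
        vol = shifted_concat(fL4, fR4, self.max_disp)
        cost = self.conv3d(vol).squeeze(1)  # (B, max_disp, H/16, W/16)
        return self.conv2d(cost)            # (B, 1, H/16, W/16), the initial disparity map d4
```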
604. Obtain the second left feature map and the second right feature map output by the (M-1)-th non-dilated convolutional layer.
Continuing with Fig. 8 as an example, the (M-1)-th non-dilated convolutional layer is conv3, the second left feature map is fL3, and the second right feature map is fR3.
605. Obtain a second disparity map using the second network according to the first disparity map, the second left feature map, and the second right feature map, the second network including one non-dilated convolutional layer and multiple parallel dilated convolutional layers with different dilation rates.
After the initial disparity map is obtained, the attention network shown in Fig. 9 uses parallel multi-rate dilated convolutional neural networks and adaptive weight prediction to apply differentiated weighted combinations of multi-rate dilated convolutions to different pixels.
As an example, referring to Fig. 9, the attention network includes four parallel convolutional layers with different dilation rates, whose dilation rates are r0, r1, r2, and r3, respectively; the attention module also includes one non-dilated convolutional layer.
In the embodiments of this application, obtaining the second disparity map using the second network according to the first disparity map, the second left feature map, and the second right feature map includes, but is not limited to: obtaining an intermediate feature map according to the first disparity map, the second left feature map, and the second right feature map; and performing first processing on the intermediate feature map based on the second network to obtain the second disparity map. That is, the intermediate feature map is input into the second network, and the second disparity map is finally obtained based on the output of the second network.
As an example, obtaining the intermediate feature map according to the first disparity map, the second left feature map, and the second right feature map includes, but is not limited to: performing upsampling on the first disparity map; warping the second right feature map according to the first disparity map to obtain a warped feature map; and then concatenating the upsampling result of the first disparity map, the second left feature map, and the warped feature map along the channel dimension to obtain the intermediate feature map.
For example, as shown in Fig. 9, the disparity map d4 is first upsampled to obtain d4_up of size (H/8) * (W/8); the left feature map fL3 and the right feature map fR3 are obtained, and the right feature map fR3 is warped according to the disparity map d4 to obtain the feature map fL3_w mapped onto the left feature map fL3, fL3_w being the warped feature map above; then fL3, fL3_w, and d4_up are concatenated along the channel dimension to obtain f3_c, which is the intermediate feature map above and serves as the input to the second network.
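A sketch of the warping and concatenation that build the intermediate feature map f3_c (bilinear sampling via grid_sample, and the doubling of disparity values after 2x upsampling, are assumptions of this sketch rather than details stated in the text):
```python
import torch
import torch.nn.functional as F

def warp_right_to_left(fR: torch.Tensor, disp: torch.Tensor) -> torch.Tensor:
    """Warp the right feature map to the left view: out(y, x) = fR(y, x - disp(y, x))."""
    B, _, H, W = fR.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    xs = xs.to(fR).unsqueeze(0) - disp.squeeze(1)  # shift sampling positions by disparity
    ys = ys.to(fR).unsqueeze(0).expand_as(xs)
    grid = torch.stack((2 * xs / (W - 1) - 1, 2 * ys / (H - 1) - 1), dim=-1)  # to [-1, 1]
    return F.grid_sample(fR, grid, align_corners=True)

def intermediate_feature(fL3, fR3, d4):
    # Upsample d4 to (H/8) x (W/8); disparity values scale with resolution, hence the factor 2.
    d4_up = 2 * F.interpolate(d4, scale_factor=2, mode="bilinear", align_corners=False)
    fL3_w = warp_right_to_left(fR3, d4_up)      # the warped feature map
    return torch.cat((fL3, fL3_w, d4_up), dim=1)  # f3_c, concatenated along channels
```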
In the embodiments of this application, referring to Fig. 10, performing the first processing on the intermediate feature map based on the second network to obtain the second disparity map includes, but is not limited to, the following steps:
605a. Based on each dilated convolutional layer of the second network, perform a convolution operation on the input intermediate feature map respectively, obtaining multiple feature tensors whose number is consistent with the number of dilated convolutional layers included in the second network.
As an example, as shown in Fig. 9, f3_c is input into a parallel multi-rate dilated convolutional neural network. For instance, the network is set with 4 different dilation rates, r0, r1, r2, and r3, corresponding to hole spacings of different sizes; the 4 convolutional layers with different dilation rates are applied to f3_c in parallel, yielding 4 feature tensors f3_i at resolution (H/8) * (W/8), one per dilation rate, with i taking values 1 to 4.
605b. Perform a convolution operation on the intermediate feature map based on the non-dilated convolutional layer in the second network to obtain a first weight tensor.
Each channel of the first weight tensor represents the weight coefficients of the feature tensor output by the dilated convolutional layer of one dilation rate.
As an example, as shown in Fig. 9, f3_c is input into an ordinary non-dilated convolutional layer, obtaining a weight tensor w3 of dimensions 4 * (H/8) * (W/8), in which each of the 4 channels represents the weight coefficients of the output features of the dilated convolutional layer with one specific dilation rate.
605c. For each feature tensor, obtain the first product of the feature tensor and the first weight tensor; perform addition on all obtained first products to obtain a combined feature.
As an example, as shown in Fig. 9, for any element, the weight coefficients of that pixel on the 4 channels of the weight tensor w3 are multiplied with the feature tensors output by the 4 dilated convolutional layers of different dilation rates respectively, and the results are then added, giving the adaptive combined feature z3. Expressed mathematically: z3 = sum over i of (w3_i * f3_i) for i = 1 to 4, where w3_i is the i-th channel of the weight tensor w3, f3_i is the feature tensor output by the i-th dilated convolutional layer, and * denotes element-wise multiplication with each weight channel broadcast over the feature channels.
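A sketch of steps 605a to 605c combined into one module (the four dilation rates, the output width, and the softmax normalization of the weight map are assumptions of this sketch):
```python
import torch
import torch.nn as nn

class AdaptiveDilatedCombination(nn.Module):
    """Parallel dilated convolutions mixed by a learned per-pixel weight map (steps 605a-605c)."""
    def __init__(self, in_ch: int, out_ch: int = 128, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.weight_conv = nn.Conv2d(in_ch, len(rates), 3, padding=1)  # one channel per rate

    def forward(self, f3_c):
        feats = [branch(f3_c) for branch in self.branches]  # 605a: one tensor per dilation rate
        w = torch.softmax(self.weight_conv(f3_c), dim=1)    # 605b: per-pixel weights w3
        # 605c: z3 = sum_i w3_i * f3_i, each weight channel broadcast over feature channels
        return sum(w[:, i:i + 1] * f for i, f in enumerate(feats))
```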
605d. Perform second processing on the combined feature to obtain a second weight tensor.
As an example, as shown in Fig. 9, the channel attention mechanism introduced by this application passes the feature z3 obtained above through several layers and activation layers to obtain a channel weight vector s3, which is referred to herein as the second weight tensor.
In a possible implementation, referring to Fig. 9, the combined feature may be passed successively through a global pooling layer, a first fully connected layer, a first activation layer (e.g. ReLU), a second fully connected layer, and a second activation layer (e.g. sigmoid) to obtain the second weight tensor.
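This matches the familiar squeeze-and-excitation pattern of channel attention; a sketch of step 605d (the channel width and reduction ratio are assumptions):
```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Global pool -> FC -> ReLU -> FC -> sigmoid, one weight per channel (step 605d)."""
    def __init__(self, channels: int = 128, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, z3):
        B, C, _, _ = z3.shape
        s3 = self.fc(self.pool(z3).view(B, C)).view(B, C, 1, 1)  # channel weight vector s3
        return z3 * s3  # the attention feature, at the same (H/8) x (W/8) resolution
```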
605e. Obtain the second disparity map according to the combined feature and the second weight tensor.
In the embodiments of this application, obtaining the second disparity map according to the combined feature and the second weight tensor includes, but is not limited to: obtaining a second product of the combined feature and the second weight tensor, and performing dimension reduction on the second product to obtain the second disparity map.
As an example, as shown in Fig. 9, the adaptive combined feature z3 is multiplied by the channel weight vector s3 to obtain the final attention feature f3', which keeps the resolution (H/8) * (W/8); with a channel dimension of 128, f3' is a tensor of size 128 * (H/8) * (W/8).
Fig. 11 shows the original image captured by the left camera of the binocular camera (top left), and the attention maps at resolution (H/8) * (W/8) (top right), (H/4) * (W/4) (bottom left), and (H/2) * (W/2) (bottom right).
After f3' is obtained, dimension reduction with a 2D convolutional layer yields the disparity map d3 at resolution (H/8) * (W/8).
In another possible implementation, as shown in Fig. 9, the obtained f3' may be input into the attention network again to update f3', and so on, thereby obtaining an updated disparity map d3.
In another possible implementation, referring to Fig. 12, after the first disparity map and the second disparity map are obtained, the embodiments of this application further include the following steps:
606. Obtain the third left feature map and the third right feature map output by the (M-2)-th non-dilated convolutional layer; obtain a third disparity map using the second network according to the second disparity map, the third left feature map, and the third right feature map.
Taking Fig. 8 as an example, the disparity map d4 corresponding to conv4 is obtained first; then, from d4, the disparity map d3 corresponding to conv3 is obtained; and from d3, the disparity map d2 corresponding to conv2 is obtained.
607. Repeat the process of obtaining the disparity map corresponding to the previous non-dilated convolutional layer according to the left and right feature maps output by the previous non-dilated convolutional layer and the disparity map corresponding to the following non-dilated convolutional layer, until the N-th non-dilated convolutional layer of the first network is reached.
N is a positive integer less than M. As an example, for the network structure shown in Fig. 8, the value of N is 2.
In Fig. 8, since conv1 is the second non-dilated convolutional layer, the process ends after the disparity map d1 corresponding to conv1 is obtained from the disparity map d2; that is, the attention network is applied continuously over multiple rounds of iteration until the disparity map d1 is obtained.
608. Output the disparity map corresponding to the N-th non-dilated convolutional layer as the final disparity map.
As an example, as shown in Fig. 8, among the disparity maps d1, d2, d3, and d4, d1 has the highest resolution, (H/2) * (W/2), so d1 is taken as the final output of the attention network. A visualization of the disparity map is shown in Fig. 7.
It should be noted that each of the disparity maps d1, d2, d3, and d4 mentioned above may include both a left-view disparity map and a right-view disparity map; the embodiments of this application are illustrated with only one of them as an example.
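The coarse-to-fine loop of steps 606 to 608 can be sketched as follows, where refine stands for one pass of the attention network (warp, concatenate, combine dilated branches, reduce to a disparity map) and the module names refer to the sketches above:
```python
def predict_final_disparity(feats_L, feats_R, initial_disp, refine):
    # feats_L / feats_R hold (fL1, fL2, fL3, fL4) / (fR1, fR2, fR3, fR4).
    d = initial_disp  # d4, from the shifted-concat cost volume at H/16 x W/16
    for fL, fR in zip(reversed(feats_L[:-1]), reversed(feats_R[:-1])):
        d = refine(fL, fR, d)  # produces d3, then d2, then d1
    return d  # d1, at (H/2) x (W/2), the final disparity map
```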
With the method provided by the embodiments of this application, after the left image and the right image of a target spatial region are obtained, the first network, i.e. the feature extraction network, first extracts features from the left image and the right image successively, and then a disparity map is obtained based on the feature information extracted by the feature extraction network and the second network, i.e. the attention network. Because the second network includes multiple parallel dilated convolutional layers with different dilation rates, it can effectively capture the context information of pixels, so the disparity map obtained with the above network structure is more accurate; for example, depth estimation of spatial objects based on the disparity map can also be more accurate. This acquisition approach works well.
Put another way, because parallel multi-rate dilated convolutional neural networks are used, context information of different sizes can be captured through different dilation rates: a larger dilation rate gives a larger receptive field and can thus capture larger context.
In addition, this application adaptively learns weights for the different dilation rates, so that the weighted combinations over the dilation rates differ across pixels; learning the optimal combination of dilation rates individually for each pixel avoids the drawback of ignoring the differences between pixels that arises from simply averaging all dilation rates.
Furthermore, this application can also adaptively combine the features of different channels based on channel attention, avoiding simple averaging over all channel features; this approach can select more intelligently, emphasizing useful channel features and suppressing useless ones.
It should be noted that the above embodiments illustrate the way of obtaining a disparity map using only two viewpoint images of a target spatial region. In another embodiment, the embodiments of this application also support obtaining multiple viewpoint images of a target spatial region and computing disparity maps based on the multiple viewpoint images. That is, referring to Fig. 13, the disparity map acquisition method provided by the embodiments of this application includes:
1301. Obtain multiple viewpoint images of the target spatial region.
The multiple viewpoint images may be obtained by a camera photographing the target spatial region from different viewing angles, for example several, a dozen, or dozens of viewpoint images captured from left to right or from right to left.
In the embodiments of this disclosure, the multiple viewpoint images may be captured by a multi-lens camera, by multiple parallel cameras, or by one camera moved in parallel several times; the embodiments of this application impose no specific limitation on this.
1302. For any image pair among the multiple viewpoint images, perform feature extraction successively on the left image and the right image of the image pair based on the first network, the first network including multiple non-dilated convolutional layers, and obtain the left feature maps and right feature maps output by the multiple non-dilated convolutional layers.
As an example, any two viewpoint images among the multiple viewpoint images can form an image pair, and for any image pair, a disparity map can be obtained through steps 1302 to 1308.
1303. Obtain a first disparity map based on the first left feature map and the first right feature map output by the M-th non-dilated convolutional layer of the first network.
1304. Obtain the second left feature map and the second right feature map output by the (M-1)-th non-dilated convolutional layer; obtain a second disparity map using the second network according to the first disparity map, the second left feature map, and the second right feature map, the second network including multiple dilated convolutional layers with different dilation rates.
In addition, referring to Figure 14, after the first disparity map is obtained, the method further includes the following steps:
1305. Obtain the third left feature map and the third right feature map output by the (M-2)-th non-dilated convolutional layer.
1306. Obtain a third disparity map using the second network according to the second disparity map, the third left feature map and the third right feature map.
1307. Repeat the process of obtaining, using the second network, the disparity map corresponding to the previous non-dilated convolutional layer according to the left feature map and the right feature map output by the previous non-dilated convolutional layer and the disparity map corresponding to the following non-dilated convolutional layer (sketched below), until the N-th non-dilated convolutional layer of the first network is reached.
1308. Use the disparity map corresponding to the N-th non-dilated convolutional layer as the final disparity map of the image pair.
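A minimal Python sketch of this coarse-to-fine loop (steps 1303 to 1308), assuming hypothetical helpers first_disparity (the initial matching of step 1303) and refine (the second network of steps 1304 to 1307); it is an illustration, not the reference implementation:

def estimate_disparity(left_feats, right_feats, M, N, first_disparity, refine):
    """left_feats[i], right_feats[i]: feature maps of the (i+1)-th non-dilated layer."""
    # Step 1303: coarsest disparity map from the M-th layer's features.
    disparity = first_disparity(left_feats[M - 1], right_feats[M - 1])
    # Steps 1304-1307: walk back from layer M-1 down to layer N, refining each time.
    for layer in range(M - 2, N - 2, -1):   # 0-based indices M-2, M-3, ..., N-1
        disparity = refine(disparity, left_feats[layer], right_feats[layer])
    # Step 1308: the disparity map of the N-th layer is the final result.
    return disparity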
With the method provided by the embodiments of the present application, after multiple viewpoint images are acquired, for any image pair among them the left image and the right image are first passed through the first network, i.e. the feature extraction network, layer by layer; a disparity map is then obtained from the extracted feature information using the second network, i.e. the attention network. Because the second network contains multiple dilated convolutional layers with different dilation rates, it can effectively capture the background information of each pixel, so the disparity map obtained with this network structure is more accurate; for example, depth estimation of spatial objects based on this disparity map is also more accurate. This acquisition approach therefore performs well.
In another embodiment, take three-view (trinocular) stereo vision as an example: viewpoint images are captured by cameras from three viewing angles, so three viewpoint images can be obtained, each corresponding to one shooting angle.
Suppose viewpoint image A, viewpoint image B and viewpoint image C denote the three viewpoint images shot from left to right. Then, with viewpoint image B as the center, viewpoint image A and viewpoint image B can form an image pair, and stereo matching is performed on this image pair to obtain the disparity map between viewpoint image A and viewpoint image B; in addition, viewpoint image B and viewpoint image C can form an image pair, and stereo matching is performed on this image pair to obtain the disparity map between viewpoint image B and viewpoint image C.
As another example, stereo matching may also be performed on viewpoint image A and viewpoint image C; the embodiments of the present application do not specifically limit this.
In another embodiment, take five-view stereo vision as an example: viewpoint images are captured by cameras from five viewing angles, so five viewpoint images can be obtained, each corresponding to one shooting angle.
Suppose viewpoint image A, viewpoint image B, viewpoint image C, viewpoint image D and viewpoint image E denote the five viewpoint images shot from left to right. Then, with viewpoint image C as the center, viewpoint image A and viewpoint image C can form an image pair, and stereo matching is performed on this image pair to obtain the disparity map between viewpoint image A and viewpoint image C; viewpoint image B and viewpoint image C can form an image pair, and stereo matching is performed on this image pair to obtain the disparity map between viewpoint image B and viewpoint image C; viewpoint image C and viewpoint image D can form an image pair, and stereo matching is performed on this image pair to obtain the disparity map between viewpoint image C and viewpoint image D; and viewpoint image C and viewpoint image E can form an image pair, and stereo matching is performed on this image pair to obtain the disparity map between viewpoint image C and viewpoint image E.
As another example, other image pairs may also be formed and the disparity maps between them obtained (see the pairing sketch below); the embodiments of the present application do not specifically limit this.
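A purely hypothetical illustration of the center-view pairing described above (the function name and the index convention are not from the patent):

def center_pairs(n_views: int):
    """Indices of the image pairs formed around the middle view; for five
    views 0..4 centered on view 2 this yields (0,2), (1,2), (2,3), (2,4)."""
    c = n_views // 2
    return [(min(i, c), max(i, c)) for i in range(n_views) if i != c]

print(center_pairs(5))  # [(0, 2), (1, 2), (2, 3), (2, 4)]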
In another embodiment, for multi-view stereo vision, after multiple disparity maps are obtained, depth information may be derived from each disparity map separately. Alternatively, after the multiple disparity maps are obtained by multi-view stereo matching, they may be fused into a single disparity map according to a fusion criterion and depth information derived from the fused disparity map; the embodiments of the present application do not specifically limit this.
In another embodiment, after the disparity map is obtained, depth estimation may also be performed on objects based on the disparity map, and the estimated depth values are relatively accurate. The depth estimation process is as follows: obtain the distance between the two cameras that shot the image pair and the focal length of the cameras; obtain the third product result of the distance and the focal length, and take the ratio of the third product result to the final disparity map of the image pair as the depth values of the spatial objects contained in the target spatial region. It should be noted that the distance between the two cameras refers to the distance between the cameras' optical centers.
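This is the standard stereo triangulation relation, depth = baseline × focal length ÷ disparity. A minimal sketch follows, where the array values and the KITTI-like baseline and focal length are illustrative, not taken from the patent:

import numpy as np

def depth_from_disparity(disparity, baseline_m, focal_px, eps=1e-6):
    """Per-pixel depth Z = B * f / d (baseline in meters, focal length in pixels)."""
    return baseline_m * focal_px / np.maximum(disparity, eps)

# Example with a 0.54 m baseline and a 721-pixel focal length.
disparity = np.array([[10.0, 20.0], [40.0, 80.0]])
print(depth_from_disparity(disparity, 0.54, 721.0))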
In other words, the present application can provide an object depth estimation capability for robots, self-driving vehicles and similar platforms. After such a platform, equipped with a camera, captures viewpoint images in real time, this scheme can be used to estimate the distance from spatial objects to the robot or self-driving vehicle, i.e. to perform ranging, which plays a decisive role in the automated driving of self-driving vehicles and the normal operation of robots.
Figure 15 is a schematic structural diagram of a disparity map acquisition apparatus provided by the embodiments of the present application. Referring to Figure 15, the apparatus includes:
a first acquisition module 1501, configured to acquire multiple viewpoint images of a target spatial region;
an extraction module 1502, configured to, for any image pair among the multiple viewpoint images, perform feature extraction on the left image and the right image of the image pair in turn based on a first network, the first network comprising multiple non-dilated convolutional layers;
a second acquisition module, configured to obtain the left feature maps and right feature maps output by the multiple non-dilated convolutional layers;
the second acquisition module 1503 being further configured to obtain a first disparity map based on the first left feature map and the first right feature map output by the M-th non-dilated convolutional layer of the first network, M being a positive integer;
the second acquisition module 1503 being further configured to obtain the second left feature map and the second right feature map output by the (M-1)-th non-dilated convolutional layer;
the second acquisition module 1503 being further configured to obtain a second disparity map using a second network according to the first disparity map, the second left feature map and the second right feature map, the second network comprising multiple dilated convolutional layers with different dilation rates.
With the apparatus provided by the embodiments of the present application, after multiple viewpoint images are acquired, for any image pair among them the left image and the right image are first passed through the first network, i.e. the feature extraction network, layer by layer; a disparity map is then obtained from the extracted feature information using the second network, i.e. the attention network. Because the second network contains multiple dilated convolutional layers with different dilation rates, it can effectively capture the background information of each pixel, so the disparity map obtained with this network structure is more accurate; for example, depth estimation of spatial objects based on this disparity map is also more accurate. This acquisition approach therefore performs well.
In one possible implementation, the second acquisition module 1503 is further configured to: obtain the third left feature map and the third right feature map output by the (M-2)-th non-dilated convolutional layer; obtain a third disparity map using the second network according to the second disparity map, the third left feature map and the third right feature map; repeat the process of obtaining, using the second network, the disparity map corresponding to the previous non-dilated convolutional layer according to the left feature map and the right feature map output by the previous non-dilated convolutional layer and the disparity map corresponding to the following non-dilated convolutional layer, until the N-th non-dilated convolutional layer of the first network, N being a positive integer smaller than M; and use the disparity map corresponding to the N-th non-dilated convolutional layer as the final disparity map of the image pair.
In one possible implementation, the second acquisition module 1503 is further configured to obtain an intermediate feature map according to the first disparity map, the second left feature map and the second right feature map, and to perform first processing on the intermediate feature map based on the second network to obtain the second disparity map.
In one possible implementation, the second acquisition module 1503 is further configured to: perform convolution operations on the intermediate feature map respectively based on multiple parallel dilated convolutional layers with different dilation rates, obtaining multiple feature tensors whose number is consistent with the number of the multiple dilated convolutional layers; perform a convolution operation on the intermediate feature map based on a non-dilated convolutional layer in the second network, obtaining a first weight tensor, one channel of the first weight tensor representing the weight coefficients of the feature tensor output by the dilated convolutional layer of one dilation rate; for each feature tensor among the multiple feature tensors, obtain the first product result of the feature tensor and the first weight tensor; perform an addition operation on all the obtained first product results to obtain a combined feature; perform second processing on the combined feature to obtain a second weight tensor; and obtain the second disparity map according to the combined feature and the second weight tensor.
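A PyTorch-style sketch of this first processing under stated assumptions: the channel widths, the set of dilation rates and the use of a softmax to normalize the per-rate weights are illustrative choices, not taken from the patent:

import torch
import torch.nn as nn

class RateAttentionFusion(nn.Module):
    """Parallel dilated branches fused by a learned per-pixel, per-rate weight."""
    def __init__(self, channels, rates=(1, 2, 4, 8)):
        super().__init__()
        # One dilated convolutional layer per dilation rate (parallel branches).
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates
        )
        # Non-dilated convolution with one output channel per branch:
        # this plays the role of the first weight tensor.
        self.rate_weights = nn.Conv2d(channels, len(rates), 3, padding=1)

    def forward(self, x):                    # x: intermediate feature map (B,C,H,W)
        feats = [branch(x) for branch in self.branches]
        w = torch.softmax(self.rate_weights(x), dim=1)  # per-pixel weight per rate
        # Multiply each feature tensor by its weight channel, then add them up.
        return sum(f * w[:, k:k + 1] for k, f in enumerate(feats))  # combined feature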
In one possible implementation, the second acquisition module 1503 is further configured to connect the first left feature map and the first right feature map pixel by pixel with displacement to obtain a first tensor, and to perform dimension reduction on the first tensor to obtain the first disparity map.
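One common way to realize this displaced pixel-by-pixel connection is a cost volume that concatenates the left feature map with the right feature map shifted over candidate disparities, followed by a learned dimension reduction; the max_disp value and the 1x1 convolution below are assumptions for illustration, not values from the patent:

import torch
import torch.nn as nn

def shifted_concat(left, right, max_disp=8):
    """Concatenate left features with right features shifted by each candidate
    displacement d = 0 .. max_disp-1 (the displaced pixel-by-pixel connection)."""
    b, c, h, w = left.shape
    volume = left.new_zeros(b, 2 * c * max_disp, h, w)
    for d in range(max_disp):
        volume[:, 2 * c * d:2 * c * d + c] = left
        if d == 0:
            volume[:, 2 * c * d + c:2 * c * (d + 1)] = right
        else:
            volume[:, 2 * c * d + c:2 * c * (d + 1), :, d:] = right[..., :-d]
    return volume  # the "first tensor"

# Dimension reduction to a one-channel disparity map (illustrative 1x1 conv head).
reduce_to_disparity = nn.Conv2d(2 * 64 * 8, 1, kernel_size=1)  # assumes C=64, max_disp=8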
In one possible implementation, the second acquisition module 1503 is further configured to perform upsampling processing on the first disparity map; perform mapping processing on the second right feature map according to the first disparity map to obtain a mapped feature map; and connect the upsampling result of the first disparity map, the second left feature map and the mapped feature map along the channel dimension to obtain the intermediate feature map.
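A sketch of assembling the intermediate feature map: upsample the coarse disparity map, use it to warp (map) the second right feature map toward the left view, and concatenate everything along the channel dimension. The bilinear grid_sample warp is one common realization assumed here, not necessarily the patent's:

import torch
import torch.nn.functional as F

def intermediate_features(disp_coarse, left_feat, right_feat):
    b, _, h, w = left_feat.shape
    disp = F.interpolate(disp_coarse, size=(h, w), mode="bilinear",
                         align_corners=False)       # upsampled first disparity map
    # Warp right features to the left view: sample at x - d(x, y).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs.to(disp) - disp[:, 0]                   # shift columns by the disparity
    grid_x = 2 * xs / (w - 1) - 1                   # normalize to [-1, 1]
    grid_y = (2 * ys.to(disp) / (h - 1) - 1).expand_as(grid_x)
    grid = torch.stack((grid_x, grid_y), dim=-1)
    mapped = F.grid_sample(right_feat, grid, align_corners=False)  # mapped feature map
    return torch.cat((disp, left_feat, mapped), dim=1)  # intermediate feature map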
In one possible implementation, the second acquisition module 1503 is further configured to obtain the second product result of the combined feature and the second weight tensor, and to perform dimension reduction on the second product result to obtain the second disparity map.
In one possible implementation, the second acquisition module 1503 is further configured to pass the combined feature successively through a global pooling layer, a first fully connected layer, a first activation layer, a second fully connected layer and a second activation layer to obtain the second weight tensor.
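This second processing matches a squeeze-and-excitation style channel attention; a sketch follows, where the reduction ratio and the ReLU/sigmoid activation choices are assumptions:

import torch.nn as nn

class ChannelAttention(nn.Module):
    """Global pooling -> FC -> activation -> FC -> activation, as described."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                     # global pooling layer
        self.fc1 = nn.Linear(channels, channels // reduction)   # first FC layer
        self.act1 = nn.ReLU(inplace=True)                       # first activation layer
        self.fc2 = nn.Linear(channels // reduction, channels)   # second FC layer
        self.act2 = nn.Sigmoid()                                # second activation layer

    def forward(self, combined):             # combined feature, shape (B, C, H, W)
        b, c, _, _ = combined.shape
        s = self.pool(combined).view(b, c)
        w = self.act2(self.fc2(self.act1(self.fc1(s))))         # second weight tensor
        return w.view(b, c, 1, 1)            # broadcastable over H and W

Broadcasting this weight over the combined feature and reducing to one channel then yields the second disparity map, i.e. the second product result and dimension reduction described in the preceding paragraph.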
In one possible implementation, the apparatus further includes:
a third acquisition module, configured to obtain the distance between the two cameras that shot the left image and the right image, and the focal length of the cameras;
a fourth acquisition module, configured to obtain the third product result of the distance and the focal length, and to take the ratio of the third product result to the disparity map corresponding to the N-th non-dilated convolutional layer of the first network as the depth values of the spatial objects contained in the target spatial region.
All of the above optional technical solutions may be combined in any manner to form optional embodiments of the present disclosure, and they are not described here one by one.
It should be understood that when the disparity map acquisition apparatus provided by the above embodiments obtains a disparity map, the division into the above functional modules is only used as an example; in practical applications, the above functions may be assigned to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the disparity map acquisition apparatus provided by the above embodiments and the disparity map acquisition method embodiments belong to the same concept; see the method embodiments for the specific implementation process, which is not repeated here.
Figure 16 shows a structural block diagram of a disparity map acquisition device 1600 provided by an exemplary embodiment of the present application. The device 1600 may be a portable mobile terminal, for example a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer or a desktop computer. The device 1600 may also be called user equipment, a portable terminal, a laptop terminal, a desktop terminal or other names.
In general, the device 1600 includes a processor 1601 and a memory 1602.
The processor 1601 may include one or more processing cores, for example a 4-core processor or an 8-core processor. The processor 1601 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) and PLA (Programmable Logic Array). The processor 1601 may also include a main processor and a coprocessor: the main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 1601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to show. In some embodiments, the processor 1601 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
The memory 1602 may include one or more computer-readable storage media, which may be non-transitory. The memory 1602 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 1602 is used to store at least one instruction, which is executed by the processor 1601 to implement the disparity map acquisition method provided by the method embodiments of the present application.
In some embodiments, the device 1600 optionally further includes a peripheral device interface 1603 and at least one peripheral device. The processor 1601, the memory 1602 and the peripheral device interface 1603 may be connected by buses or signal lines. Each peripheral device may be connected to the peripheral device interface 1603 by a bus, a signal line or a circuit board. Specifically, the peripheral devices include at least one of a radio frequency circuit 1604, a touch display screen 1605, a camera 1606, an audio circuit 1607, a positioning component 1608 and a power supply 1609.
The peripheral device interface 1603 may be used to connect at least one I/O (Input/Output)-related peripheral device to the processor 1601 and the memory 1602. In some embodiments, the processor 1601, the memory 1602 and the peripheral device interface 1603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1601, the memory 1602 and the peripheral device interface 1603 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 1604 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1604 communicates with communication networks and other communication devices via electromagnetic signals, converting electric signals into electromagnetic signals for transmission, or converting received electromagnetic signals into electric signals. Optionally, the radio frequency circuit 1604 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card and so on. The radio frequency circuit 1604 may communicate with other terminals through at least one wireless communication protocol, including but not limited to the World Wide Web, metropolitan area networks, intranets, the various generations of mobile communication networks (2G, 3G, 4G and 5G), wireless local area networks and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1604 may also include NFC (Near Field Communication)-related circuits, which is not limited in the present application.
The display screen 1605 is used to display a UI (User Interface), which may include graphics, text, icons, video and any combination thereof. When the display screen 1605 is a touch display screen, it also has the ability to acquire touch signals on or above its surface. A touch signal may be input to the processor 1601 as a control signal for processing. At this time, the display screen 1605 may also be used to provide virtual buttons and/or a virtual keyboard, also called soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1605, arranged on the front panel of the device 1600; in other embodiments, there may be at least two display screens 1605, arranged on different surfaces of the device 1600 or in a folding design; in still other embodiments, the display screen 1605 may be a flexible display screen arranged on a curved surface or a folding surface of the device 1600. The display screen 1605 may even be set to a non-rectangular irregular shape, i.e. a shaped screen. The display screen 1605 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera assembly 1606 is used to capture images or videos. Optionally, the camera assembly 1606 includes a front camera and a rear camera. Generally, the front camera is arranged on the front panel of the terminal and the rear camera on its back. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions, or other fused shooting functions are realized. In some embodiments, the camera assembly 1606 may also include a flash, which may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
The audio circuit 1607 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, convert them into electric signals, and input them to the processor 1601 for processing or to the radio frequency circuit 1604 for voice communication. For stereo collection or noise reduction, there may be multiple microphones arranged at different parts of the device 1600. The microphone may also be an array microphone or an omnidirectional microphone. The speaker is used to convert electric signals from the processor 1601 or the radio frequency circuit 1604 into sound waves. The speaker may be a traditional film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can not only convert electric signals into sound waves audible to humans, but also convert electric signals into sound waves inaudible to humans for purposes such as ranging. In some embodiments, the audio circuit 1607 may also include a headphone jack.
The positioning component 1608 is used to locate the current geographic position of the device 1600 to implement navigation or LBS (Location Based Service). The positioning component 1608 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia or the Galileo system of the European Union.
The power supply 1609 is used to supply power to the various components in the device 1600. The power supply 1609 may be alternating current, direct current, a disposable battery or a rechargeable battery. When the power supply 1609 includes a rechargeable battery, the rechargeable battery may be a wired charging battery or a wireless charging battery: a wired charging battery is charged through a wired line, and a wireless charging battery is charged through a wireless coil. The rechargeable battery may also support fast charging technology.
In some embodiments, the device 1600 further includes one or more sensors 1610, including but not limited to an acceleration sensor 1611, a gyroscope sensor 1612, a pressure sensor 1613, a fingerprint sensor 1614, an optical sensor 1615 and a proximity sensor 1616.
The acceleration sensor 1611 can detect the magnitude of acceleration on the three coordinate axes of the coordinate system established with the device 1600. For example, the acceleration sensor 1611 can be used to detect the components of gravitational acceleration on the three coordinate axes. The processor 1601 can control the touch display screen 1605 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1611. The acceleration sensor 1611 can also be used to collect motion data of a game or a user.
The gyroscope sensor 1612 can detect the body direction and rotation angle of the device 1600, and can cooperate with the acceleration sensor 1611 to collect the user's 3D actions on the device 1600. From the data collected by the gyroscope sensor 1612, the processor 1601 can implement functions such as motion sensing (for example, changing the UI according to the user's tilt operation), image stabilization during shooting, game control and inertial navigation.
The pressure sensor 1613 may be arranged on the side frame of the device 1600 and/or the lower layer of the touch display screen 1605. When the pressure sensor 1613 is arranged on the side frame of the device 1600, it can detect the user's grip signal on the device 1600, and the processor 1601 performs left/right-hand recognition or shortcut operations according to the grip signal collected by the pressure sensor 1613. When the pressure sensor 1613 is arranged at the lower layer of the touch display screen 1605, the processor 1601 controls the operability controls on the UI according to the user's pressure operation on the touch display screen 1605. The operability controls include at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 1614 is used to collect the user's fingerprint, and the processor 1601 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 1614, or the fingerprint sensor 1614 identifies the user's identity according to the collected fingerprint. When the identity is recognized as trusted, the processor 1601 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings and so on. The fingerprint sensor 1614 may be arranged on the front, back or side of the device 1600. When a physical button or a manufacturer logo is provided on the device 1600, the fingerprint sensor 1614 may be integrated with the physical button or the manufacturer logo.
The optical sensor 1615 is used to collect the ambient light intensity. In one embodiment, the processor 1601 can control the display brightness of the touch display screen 1605 according to the ambient light intensity collected by the optical sensor 1615: when the ambient light intensity is high, the display brightness of the touch display screen 1605 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 1605 is turned down. In another embodiment, the processor 1601 can also dynamically adjust the shooting parameters of the camera assembly 1606 according to the ambient light intensity collected by the optical sensor 1615.
The proximity sensor 1616, also called a distance sensor, is generally arranged on the front panel of the device 1600 and is used to collect the distance between the user and the front of the device 1600. In one embodiment, when the proximity sensor 1616 detects that the distance between the user and the front of the device 1600 gradually decreases, the processor 1601 controls the touch display screen 1605 to switch from the screen-on state to the screen-off state; when the proximity sensor 1616 detects that the distance between the user and the front of the device 1600 gradually increases, the processor 1601 controls the touch display screen 1605 to switch from the screen-off state to the screen-on state.
Those skilled in the art can understand that the structure shown in Figure 16 does not constitute a limitation on the device 1600, which may include more or fewer components than shown, combine certain components, or adopt a different component arrangement.
Figure 17 is a schematic structural diagram of a disparity map acquisition device provided by the embodiments of the present application. The device may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 1701 and one or more memories 1702, at least one instruction being stored in the memory 1702 and loaded and executed by the processor 1701 to implement the disparity map acquisition method provided by the above method embodiments. Of course, the device may also have components such as a wired or wireless network interface, a keyboard and an input/output interface for input and output, and may also include other components for realizing device functions.
In an exemplary embodiment, a computer-readable storage medium is also provided, for example a memory including instructions that can be executed by the processor in a terminal to complete the disparity map acquisition method in the above embodiments. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device or the like.
In addition, an embodiment of the present application further provides a disparity map acquisition system, which includes a camera and the disparity map acquisition device shown in Figure 16 or Figure 17.
The camera is configured to shoot a target spatial region to obtain multiple viewpoint images.
The disparity map acquisition device includes a processor and a memory, at least one instruction being stored in the memory, the at least one instruction being loaded and executed by the processor to implement:
acquiring the multiple viewpoint images, and for any image pair among the multiple viewpoint images, performing feature extraction on the left image and the right image of the image pair in turn based on a first network, the first network comprising multiple non-dilated convolutional layers, and obtaining the left feature maps and right feature maps output by the multiple non-dilated convolutional layers;
obtaining a first disparity map based on the first left feature map and the first right feature map output by the M-th non-dilated convolutional layer of the first network, M being a positive integer;
obtaining the second left feature map and the second right feature map output by the (M-1)-th non-dilated convolutional layer;
obtaining a second disparity map using a second network according to the first disparity map, the second left feature map and the second right feature map, the second network comprising multiple dilated convolutional layers with different dilation rates.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc or the like.
The foregoing are merely preferred embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (15)

1. A disparity map acquisition method, characterized in that the method comprises:
acquiring multiple viewpoint images of a target spatial region;
for any image pair among the multiple viewpoint images, performing feature extraction on the left image and the right image of the image pair in turn based on a first network, the first network comprising multiple non-dilated convolutional layers, and obtaining the left feature maps and right feature maps output by the multiple non-dilated convolutional layers;
obtaining a first disparity map based on the first left feature map and the first right feature map output by the M-th non-dilated convolutional layer of the first network, M being a positive integer;
obtaining the second left feature map and the second right feature map output by the (M-1)-th non-dilated convolutional layer;
obtaining a second disparity map using a second network according to the first disparity map, the second left feature map and the second right feature map, the second network comprising multiple dilated convolutional layers with different dilation rates.
2. The method according to claim 1, characterized in that the method further comprises:
obtaining the third left feature map and the third right feature map output by the (M-2)-th non-dilated convolutional layer;
obtaining a third disparity map using the second network according to the second disparity map, the third left feature map and the third right feature map;
repeating the process of obtaining, using the second network, the disparity map corresponding to the previous non-dilated convolutional layer according to the left feature map and the right feature map output by the previous non-dilated convolutional layer and the disparity map corresponding to the following non-dilated convolutional layer, until the N-th non-dilated convolutional layer of the first network, N being a positive integer smaller than M;
using the disparity map corresponding to the N-th non-dilated convolutional layer as the final disparity map of the image pair.
3. The method according to claim 1, characterized in that obtaining the second disparity map using the second network according to the first disparity map, the second left feature map and the second right feature map comprises:
obtaining an intermediate feature map according to the first disparity map, the second left feature map and the second right feature map;
performing first processing on the intermediate feature map based on the second network to obtain the second disparity map.
4. The method according to claim 3, characterized in that performing the first processing on the intermediate feature map based on the second network to obtain the second disparity map comprises:
performing convolution operations on the intermediate feature map respectively based on multiple parallel dilated convolutional layers with different dilation rates, obtaining multiple feature tensors whose number is consistent with the number of the multiple dilated convolutional layers;
performing a convolution operation on the intermediate feature map based on a non-dilated convolutional layer in the second network, obtaining a first weight tensor, one channel of the first weight tensor representing the weight coefficients of the feature tensor output by the dilated convolutional layer of one dilation rate;
for each feature tensor among the multiple feature tensors, obtaining the first product result of the feature tensor and the first weight tensor;
performing an addition operation on all the obtained first product results to obtain a combined feature;
performing second processing on the combined feature to obtain a second weight tensor;
obtaining the second disparity map according to the combined feature and the second weight tensor.
5. The method according to claim 1, characterized in that obtaining the first disparity map based on the first left feature map and the first right feature map output by the M-th non-dilated convolutional layer of the first network comprises:
connecting the first left feature map and the first right feature map pixel by pixel with displacement, obtaining a first tensor;
performing dimension reduction on the first tensor to obtain the first disparity map.
6. The method according to claim 3, characterized in that obtaining the intermediate feature map according to the first disparity map, the second left feature map and the second right feature map comprises:
performing upsampling processing on the first disparity map;
performing mapping processing on the second right feature map according to the first disparity map, obtaining a mapped feature map;
connecting the upsampling result of the first disparity map, the second left feature map and the mapped feature map along the channel dimension to obtain the intermediate feature map.
7. The method according to claim 4, characterized in that obtaining the second disparity map according to the combined feature and the second weight tensor comprises:
obtaining the second product result of the combined feature and the second weight tensor;
performing dimension reduction on the second product result to obtain the second disparity map.
8. The method according to claim 4, characterized in that performing the second processing on the combined feature to obtain the second weight tensor comprises:
passing the combined feature successively through a global pooling layer, a first fully connected layer, a first activation layer, a second fully connected layer and a second activation layer to obtain the second weight tensor.
9. The method according to any one of claims 1 to 8, characterized in that the method further comprises:
obtaining the distance between the two cameras that shot the left image and the right image, and the focal length of the cameras;
obtaining the third product result of the distance and the focal length, and taking the ratio of the third product result to the disparity map corresponding to the N-th non-dilated convolutional layer of the first network as the depth values of the spatial objects contained in the target spatial region.
10. A disparity map acquisition apparatus, characterized in that the apparatus comprises:
a first acquisition module, configured to acquire multiple viewpoint images of a target spatial region;
an extraction module, configured to, for any image pair among the multiple viewpoint images, perform feature extraction on the left image and the right image of the image pair in turn based on a first network, the first network comprising multiple non-dilated convolutional layers;
a second acquisition module, configured to obtain the left feature maps and right feature maps output by the multiple non-dilated convolutional layers;
the second acquisition module being further configured to obtain a first disparity map based on the first left feature map and the first right feature map output by the M-th non-dilated convolutional layer of the first network, M being a positive integer;
the second acquisition module being further configured to obtain the second left feature map and the second right feature map output by the (M-1)-th non-dilated convolutional layer;
the second acquisition module being further configured to obtain a second disparity map using a second network according to the first disparity map, the second left feature map and the second right feature map, the second network comprising multiple dilated convolutional layers with different dilation rates.
11. The apparatus according to claim 10, characterized in that the second acquisition module is further configured to obtain an intermediate feature map according to the first disparity map, the second left feature map and the second right feature map, and to perform first processing on the intermediate feature map based on the second network to obtain the second disparity map.
12. The apparatus according to claim 11, characterized in that the second acquisition module is further configured to: perform convolution operations on the intermediate feature map respectively based on multiple parallel dilated convolutional layers with different dilation rates, obtaining multiple feature tensors whose number is consistent with the number of the multiple dilated convolutional layers; perform a convolution operation on the intermediate feature map based on a non-dilated convolutional layer in the second network, obtaining a first weight tensor, one channel of the first weight tensor representing the weight coefficients of the feature tensor output by the dilated convolutional layer of one dilation rate; for each feature tensor among the multiple feature tensors, obtain the first product result of the feature tensor and the first weight tensor; perform an addition operation on all the obtained first product results to obtain a combined feature; perform second processing on the combined feature to obtain a second weight tensor; and obtain the second disparity map according to the combined feature and the second weight tensor.
13. A storage medium, characterized in that at least one instruction is stored in the storage medium, the at least one instruction being loaded and executed by a processor to implement the disparity map acquisition method according to any one of claims 1 to 9.
14. A disparity map acquisition device, characterized in that the device comprises a processor and a memory, at least one instruction being stored in the memory, the at least one instruction being loaded and executed by the processor to implement the disparity map acquisition method according to any one of claims 1 to 9.
15. A disparity map acquisition system, characterized in that the system comprises a camera and a disparity map acquisition device;
the camera is configured to shoot a target spatial region to obtain multiple viewpoint images;
the disparity map acquisition device comprises a processor and a memory, at least one instruction being stored in the memory, the at least one instruction being loaded and executed by the processor to implement:
acquiring the multiple viewpoint images, and for any image pair among the multiple viewpoint images, performing feature extraction on the left image and the right image of the image pair in turn based on a first network, the first network comprising multiple non-dilated convolutional layers, and obtaining the left feature maps and right feature maps output by the multiple non-dilated convolutional layers;
obtaining a first disparity map based on the first left feature map and the first right feature map output by the M-th non-dilated convolutional layer of the first network, M being a positive integer;
obtaining the second left feature map and the second right feature map output by the (M-1)-th non-dilated convolutional layer;
obtaining a second disparity map using a second network according to the first disparity map, the second left feature map and the second right feature map, the second network comprising multiple dilated convolutional layers with different dilation rates.
CN201910243260.4A 2019-03-28 2019-03-28 Disparity map acquisition method and device, storage medium and equipment Active CN109978936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910243260.4A CN109978936B (en) 2019-03-28 2019-03-28 Disparity map acquisition method and device, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910243260.4A CN109978936B (en) 2019-03-28 2019-03-28 Disparity map acquisition method and device, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN109978936A true CN109978936A (en) 2019-07-05
CN109978936B CN109978936B (en) 2022-12-30

Family

ID=67081298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910243260.4A Active CN109978936B (en) 2019-03-28 2019-03-28 Disparity map acquisition method and device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN109978936B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590831A (en) * 2017-08-30 2018-01-16 电子科技大学 A kind of solid matching method based on deep learning
CN108648226A (en) * 2018-03-30 2018-10-12 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN108629291A (en) * 2018-04-13 2018-10-09 深圳市未来媒体技术研究院 A kind of face depth prediction approach of anti-grid effect
CN108898669A (en) * 2018-07-17 2018-11-27 网易(杭州)网络有限公司 Data processing method, device, medium and calculating equipment
CN109146937A (en) * 2018-08-22 2019-01-04 广东电网有限责任公司 A kind of electric inspection process image dense Stereo Matching method based on deep learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
JIA-REN CHANG et al.: "Pyramid Stereo Matching Network", 《CVPR 2018》 *
YUNCHAO WEI et al.: "Revisiting Dilated Convolution: A Simple Approach for Weakly- and Semi-Supervised Semantic Segmentation", 《ARXIV》 *
ZEQUN JIE et al.: "Left-Right Comparative Recurrent Model for Stereo Matching", 《ARXIV》 *
ZHENGFA LIANG et al.: "Learning for Disparity Estimation through Feature Constancy", 《CVPR 2018》 *
ZHU JUNPENG et al.: "Disparity Map Generation Technology Based on Convolutional Neural Networks", 《Journal of Computer Applications》 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021063119A1 (en) * 2019-10-01 2021-04-08 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Method and apparatus for image processing, terminal
CN110827312A (en) * 2019-11-12 2020-02-21 北京深境智能科技有限公司 Learning method based on cooperative visual attention neural network
CN110827312B (en) * 2019-11-12 2023-04-28 北京深境智能科技有限公司 Learning method based on cooperative visual attention neural network
CN110956505A (en) * 2019-12-04 2020-04-03 腾讯科技(深圳)有限公司 Advertisement inventory estimation method and related device
CN111488806A (en) * 2020-03-25 2020-08-04 天津大学 Multi-scale face recognition method based on parallel branch neural network
CN111583313A (en) * 2020-03-25 2020-08-25 上海物联网有限公司 Improved binocular stereo matching method based on PSmNet
CN111681177A (en) * 2020-05-18 2020-09-18 腾讯科技(深圳)有限公司 Video processing method and device, computer readable storage medium and electronic equipment
CN113810676A (en) * 2020-06-16 2021-12-17 佳能株式会社 Image processing apparatus, method, system, medium, and method of manufacturing learning model
CN111985551A (en) * 2020-08-14 2020-11-24 湖南理工学院 Stereo matching algorithm based on multiple attention networks
CN111985551B (en) * 2020-08-14 2023-10-27 湖南理工学院 Stereo matching algorithm based on multi-attention network
CN112070742A (en) * 2020-09-07 2020-12-11 杭州师范大学 Brain image classification device based on self-adaptive receptive field 3D space attention
CN112070742B (en) * 2020-09-07 2023-09-26 杭州师范大学 Brain image classification device based on self-adaptive receptive field 3D space attention
CN112907645A (en) * 2021-03-05 2021-06-04 重庆紫光华山智安科技有限公司 Disparity map acquisition method, disparity map acquisition device, disparity map training method, electronic device, and medium
CN112949504A (en) * 2021-03-05 2021-06-11 深圳市爱培科技术股份有限公司 Stereo matching method, device, equipment and storage medium
CN113132706A (en) * 2021-03-05 2021-07-16 北京邮电大学 Controllable position virtual viewpoint generation method and device based on reverse mapping
CN112949504B (en) * 2021-03-05 2024-03-19 深圳市爱培科技术股份有限公司 Stereo matching method, device, equipment and storage medium
CN117173691A (en) * 2023-07-24 2023-12-05 上海禹创工程顾问有限公司 Virtual article searching method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN109978936B (en) 2022-12-30

Similar Documents

Publication Publication Date Title
CN109978936A (en) Parallax picture capturing method, device, storage medium and equipment
CN108549863B (en) Human body gesture prediction method, apparatus, equipment and storage medium
US20200272825A1 (en) Scene segmentation method and device, and storage medium
EP3779883A1 (en) Method and device for repositioning in camera orientation tracking process, and storage medium
CN109712224A (en) Rendering method, device and the smart machine of virtual scene
KR20210111833A (en) Method and apparatus for acquiring positions of a target, computer device and storage medium
CN110059744A (en) Method, the method for image procossing, equipment and the storage medium of training neural network
CN110097576A (en) The motion information of image characteristic point determines method, task executing method and equipment
CN112287852B (en) Face image processing method, face image display method, face image processing device and face image display equipment
CN110599593B (en) Data synthesis method, device, equipment and storage medium
WO2022052620A1 (en) Image generation method and electronic device
CN110064200A (en) Object construction method, device and readable storage medium storing program for executing based on virtual environment
CN110135336A (en) Training method, device and the storage medium of pedestrian's generation model
CN110059652A (en) Face image processing process, device and storage medium
CN109583370A (en) Human face structure grid model method for building up, device, electronic equipment and storage medium
CN110290426A (en) Method, apparatus, equipment and the storage medium of showing resource
CN115526983A (en) Three-dimensional reconstruction method and related equipment
CN112308103B (en) Method and device for generating training samples
CN111784841A (en) Method, apparatus, electronic device, and medium for reconstructing three-dimensional image
CN110152293A (en) Manipulate the localization method of object and the localization method and device of device, game object
CN110728744B (en) Volume rendering method and device and intelligent equipment
CN111428551A (en) Density detection method, density detection model training method and device
CN110135329A (en) Method, apparatus, equipment and the storage medium of posture are extracted from video
CN114697516B (en) Three-dimensional model reconstruction method, apparatus and storage medium
CN112967261B (en) Image fusion method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant